Open anjackson opened 4 years ago
This Python Trie implementation would make a good 'backbone' for this kind of scope Oracle. Given a URL, and using urlcanon
to generate the SSURTs, it can find matching prefixes and use that to map URLs to scope rules.
This library seems better suited for tries: https://pypi.org/project/datrie/
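A rough sketch of the prefix-matching idea, using only the standard library so it stays self-contained. `surt_like_key` is a hypothetical, simplified stand-in for a real SSURT (which would come from urlcanon), and `ScopeTrie` is a hand-rolled dict-based trie rather than datrie; the rule values (`True` for in scope, `False` for excluded) are also just an assumption for illustration.

```python
from urllib.parse import urlsplit

_RULE = object()  # sentinel key marking "a rule ends at this node"


def surt_like_key(url):
    """Simplified stand-in for an SSURT: reversed host labels, then the path.

    e.g. http://www.example.co.uk/page -> "uk,co,example,www,/page"
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))) + "," + (parts.path or "/")


class ScopeTrie:
    """Character trie mapping key prefixes to scope rules."""

    def __init__(self):
        self._root = {}

    def add(self, prefix, rule):
        node = self._root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[_RULE] = rule

    def longest_prefix_rule(self, key):
        """Return the rule attached to the longest stored prefix of key, or None."""
        node = self._root
        best = node.get(_RULE)
        for ch in key:
            node = node.get(ch)
            if node is None:
                break
            if _RULE in node:
                best = node[_RULE]
        return best


def in_scope(trie, url):
    # No matching prefix (None) is treated as out of scope.
    return bool(trie.longest_prefix_rule(surt_like_key(url)))


trie = ScopeTrie()
trie.add("uk,co,example,", True)           # whole domain, including subdomains
trie.add("uk,co,example,/private", False)  # more specific exclusion

print(in_scope(trie, "http://www.example.co.uk/page"))
print(in_scope(trie, "http://example.co.uk/private/x"))
```

Because the host labels are reversed, a single prefix covers a domain and all its subdomains, and the longest matching prefix lets a more specific rule override a broader one.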
For external parties to know which URLs we can crawl, and hence what is worth posting to the `save` endpoint or what requires a new W3ACT record, we should allow the current permissible crawl scope to be queried. Essentially, `GET /in-scope?url=http://test.url` returns `true`/`false`.

n.b. this is similar to: https://github.com/ukwa/ukwa-heritrix/issues/37
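
To make the proposed query concrete, here is a minimal sketch of a `GET /in-scope?url=...` handler using Python's standard `http.server`. The scope check is a deliberately naive placeholder (a hard-coded tuple of URL prefixes); `IN_SCOPE_PREFIXES`, `is_in_scope`, `answer`, and `serve` are hypothetical names, and the status codes for bad requests are assumptions rather than anything decided in this issue.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit

# Placeholder scope: plain string prefixes, not real crawl scope rules.
IN_SCOPE_PREFIXES = ("http://test.url",)


def is_in_scope(url):
    # Naive stand-in for a lookup against the actual crawl scope.
    return url.startswith(IN_SCOPE_PREFIXES)


def answer(request_path):
    """Return (status, body) for a request path such as /in-scope?url=..."""
    parts = urlsplit(request_path)
    if parts.path != "/in-scope":
        return 404, "not found"
    urls = parse_qs(parts.query).get("url")
    if not urls:
        return 400, "missing url parameter"
    # json.dumps renders Python booleans as the literals true / false.
    return 200, json.dumps(is_in_scope(urls[0]))


class InScopeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = answer(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    # Blocks until interrupted; call explicitly to try the endpoint locally.
    HTTPServer(("127.0.0.1", port), InScopeHandler).serve_forever()
```

Keeping the request logic in `answer` means it can be exercised without starting a server; swapping `is_in_scope` for a real scope lookup would leave the handler unchanged.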