ukwa / ukwa-access-api

An application to wrap up APIs for accessing UKWA content.
Apache License 2.0

Scope queries via API #8

Open anjackson opened 4 years ago

anjackson commented 4 years ago

For external parties to know which URLs we can crawl, and hence what is worth posting to the save endpoint or what requires a new W3ACT record, we should allow the current permissible crawl scope to be queried.

Essentially, GET /in-scope?url=http://test.url returns true/false.
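A rough sketch of how such an endpoint might behave, independent of any web framework. The `IN_SCOPE_HOSTS` set, `is_in_scope` and `handle_in_scope` are hypothetical names used only for illustration, and the host-set check is a placeholder for the real scope rules. The error responses for a missing parameter are also an assumption.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Hypothetical stand-in for the real crawl scope: a fixed set of in-scope hosts.
IN_SCOPE_HOSTS = {"test.url"}


def is_in_scope(url):
    """Placeholder scope check: true if the URL's host is in the set above."""
    host = urlsplit(url).hostname
    return host is not None and host in IN_SCOPE_HOSTS


def handle_in_scope(request_target):
    """Return (status, JSON body) for a target like '/in-scope?url=...'."""
    parts = urlsplit(request_target)
    if parts.path != "/in-scope":
        return 404, json.dumps({"error": "not found"})
    values = parse_qs(parts.query).get("url")
    if not values:
        # Assumed behaviour: a missing 'url' parameter is a client error.
        return 400, json.dumps({"error": "missing url parameter"})
    # The body is the JSON literal true or false.
    return 200, json.dumps(is_in_scope(values[0]))


if __name__ == "__main__":
    print(handle_in_scope("/in-scope?url=http://test.url"))
    print(handle_in_scope("/in-scope?url=http://example.org/"))
```

Clients should percent-encode the `url` value if it contains its own query string, otherwise the inner `&` and `=` characters will be parsed as separate parameters.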

n.b. this is similar to: https://github.com/ukwa/ukwa-heritrix/issues/37

anjackson commented 4 years ago

This Python Trie implementation would make a good backbone for this kind of scope oracle. Given a URL, urlcanon can generate its SSURT; the trie can then find the stored prefixes that match that SSURT, and those prefixes map the URL to scope rules.
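To make the idea concrete, here is a minimal sketch using only the standard library. It is not the linked trie implementation, and `ssurt_like_key` is a simplified, hypothetical stand-in for what urlcanon would produce (it ignores ports, userinfo and query strings). `ScopeTrie`, `scope_rule_for` and the rule labels are likewise made up for illustration; the "longest matching prefix wins" policy is an assumption.

```python
from urllib.parse import urlsplit

_RULE = object()  # sentinel key marking "a rule is stored at this node"


def ssurt_like_key(url):
    """Simplified SSURT-style key: reversed host labels, scheme, then path.

    Assumption: this only approximates the real SSURT form.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{reversed_host}//{parts.scheme}:{parts.path or '/'}"


class ScopeTrie:
    """Character trie mapping key prefixes to scope rules."""

    def __init__(self):
        self._root = {}

    def add(self, prefix, rule):
        node = self._root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[_RULE] = rule

    def matching_prefixes(self, key):
        """Return (prefix, rule) pairs for stored prefixes of key, shortest first."""
        matches = []
        node = self._root
        if _RULE in node:
            matches.append(("", node[_RULE]))
        for i, ch in enumerate(key):
            node = node.get(ch)
            if node is None:
                break
            if _RULE in node:
                matches.append((key[: i + 1], node[_RULE]))
        return matches


def scope_rule_for(trie, url):
    """Assumed policy: the longest matching prefix decides; None if no match."""
    matches = trie.matching_prefixes(ssurt_like_key(url))
    return matches[-1][1] if matches else None


if __name__ == "__main__":
    trie = ScopeTrie()
    trie.add("uk,", "in-scope")  # everything under .uk
    trie.add("uk,co,example,//http:/private/", "excluded")
    print(scope_rule_for(trie, "http://example.co.uk/index.html"))
    print(scope_rule_for(trie, "http://example.co.uk/private/page"))
    print(scope_rule_for(trie, "http://example.com/"))
```

Because the host labels are reversed, a single short prefix such as `uk,` covers a whole domain suffix, while longer prefixes can override it for specific hosts or paths.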

anjackson commented 3 years ago

This seems like a better option for tries: https://pypi.org/project/datrie/

https://pypi.org/project/urlcanon/