ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Switch to a 'Scope Oracle' model #37

Open anjackson opened 5 years ago

anjackson commented 5 years ago

Currently, we need to replicate the current crawl scope(s) in multiple places. Managing and maintaining the scope becomes rather cumbersome under distributed crawling, and it would also be useful to consult the scope from the access side, as well as during crawling and as part of W3ACT.

In principle, we could have a distinct REST API service that holds the current crawl scopes (NPLD and BY-PERMISSION). W3ACT would consult it, and scope changes made in W3ACT would be pushed to it. All the crawlers would consult it rather than each keeping their own copy. A single replica of the service could be used for access/frontend services as needed.
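To make the idea a bit more concrete, here is a rough sketch of what such a service might look like. Everything in it is an assumption for illustration only: the `ScopeOracle` class, the `make_app` helper, the `url` query parameter and the JSON response shape are all hypothetical, and the plain host-plus-path prefix match is a stand-in for whatever scoping rule (probably SURT prefixes) the real thing would use.

```python
import json
from urllib.parse import parse_qs, urlsplit

SCOPES = ("NPLD", "BY-PERMISSION")  # the two scope names mentioned above


class ScopeOracle:
    """Hypothetical in-memory holder of the current crawl scopes."""

    def __init__(self):
        self._prefixes = {name: set() for name in SCOPES}

    def add(self, scope, url_prefix):
        # W3ACT would call this when a target is added to a scope.
        self._prefixes[scope].add(self._normalise(url_prefix))

    def remove(self, scope, url_prefix):
        # W3ACT would call this when a target is taken out of a scope.
        self._prefixes[scope].discard(self._normalise(url_prefix))

    def scopes_for(self, url):
        # Crawlers and access services ask this instead of keeping their own copy.
        candidate = self._normalise(url)
        return sorted(
            name
            for name, prefixes in self._prefixes.items()
            if any(candidate.startswith(prefix) for prefix in prefixes)
        )

    @staticmethod
    def _normalise(url):
        # Simplification: ignore the scheme, lowercase the host, keep the path.
        parts = urlsplit(url)
        return parts.netloc.lower() + (parts.path or "/")


def make_app(oracle):
    """Wrap the oracle in a tiny WSGI app answering GET ?url=... with JSON."""

    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        url = query.get("url", [""])[0]
        body = json.dumps({"url": url, "scopes": oracle.scopes_for(url)}).encode("utf-8")
        headers = [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ]
        start_response("200 OK", headers)
        return [body]

    return app
```

For local experiments the app could be served with the standard library's `wsgiref.simple_server.make_server`. A real deployment would also need persistence and some way of keeping the read-only replica in sync, neither of which is attempted here.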

Of course, this is quite a big bit of work, so recording it here, but putting it on the back-burner for a while. Need to settle in with the current new model first!