Having run under this model, we found it hard to scale effectively. Routing already-seen and out-of-scope URLs via Kafka led to ABSOLUTELY MASSIVE QUEUES, which made the crawler very slow to resume.
In practice we need all workers to know the full scope (either by replicating the scope configuration or by consulting a 'Scope Oracle') and to perform the already-seen check (to drop duplicates).
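As an illustration only (not part of the original design), one way to share the already-seen check across workers is an atomic set-if-absent against a common store. The sketch below assumes Redis, with a hypothetical `dedup-store` host and key prefix:

```python
import redis

# Hypothetical shared store; every worker points at the same instance.
r = redis.Redis(host="dedup-store", port=6379)

def already_seen(url: str) -> bool:
    """Atomically mark a URL as seen; True if another worker got there first."""
    # SET ... NX only succeeds when the key does not exist yet, so exactly
    # one worker ever records a given URL as new.
    return r.set("seen:" + url, 1, nx=True) is None
```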
The current design performs scoping before enqueueing the discovered URLs in the 'to-crawl' queue. This does not work as expected when running distributed, because each H3 engine only has the scope configuration for the seeds passed to it.
The better plan is to enqueue all discovered URLs and let the receiver do the scoping. We could use a topic naming convention to manage these streams:
- `uris.requested` (where crawl launch requests go)
- `uris.discovered` (where all discovered URIs go)
- `uris.discarded` (where out-of-scope or otherwise discarded URIs go)
- `uris.to.crawl` (where in-scope URIs go, if we were to run the scoper as a separate process)

The receiver would subscribe to `uris.requested` and `uris.discovered` in the current design.

Additionally, it would be good to work out how to modify the candidate chain to redirect the out-of-scope URLs to a dedicated stream.
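A minimal sketch of how a receiver might tie these topics together, assuming kafka-python, JSON messages with a `url` field, and the hypothetical `in_scope()` / `already_seen()` helpers (the latter as sketched above):

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "kafka:9092"  # assumed broker address

def in_scope(url: str) -> bool:
    # Stand-in for the real scope check (replicated scope config or Scope Oracle).
    return url.startswith("https://www.example.com/")

def already_seen(url: str) -> bool:
    # Stand-in for a shared already-seen store (see the Redis sketch above).
    return False

# Consume both the launch requests and everything the crawlers discover.
consumer = KafkaConsumer(
    "uris.requested", "uris.discovered",
    bootstrap_servers=BROKERS,
    group_id="scoper",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

for message in consumer:
    candidate = message.value  # e.g. {"url": "https://www.example.com/page"}
    url = candidate["url"]
    if in_scope(url) and not already_seen(url):
        producer.send("uris.to.crawl", candidate)   # in-scope and new: crawl it
    else:
        producer.send("uris.discarded", candidate)  # out-of-scope or duplicate
```

One possible advantage of routing rejects to a real `uris.discarded` topic rather than silently dropping them is that scoping decisions stay observable, which is what redirecting the candidate chain's out-of-scope URLs to a dedicated stream would give us.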