ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Scoping not being applied correctly. #15

Closed anjackson closed 5 years ago

anjackson commented 6 years ago

The current design applies the scope rules before enqueueing discovered URLs in the 'to-crawl' queue. This does not work as expected when running distributed, because each H3 engine only has the scope configuration for the seeds passed to it.

The better plan is to enqueue all discovered URLs and let the receiver do the scoping. We could use a topic naming convention to manage these streams:

The receiver would subscribe to uris.requested and uris.discovered in the current design.
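To make the receiver-side idea concrete, here is a minimal sketch in Python. It is not the ukwa-heritrix implementation: it uses no Kafka client, a plain list of `(topic, url)` pairs stands in for the two streams, and `IN_SCOPE_HOSTS`, `in_scope` and `receive` are hypothetical names. It also assumes that explicitly requested URLs bypass scoping, which the issue does not state.

```python
from urllib.parse import urlsplit

# Hypothetical scope held by the receiver (assumption for this sketch).
IN_SCOPE_HOSTS = {"example.org"}


def in_scope(url: str) -> bool:
    """Return True if the URL's host is an in-scope host or a subdomain of one."""
    host = urlsplit(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in IN_SCOPE_HOSTS)


def receive(messages):
    """Apply scoping on the receiving side.

    `messages` is an iterable of (topic, url) pairs standing in for the
    uris.requested and uris.discovered streams.
    """
    to_crawl = []
    for topic, url in messages:
        if topic == "uris.requested":
            # Assumption: requested URLs are accepted without a scope check.
            to_crawl.append(url)
        elif topic == "uris.discovered" and in_scope(url):
            to_crawl.append(url)
    return to_crawl
```

The point of the sketch is only that the sender enqueues everything and the scope decision is made once, by whoever holds the full scope.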

Additionally, it would be good to work out how to modify the candidate chain to redirect the out-of-scope URLs to a dedicated stream.
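As a rough illustration of redirecting rather than discarding, the sketch below partitions candidate URLs by topic. It is plain Python, not a Heritrix candidate-chain processor. The topic name `uris.out-of-scope` and the function `partition_candidates` are hypothetical, and the scope test is reduced to an exact host match against whatever scope the local engine happens to have.

```python
from urllib.parse import urlsplit


def partition_candidates(urls, local_scope_hosts):
    """Group candidate URLs by the topic they would be sent to.

    Nothing is dropped: URLs outside the locally known scope go to a
    dedicated (hypothetical) stream instead of being discarded.
    """
    routed = {"uris.discovered": [], "uris.out-of-scope": []}
    for url in urls:
        host = urlsplit(url).hostname or ""
        topic = "uris.discovered" if host in local_scope_hosts else "uris.out-of-scope"
        routed[topic].append(url)
    return routed
```

Keeping the out-of-scope URLs on their own stream would let them be inspected or re-scoped later without mixing them into the main crawl queue.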

anjackson commented 5 years ago

Having run under this model, we found it hard to scale effectively. Routing already-seen and out-of-scope URLs via Kafka led to ABSOLUTELY MASSIVE QUEUES, which made the crawler very slow to resume.

In practice we need all workers to know the full scope (either by replicating the scope or by consulting a 'Scope Oracle') and to perform the already-seen check (to drop duplicates).
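A minimal sketch of that worker-side filtering, in Python, might look like the following. The names `Worker`, `filter_candidates` and `host_scope_oracle` are hypothetical, the 'Scope Oracle' is modelled as a plain callable, and the already-seen set is an in-memory set local to one worker. A real deployment would need that state to be shared or partitioned across workers, which this sketch does not attempt.

```python
from urllib.parse import urlsplit


def host_scope_oracle(hosts):
    """Build a toy scope oracle: a callable that accepts URLs on the given hosts."""
    def oracle(url: str) -> bool:
        return (urlsplit(url).hostname or "") in hosts
    return oracle


class Worker:
    """Filters candidates locally so only new, in-scope URLs are enqueued."""

    def __init__(self, scope_oracle):
        self.scope_oracle = scope_oracle  # full scope, replicated or remote
        self.seen = set()                 # per-worker already-seen state (assumption)

    def filter_candidates(self, urls):
        keep = []
        for url in urls:
            if not self.scope_oracle(url):
                continue  # out of scope: dropped before it reaches any queue
            if url in self.seen:
                continue  # duplicate: dropped before it reaches any queue
            self.seen.add(url)
            keep.append(url)
        return keep
```

Because both checks happen before enqueueing, only URLs that will actually be crawled ever reach the queue.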