Having run under this model, we found it hard to scale effectively. Routing already-seen and out-of-scope URLs via Kafka led to ABSOLUTELY MASSIVE QUEUES, which made the crawler very slow to resume.
In practice we need all workers to know the full scope (either by replicating the scope configuration or by consulting a 'Scope Oracle') and to perform the already-seen check (to drop duplicates).
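As an illustration only (not part of the original design), one way to share the already-seen check across workers is an atomic set-if-absent against a common store. The sketch below assumes Redis, with a hypothetical `dedup-store` host and key prefix:

```python
import redis

# Hypothetical shared store; every worker points at the same instance.
r = redis.Redis(host="dedup-store", port=6379)

def already_seen(url: str) -> bool:
    """Atomically mark a URL as seen; True if another worker got there first."""
    # SET ... NX only succeeds when the key does not exist yet, so exactly
    # one worker ever records a given URL as new.
    return r.set("seen:" + url, 1, nx=True) is None
```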
The current design performs scoping before enqueueing the discovered URLs in the 'to-crawl' queue. This does not work as expected when running distributed, because each H3 engine only has the scope configuration for the seeds passed to it.
The better plan is to enqueue all discovered URLs and let the receiver do the scoping. We could use a topic naming convention to manage these streams:
- `uris.requested` (where crawl launch requests go)
- `uris.discovered` (where all discovered URIs go)
- `uris.discarded` (where out-of-scope or otherwise discarded URIs go)
- `uris.to.crawl` (where in-scope URIs go, if we were to run the scoper as a separate process)

The receiver would subscribe to `uris.requested` and `uris.discovered` in the current design.

Additionally, it would be good to work out how to modify the candidate chain to redirect the out-of-scope URLs to a dedicated stream.
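A minimal sketch of how a receiver might tie these topics together, assuming kafka-python, JSON messages with a `url` field, and the hypothetical `in_scope()` / `already_seen()` helpers (the latter as sketched above):

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "kafka:9092"  # assumed broker address

def in_scope(url: str) -> bool:
    # Stand-in for the real scope check (replicated scope config or Scope Oracle).
    return url.startswith("https://www.example.com/")

def already_seen(url: str) -> bool:
    # Stand-in for a shared already-seen store (see the Redis sketch above).
    return False

# Consume both the launch requests and everything the crawlers discover.
consumer = KafkaConsumer(
    "uris.requested", "uris.discovered",
    bootstrap_servers=BROKERS,
    group_id="scoper",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

for message in consumer:
    candidate = message.value  # e.g. {"url": "https://www.example.com/page"}
    url = candidate["url"]
    if in_scope(url) and not already_seen(url):
        producer.send("uris.to.crawl", candidate)   # in-scope and new: crawl it
    else:
        producer.send("uris.discarded", candidate)  # out-of-scope or duplicate
```

One possible advantage of routing rejects to a real `uris.discarded` topic rather than silently dropping them is that scoping decisions stay observable, which is what redirecting the candidate chain's out-of-scope URLs to a dedicated stream would give us.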