ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Allow separate Requested/Discovered/Accepted URI streams? #23

Open anjackson opened 5 years ago

anjackson commented 5 years ago

Should we allow the launch requests to be stored in a separate topic/log/stream from the log of discovered URIs?

To make things faster, when running a single crawler, we would directly enqueue all discovered URIs and log them to a stream that would only be used when we needed to rebuild the frontier. This would also help alleviate the problems we've seen where pausing/unpausing or restarting the crawler rewinds to the start of the Launch queue when it should not (but that's really a bug in the Kafka client, so we should resolve that anyway).

It does, however, complicate things if we still want to route discovered URIs via the stream rather than enqueuing them directly, e.g. for distributed crawling, or when using streams to send different requests to different crawl processes. In that case, we need to allow the receiver to listen to two streams, re-enable the redirection via the stream, and put this all behind a single configuration option.

anjackson commented 5 years ago

This should work well, and can be adapted to keep the distributed crawl URL routing separate from the crawl requests/launch stream.

However, in general we've found routing the raw discovered.uris stream is not really practical. Even for the frequent crawl it contains so many duplicates that it balloons everything and the overheads get too large. It is workable in a heavily distributed setup, e.g. if we ran a separate 'swarm' of Scopers.

Our original use case was simply to back up the contents of the Frontier, so we can re-construct it if the Heritrix state gets corrupted. For that, it is sufficient to stream out the accepted.uris, i.e. those that made it through the scope rules and were enqueued.

For the domain crawl, reading the discovered queue was very slow. An alternative is to allow scoping to happen locally, but re-route to a different crawler on the Candidates Chain (as the traditional HashCrawlMapper did). This could also be done at the start of the Fetch Chain, so that the overall system can cope if the number of members changes.

However, this only works if we can set up the routing so each crawler knows which URL range it owns, using the same routing keys that the routing mechanism uses (i.e. Kafka partitions). We currently explicitly manage partition assignment, but it would be preferable to rely on Kafka's built-in mechanisms and allow dynamic re-scaling.

Note that separating requested and discovered URL handling might need some additional logic to make sure seed URLs don't get crawled by every node.

URI Streams:

[Screenshot, 2019-03-11: diagram of the URI streams]

See here for source

In abstract terms, the Requested and the Discovered are processed by the Scoper to generate the Accepted (the former can modify the scope, the latter cannot). The Accepted are prioritized into the Frontier. The Frontier emits the URIs that are due to be crawled, and the crawler threads (Toe Threads in Heritrix) do the work and emit the Crawled stream.
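A toy sketch of that abstract model (class and method names are illustrative, not Heritrix API): Requested URIs can widen the scope, Discovered URIs are only checked against it, and both feed the Accepted stream that goes on to the Frontier.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of the streams above: Requested may modify the scope,
// Discovered may not; both can contribute to the Accepted stream.
public class StreamModel {
    private final Set<String> scopeHosts = new HashSet<>();
    final Queue<String> accepted = new ArrayDeque<>();

    private static String hostOf(String uri) {
        // Crude host extraction, good enough for this sketch.
        return uri.replaceFirst("^https?://", "").split("/", 2)[0];
    }

    // Requested: widens the scope (like a seed/launch request), then accepted.
    public void requested(String uri) {
        scopeHosts.add(hostOf(uri));
        accepted.add(uri);
    }

    // Discovered: accepted only if already in scope.
    public void discovered(String uri) {
        if (scopeHosts.contains(hostOf(uri))) {
            accepted.add(uri);
        }
    }
}
```

In a real crawl the Accepted queue would be prioritized into the Frontier rather than a plain FIFO, but the asymmetry between the two input streams is the point here.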

The discovered.uris stream would likely become more manageable if we could effectively de-duplicate it. However, this has to take into account that re-crawls are allowed, so it can't be too aggressive. This needs a very large cache or a discovered-uri database. For larger crawls, it's likely easier to leave it as-is and run lots of Scopers.
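One way to keep de-duplication "not too aggressive" is a bounded, recency-based cache: exact repeats seen recently are dropped, but entries age out, so re-crawls remain possible. A minimal sketch (names are hypothetical, not part of ukwa-heritrix):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded "recently discovered" cache: drops exact repeats while capping
// memory use. Older entries are evicted, so a URI can be re-discovered
// (and re-crawled) later. Illustrative only.
public class RecentUriCache {
    private final Map<String, Boolean> cache;

    public RecentUriCache(final int maxEntries) {
        // Access-ordered LinkedHashMap evicts the least-recently-used entry
        // once the size cap is exceeded.
        this.cache = new LinkedHashMap<>(maxEntries, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns true if this URI was seen recently (so it can be dropped).
    public synchronized boolean seenRecently(String uri) {
        return cache.put(uri, Boolean.TRUE) != null;
    }
}
```

For a domain crawl the cache would need to be very large to be useful, which is exactly the trade-off described above.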

anjackson commented 5 years ago

Right, I think I understand how to make this work. We can keep track of which partitions are assigned to which worker, and then in both the Candidate Chain and at the start of the Fetch Chain we have a processor that checks whether the CrawlURI should be handled by this crawl worker. If not, it's routed out to an accepted.uris.distribute topic, which will route those URIs to the crawl worker that will handle them, and that worker can pop them straight into its frontier. This will avoid routing all candidates, as all will have been scoped locally first. It will also cope if the set of workers changes, as partitions that have been assigned elsewhere will have their URIs re-routed at the start of the Fetch Chain.
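The two checkpoints could look something like this (a sketch: the class name, a simple hash-mod in place of Kafka's partitioner, and the string return values are all assumptions; only the accepted.uris.distribute topic name comes from the comment above):

```java
import java.util.Set;

// Sketch of the two routing checkpoints: one in the Candidate Chain and
// one at the start of the Fetch Chain. A URI is handled locally only if
// its routing key hashes to a partition this worker owns.
public class DistributeRouter {
    private final Set<Integer> assignedPartitions;
    private final int totalPartitions;

    public DistributeRouter(Set<Integer> assignedPartitions, int totalPartitions) {
        this.assignedPartitions = assignedPartitions;
        this.totalPartitions = totalPartitions;
    }

    // True if this worker owns the URI's partition; false means it should
    // be published to the accepted.uris.distribute topic instead.
    public boolean handleLocally(String routingKey) {
        int partition = Math.floorMod(routingKey.hashCode(), totalPartitions);
        return assignedPartitions.contains(partition);
    }

    // Candidate Chain: the URI has already passed local scoping, so either
    // enqueue it here or forward it to its owning worker.
    public String routeCandidate(String routingKey) {
        return handleLocally(routingKey) ? "frontier" : "accepted.uris.distribute";
    }

    // Fetch Chain entry: re-check ownership, so URIs enqueued before a
    // partition reassignment get re-routed rather than fetched by the
    // wrong worker.
    public String routeBeforeFetch(String routingKey) {
        return handleLocally(routingKey) ? "fetch" : "accepted.uris.distribute";
    }
}
```

In practice the hash would have to match whatever keying scheme the accepted.uris.distribute producer uses, so that the forwarding worker and the owning consumer agree on partition ownership.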

The main problem with this approach is that when URIs are NOT re-routed, they will only be stored in the Frontier, i.e. this model does not 'back up' the contents of the Frontier. So, if needed, that would have to be part of a wholly separate accepted.uris topic.