ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Synchronise crawl scope across all Heritrix workers #17

Closed. anjackson closed this issue 5 years ago.

anjackson commented 5 years ago

We have experimented with leaving all discovered URLs in the tocrawl stream. This will probably work okay for the frequent crawls, but due to the large number of out-of-scope URLs and duplicates, the queue rapidly becomes far too large in the domain crawl.

We could attempt to just deduplicate the tocrawl stream, but this could conflict with the intentional recrawling of URLs. In the current model, the simplest approach is to apply the full set of scope rules before emitting the discovered URLs.

The problem here is that, due to the way work is distributed across the H3 instances, the crawl scopes are inconsistent across nodes. For example, if we launch a crawl and mark a URL as a seed, the instance that crawls that host will add the URL to its scope, widening it. The other nodes don't see the seed and so don't do this, which in turn means that if one of those other nodes discovers URLs on that host, it will erroneously discard them from the crawl.

In the very short term, for the domain crawl, the scope can be fixed for the duration of the crawl.

For dynamic crawls, to keep the scope consistent across the nodes, it would probably make most sense for the scope to be held outside the nodes, in a remote database. However, that's a fairly big leap from where we are right now, both in terms of crawl life-cycle management and because it means adding yet another component to the system.

An alternative strategy would be to add a KafkaCrawlConfigReceiver, running on every worker, each reading the same single-partition crawl.config queue. When the current KafkaUrlReceiver picks up a seed, it could post a message to the crawl.config queue, then handle the seed as normal. The KafkaCrawlConfigReceiver instances would then pick up this message and grow the scope as required, without enqueueing the URL (i.e. by modifying the DecideRule, via an autowired connection).

This avoids adding any new external system, and ensures crawl launch is still a single action, but does not cope well when we want to remove a site from the crawl scope.
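For illustration, here is a minimal sketch of what the consumer side of a KafkaCrawlConfigReceiver might look like, assuming a single-partition `crawl.config` topic and a hypothetical `ScopeUpdater` hook onto the autowired DecideRule (the real wiring into Heritrix and the KafkaUrlReceiver would differ):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/**
 * Sketch of a receiver that keeps every worker's scope in sync by consuming
 * scope-change messages from a shared 'crawl.config' topic.
 */
public class KafkaCrawlConfigReceiver implements Runnable {

    /** Hypothetical hook onto the autowired DecideRule. */
    public interface ScopeUpdater {
        void addSurtPrefix(String surtPrefix);
    }

    private final KafkaConsumer<String, String> consumer;
    private final ScopeUpdater scope;

    public KafkaCrawlConfigReceiver(String bootstrapServers, String groupId, ScopeUpdater scope) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Each worker gets its own group id, so every instance sees every message:
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        this.consumer = new KafkaConsumer<>(props);
        this.scope = scope;
    }

    @Override
    public void run() {
        try {
            consumer.subscribe(Collections.singletonList("crawl.config"));
            while (!Thread.currentThread().isInterrupted()) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Grow the scope, but do NOT enqueue the URL here: the worker that
                    // received the original seed message handles the enqueueing.
                    scope.addSurtPrefix(record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}
```

The distinct consumer group id per worker is the important design point: it is what makes every instance see every scope-change message rather than sharing them out.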

The simplest shared external component would be a WatchedSurtFile. This could be updated externally, or from the crawl, and could be re-read quickly. The main constraint is that it has to be held outside of the job folder, so it can be cross-mounted and made available for every node.

Having tested this, it seems to work fine - we can mount an alternative path to a SURT file and it gets reloaded. For the frequent crawl, we can also get a Luigi job to re-create this file periodically. This seems the simplest option, and should work well as a shared file distributed via GlusterFS.
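For reference, the watched-file pattern boils down to something like the following sketch, polling the file's modification time and re-reading the SURT prefixes when it changes (class and method names here are hypothetical; the actual reload mechanism in the crawler may differ):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/**
 * Sketch of a watched SURT file: poll the shared file's modification time
 * and re-read the prefixes whenever it changes.
 */
public class WatchedSurtFile {

    private final Path surtFile;
    private long lastModified = -1L;

    public WatchedSurtFile(String path) {
        // e.g. a path on the cross-mounted GlusterFS volume, outside the job folder
        this.surtFile = Paths.get(path);
    }

    /** Returns the current prefixes if the file changed since the last call, or null if unchanged. */
    public List<String> checkForUpdate() throws IOException {
        long modified = Files.getLastModifiedTime(surtFile).toMillis();
        if (modified == lastModified) {
            return null; // no change since last check
        }
        lastModified = modified;
        return Files.readAllLines(surtFile, StandardCharsets.UTF_8);
    }
}
```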

anjackson commented 5 years ago

This relies on the fact that short file appends are atomic, which should work okay with GlusterFS as long as write-behind is not in use.
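As a sketch, the append itself can be kept to a single short write so it lands as one operation (the path and helper name below are hypothetical):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SurtAppender {
    /** Append one SURT prefix as a single short write, relying on O_APPEND semantics. */
    public static void appendSurt(String surtFilePath, String surtPrefix) throws IOException {
        byte[] line = (surtPrefix + "\n").getBytes(StandardCharsets.UTF_8);
        Files.write(Paths.get(surtFilePath), line,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }
}
```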

anjackson commented 5 years ago

Reading around a bit more, it seems Gluster should be fine if fcntl/POSIX locks are used, which the Java FileLock implementation should honour.
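A minimal sketch of guarding the append with a Java FileLock, which maps onto fcntl/POSIX locks on Linux (the helper name is hypothetical, and this is not necessarily what the actual change does):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LockedSurtAppender {
    /** Append one line while holding an exclusive lock on the shared file. */
    public static void appendWithLock(String surtFilePath, String surtPrefix) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(surtFilePath),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            try (FileLock lock = channel.lock()) { // exclusive whole-file lock, released on close
                channel.write(ByteBuffer.wrap((surtPrefix + "\n").getBytes(StandardCharsets.UTF_8)));
            }
        }
    }
}
```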

anjackson commented 5 years ago

Added a suitable lock in 4fff329420cc456cb2d1d440f8fc27eb1b96fa39

anjackson commented 5 years ago

Looks good.