ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Add white-list/black-list support #13

Open anjackson opened 6 years ago

anjackson commented 6 years ago

The W3ACT definition includes the idea of regular expressions for white-listing additional URLs into the crawl and black-listing URLs that should be excluded from it. The current launch mechanism does not provide a way to pass those in or register them.

Option 0, the very simplest approach, is to update the whitelist and blacklist text files based on W3ACT data prior to launch. This would work with the current crawler but not with the newer scalable crawler, which is designed to operate pretty much continuously.
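A minimal sketch of Option 0, assuming the job reads its scope from plain text files under the job directory (the file paths and the hard-coded regexes are illustrative assumptions, standing in for a W3ACT API lookup):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RefreshScopeFiles {
    public static void main(String[] args) throws Exception {
        // In reality these regexes would be fetched from W3ACT;
        // they are hard-coded here purely for illustration.
        List<String> whitelist = List.of("https?://(www\\.)?example\\.org/.*");
        List<String> blacklist = List.of("https?://ads\\.example\\.com/.*");

        // Overwrite the text files the current crawl job reads at launch.
        // These paths are assumptions, not the actual job layout.
        Files.write(Path.of("jobs/frequent/whitelist.txt"), whitelist, StandardCharsets.UTF_8);
        Files.write(Path.of("jobs/frequent/blacklist.txt"), blacklist, StandardCharsets.UTF_8);
    }
}
```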

Option 1 is to use the existing blacklist/whitelist logic (i.e. org.archive.modules.deciderules.MatchesListRegexDecideRule instances). When whitelist/blacklist requests come in, the handler updates the local beans. This means the lists are global to the whole crawl on that crawler.
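As a rough sketch of Option 1 (assuming the standard `regexList` property on `MatchesListRegexDecideRule`; the handler class and its wiring are hypothetical), incoming requests would simply mutate the shared rule beans:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.archive.modules.deciderules.MatchesListRegexDecideRule;

/**
 * Hypothetical handler for Option 1: whitelist/blacklist requests mutate the
 * crawl-global DecideRule beans in place.
 */
public class RegexListUpdater {

    private final MatchesListRegexDecideRule whitelistRule; // decision = ACCEPT
    private final MatchesListRegexDecideRule blacklistRule; // decision = REJECT

    public RegexListUpdater(MatchesListRegexDecideRule whitelistRule,
                            MatchesListRegexDecideRule blacklistRule) {
        this.whitelistRule = whitelistRule;
        this.blacklistRule = blacklistRule;
    }

    /** Append a regex to the whitelist rule shared by the whole crawl. */
    public synchronized void addToWhitelist(String regex) {
        append(whitelistRule, regex);
    }

    /** Append a regex to the blacklist rule shared by the whole crawl. */
    public synchronized void addToBlacklist(String regex) {
        append(blacklistRule, regex);
    }

    private static void append(MatchesListRegexDecideRule rule, String regex) {
        // Copy-then-set rather than mutating in place, so the rule always
        // holds a consistent list even while URIs are being evaluated.
        List<Pattern> patterns = new ArrayList<>(rule.getRegexList());
        patterns.add(Pattern.compile(regex));
        rule.setRegexList(patterns);
    }
}
```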

Option 2 is the same, but somehow associates the lists with a seed/source or sheet SURT. This means the whitelists and blacklists can be made to operate only in the context of the URLs found via a particular seed. It is not clear whether this is a large advantage or not!

However, these last two options do not work with the new scaling method: discovered URLs are delivered to crawler instances based on a key derived from the target URL, so if the white/blacklists refer to hosts other than the seed's, the crawler that receives those URLs probably won't know about the whitelist.
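To make the routing problem concrete, here is a toy illustration of key-based delivery (not the actual ukwa-heritrix routing code): the destination is derived from the target URL's host, so scope rules registered on the instance that owns the seed's host are never consulted for off-host URLs.

```java
import java.net.URI;

public class RoutingSketch {
    /** Which crawler instance a discovered URL would be delivered to. */
    static int crawlerFor(String targetUrl, int numCrawlers) {
        String host = URI.create(targetUrl).getHost();
        return Math.floorMod(host.hashCode(), numCrawlers);
    }

    public static void main(String[] args) {
        // The seed's host and an off-host discovered URL will usually hash to
        // different instances, so instance-local scope rules don't travel
        // with the URL.
        System.out.println(crawlerFor("https://seed.example.org/", 4));
        System.out.println(crawlerFor("https://cdn.example.net/asset.js", 4));
    }
}
```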

One alternative would be to separate the crawl streams for crawl instructions versus discovered URLs, but this would be complicated: a simple implementation would mean all crawlers fetched all seeds, i.e. if we used a shared crawl-job stream rather than putting everything into the distributed to-crawl stream. This would need a separate Kafka listener that set up the crawl configuration as instructed, but then just passed the URL on to the to-crawl stream. This is actually a quite reasonable set-up, but requires a fair amount of work.
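A sketch of that set-up, assuming Kafka topic names `crawl-jobs` and `to-crawl` and a local `applyScope()` hook (all illustrative, not the actual stream names or API):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CrawlJobListener implements Runnable {

    private final KafkaConsumer<String, String> consumer;
    private final KafkaProducer<String, String> producer;

    public CrawlJobListener(Properties consumerProps, Properties producerProps) {
        this.consumer = new KafkaConsumer<>(consumerProps);
        this.producer = new KafkaProducer<>(producerProps);
    }

    @Override
    public void run() {
        // Every crawler instance consumes the whole crawl-jobs topic, so all
        // of them learn the scope; only the to-crawl topic is partitioned.
        consumer.subscribe(List.of("crawl-jobs"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                applyScope(record.value());
                // Forward the seed URL so it gets fetched by whichever
                // instance owns its partition in the to-crawl stream.
                producer.send(new ProducerRecord<>("to-crawl", record.key(), record.value()));
            }
        }
    }

    private void applyScope(String crawlJobJson) {
        // Hypothetical hook: parse the job message and update the local
        // DecideRule beans / sheets accordingly.
    }
}
```

For every instance to see every job message, each crawler would need its own consumer group; and since each instance would then forward the same seed, the frontier's already-seen filtering would be relied on to collapse the duplicates.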

A different approach would be to have a separate Scope Oracle, i.e. host some crawl configuration as a separate service and pull it in as needed. But that only really moves the problem to a separate component, and isn't much of an advantage unless it's part of a full de-coupling of the crawl modules, i.e. introducing a discovered-uris stream and running a separate process that scopes the URLs and passes them on to the to-crawl stream.

In summary, a crawl-job launch mechanism is probably the best approach in the nearish term. We could base it on Brozzler's job configuration, and every crawler instance would use it to set/update the crawler configuration. Because passing the URL on to the right instance without duplication is difficult, we could just make it a two-step launch, i.e. send the crawl-job configuration first, and then send in the to-crawl URL a little while later?
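A minimal sketch of that two-step launch from the submitter's side (the topic names, delay, and message bodies are all assumptions):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TwoStepLaunch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Step 1: publish the crawl-job configuration so every instance
            // can set up its scope before any URL arrives.
            producer.send(new ProducerRecord<>("crawl-jobs", "job-123",
                    "{\"seed\":\"https://www.example.org/\"}")).get();

            // Step 2: a little while later, submit the seed itself to the
            // partitioned to-crawl stream.
            Thread.sleep(30_000);
            producer.send(new ProducerRecord<>("to-crawl", "https://www.example.org/",
                    "{\"url\":\"https://www.example.org/\"}")).get();
        }
    }
}
```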

Longer term, the Scope Oracle is probably a better approach. It would update itself based on the latest job configurations from W3ACT/wherever, and could be plumbed into H3 as a REST service or as a separate discovered-uri stream consumer.
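One way the REST flavour could be plumbed into H3 is a DecideRule that defers to the oracle. A rough sketch, assuming the standard `PredicatedDecideRule.evaluate()` extension point; the endpoint, response format, and fail-closed behaviour are assumptions, and a real version would need caching and more careful error handling:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.PredicatedDecideRule;

public class ScopeOracleDecideRule extends PredicatedDecideRule {

    private static final long serialVersionUID = 1L;

    private final HttpClient client = HttpClient.newHttpClient();

    // Hypothetical service endpoint answering "true"/"false" for a URL.
    private String oracleEndpoint = "http://scope-oracle:8080/in-scope";

    public void setOracleEndpoint(String oracleEndpoint) {
        this.oracleEndpoint = oracleEndpoint;
    }

    @Override
    protected boolean evaluate(CrawlURI curi) {
        try {
            String query = URLEncoder.encode(curi.getURI(), StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(oracleEndpoint + "?url=" + query))
                    .GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return "true".equalsIgnoreCase(response.body().trim());
        } catch (IOException | InterruptedException e) {
            // Fail closed: if the oracle is unreachable, treat the URI as out of scope.
            return false;
        }
    }
}
```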

anjackson commented 5 years ago

Now that we support defining the scope via crawl command messages, we can do this easily enough for individual URLs or SURT prefixes. However, to support e.g. regexes too, we probably need to shift to a 'Scope Oracle' approach.

anjackson commented 2 years ago

Note that any syntax for the crawl feed should match that of https://github.com/webrecorder/browsertrix-crawler, using include:REGEX and exclude:REGEX.
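For example (a hedged sketch, since the surrounding crawl-feed message schema is not fixed here), the scope entries carried by a crawl command might look like:

```java
import java.util.List;

// Scope entries reusing the browsertrix-crawler prefixes, so the same strings
// would work in either crawler. The values themselves are illustrative.
List<String> scope = List.of(
        "include:https?://(www\\.)?example\\.org/.*",
        "exclude:https?://(www\\.)?example\\.org/private/.*");
```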