ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Re-crawl logic causes over-crawling of sites with many page-level Targets #35

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

The launch_ts approach works well, but when we have a large Target with multiple individual page-level Targets (e.g. the BBC News homepage versus individual articles), the current implementation tends to over-crawl. For example, if a particular article is re-crawled, it sets a new launch_ts, and as this is inherited by discovered URLs, it causes the whole site to be re-crawled.

In practice, we probably only want to inherit the launch_ts for the homepage of each site.

  1. we could stop inheriting launch_ts altogether and rely on the recrawl sheet frequency
  2. only inherit launch_ts for URLs with no path (this will not work quite as expected if the whole host does not have a suitable record)
  3. as 2., but use additional data from W3ACT to spot the highest-level URL on a site. This will still fail when we want to crawl a subsection of a site at a different frequency.

In truth, we want SURT-prefix-scoped launch_ts values, i.e. rather than inheriting the launch_ts value directly, add a new mechanism that gets configured when the URL comes in. This is like the sheets mechanism, but sheets are quite difficult to use for this, as you'd need a sheet for every SURT prefix with the launch_ts set, and it seems likely that such a large number of sheets would not work reliably.

An alternative would be to create a new Processor that gets configured with the SURT-to-launch_ts mapping and applies the right one prior to scoping of candidates.
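
For illustration, a minimal sketch of the longest-prefix lookup such a processor would need. This is not the actual ukwa-heritrix implementation; the class name, example SURTs and timestamps are illustrative only:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Illustrative sketch only: maps SURT prefixes to launch timestamps and
 * returns the value for the longest matching prefix, which a candidate
 * processor could then apply before scoping.
 */
public class SurtLaunchTimestampMap {

    // Sorted map of SURT prefix -> launch_ts
    private final TreeMap<String, String> launchTsBySurtPrefix = new TreeMap<>();

    public void put(String surtPrefix, String launchTs) {
        launchTsBySurtPrefix.put(surtPrefix, launchTs);
    }

    /** Find the launch_ts for the longest SURT prefix of the given SURT, or null if none matches. */
    public String lookup(String surt) {
        // floorEntry gives the greatest key <= surt; walk back until we find a true prefix.
        Map.Entry<String, String> candidate = launchTsBySurtPrefix.floorEntry(surt);
        while (candidate != null) {
            if (surt.startsWith(candidate.getKey())) {
                return candidate.getValue();
            }
            candidate = launchTsBySurtPrefix.lowerEntry(candidate.getKey());
        }
        return null;
    }

    public static void main(String[] args) {
        SurtLaunchTimestampMap map = new SurtLaunchTimestampMap();
        map.put("http://(uk,co,bbc,www,)/", "20190501120000");
        map.put("http://(uk,co,bbc,www,)/news", "20190507090000");
        // An article under /news inherits the /news launch_ts, not the site-wide one:
        System.out.println(map.lookup("http://(uk,co,bbc,www,)/news/some-article"));
    }
}
```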

anjackson commented 5 years ago

Looking at the implementation, having a sheet for each Target (that needs one) seems like it should work fine. I've implemented a basic version that always creates a custom sheet for each URL.
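
Roughly, the idea is a per-Target sheet bound to the Target's SURT prefix. The sketch below uses the Heritrix 3 sheet API as it is typically driven from the scripting console; the overlay bean path and helper are assumptions, not the actual ukwa-heritrix code:

```java
import java.util.HashMap;
import java.util.Map;

import org.archive.crawler.spring.SheetOverlaysManager;
import org.archive.spring.Sheet;

/*
 * Hedged sketch, not the ukwa-heritrix implementation: create a named sheet
 * carrying a launch-timestamp overlay and associate it with the Target's
 * SURT prefix, so every URI under that prefix picks up the Target's launch_ts.
 * The overlay key "candidates.launchTimestamp" is illustrative only.
 */
public class TargetSheetSketch {

    public static void addTargetSheet(SheetOverlaysManager mgr, String sheetName,
            String surtPrefix, String launchTs) {
        Map<String, Object> overlays = new HashMap<>();
        overlays.put("candidates.launchTimestamp", launchTs); // illustrative bean path

        Sheet sheet = new Sheet();
        sheet.setName(sheetName);
        sheet.setMap(overlays);
        // (in a live job the sheet would also need priming against the application context)

        // Register the sheet and bind it to the Target's SURT prefix:
        mgr.getSheetsByName().put(sheetName, sheet);
        mgr.addSurtAssociation(surtPrefix, sheetName);
    }
}
```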

This is overkill for URLs that are just being submitted via nomination/patching, so it needs to happen only when needed, i.e. when it's a real Target launch; I'm not sure how best to indicate that. Perhaps the simplest approach is to have the user pass in a map of key-value pairs, but I'm not sure how to ensure we get all the value types right.

anjackson commented 5 years ago

Relying on JSON serialisation seems to be fine. The new implementation (b2f3862de1d04f13fb508da7fe8fce4b7e91921e) allows a custom targetSheet to be set when submitting a launch request, allowing arbitrary properties to be set at that level.
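
For illustration, a launch request carrying a targetSheet of arbitrary properties might be assembled like this. Only targetSheet is named above; the other field names, property keys and values are illustrative assumptions, not the exact ukwa-heritrix message format:

```java
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

/*
 * Hedged sketch: building a launch request whose "targetSheet" carries
 * arbitrary per-Target overrides. JSON (de)serialisation carries the value
 * types (strings, numbers, booleans) for us.
 */
public class LaunchRequestSketch {
    public static void main(String[] args) throws Exception {
        Map<String, Object> sheet = new LinkedHashMap<>();
        sheet.put("launchTimestamp", "20190507090000"); // illustrative property name
        sheet.put("scope.maxHops", 3);                  // illustrative property name

        Map<String, Object> request = new LinkedHashMap<>();
        request.put("url", "https://www.bbc.co.uk/news"); // illustrative
        request.put("isSeed", true);                      // illustrative
        request.put("targetSheet", sheet);

        System.out.println(new ObjectMapper().writeValueAsString(request));
    }
}
```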

We now have several overlapping mechanisms for controlling re-crawls.

We don't need all of these, but we do need a way of saying "get an updated version of this specific URL only" rather than re-crawling a whole SURT. So cleaning up the property-based version, and allowing that to take precedence, seems appropriate.
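
A tiny sketch of the intended precedence, with illustrative names only: a launch timestamp attached to the specific URL (e.g. supplied in the launch request) should win over the SURT-scoped sheet value, so "refresh this one URL" does not push a new launch_ts onto the whole site.

```java
/* Hedged sketch of the precedence rule; names are illustrative. */
public class LaunchTsPrecedence {

    public static String effectiveLaunchTs(String perUrlLaunchTs, String surtSheetLaunchTs) {
        // URL-specific value takes precedence; otherwise fall back to the SURT-level sheet.
        return (perUrlLaunchTs != null) ? perUrlLaunchTs : surtSheetLaunchTs;
    }

    public static void main(String[] args) {
        System.out.println(effectiveLaunchTs("20190507090000", "20190101120000")); // 20190507090000
        System.out.println(effectiveLaunchTs(null, "20190101120000"));             // 20190101120000
    }
}
```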

Not sure we need the recrawlInterval at all now.

anjackson commented 5 years ago

Minor crawl problems were observed due to skipping the disposition chain too early, which meant we were failing to clear the uriUniqFilter's memory of a URL when it was rejected by an OUT_OF_SCOPE decision. Fixed in d0a82ba, to be rolled into 2.4.6.
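
A toy illustration of the failure mode (not the Heritrix code itself): if the "already seen" record is not cleared when a URI is rejected as out of scope, that URI can never be enqueued again, even once it comes back into scope.

```java
import java.util.HashSet;
import java.util.Set;

/*
 * Toy model only: the uniqueness filter is reduced to a Set of URL strings.
 * If the reject path skips the "forget" step, a later (in-scope) discovery
 * of the same URL is silently dropped.
 */
public class UniqFilterForgetDemo {
    private final Set<String> alreadySeen = new HashSet<>();

    /** Returns true if the URL was enqueued. */
    public boolean offer(String url, boolean inScope, boolean forgetOnReject) {
        if (!alreadySeen.add(url)) {
            return false; // filter says we've already handled it
        }
        if (!inScope) {
            if (forgetOnReject) {
                alreadySeen.remove(url); // the fix: clear the filter so the URL can come back later
            }
            return false; // rejected, e.g. OUT_OF_SCOPE
        }
        return true; // enqueued
    }

    public static void main(String[] args) {
        UniqFilterForgetDemo buggy = new UniqFilterForgetDemo();
        buggy.offer("http://example.com/page", false, false);                     // rejected, but remembered
        System.out.println(buggy.offer("http://example.com/page", true, false));  // false: lost forever

        UniqFilterForgetDemo fixed = new UniqFilterForgetDemo();
        fixed.offer("http://example.com/page", false, true);                      // rejected and forgotten
        System.out.println(fixed.offer("http://example.com/page", true, true));   // true: enqueued
    }
}
```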

anjackson commented 5 years ago

Much better, though there are still some oddities. The less-than-24-hours re-crawl frequency now backfires a bit, permitting over-crawling. The launch-timestamp mechanism works better, so the recrawl-frequency mechanism should be removed or phased out.

Sites checked manually appear fine, except, oddly, BBC News. Currently investigating.

anjackson commented 5 years ago

Hm, it's acting like it's queued but at low priority. This again can easily happen with the short re-crawl frequency: the URL can get rediscovered via long hop paths, which get low priority, and then struggle to get to the front of the queue as new/unqueued seed launches come in.

anjackson commented 5 years ago

This morning I modified the launcher to avoid using the recrawl interval. Hopefully daily targets will be more reliable now.

anjackson commented 5 years ago

Working better, although some seeds did not start because they had already been rediscovered and queued with low priority.

This is perhaps the remaining problem. If target URLs get discovered and enqueued before being officially re-launched, then the re-launch doesn't get through (because the Frontier's unique-URI filter cache already has them). e.g. I just re-launched, and it re-consumed the crawl queue but re-used the scope, so the BBC News homepage may have been discovered and enqueued before the launch request came in.

For frequent crawling, it probably makes the most sense NOT to re-use the scope between crawls, as we're not re-using anything else. That should avoid the issue.

Alternatively, we could force seed requests to be added to the frontier even if they are already present. The existing filter does support that, if we set forceFetch when submitting URLs. For our seeds I think this makes sense, even though it may allow a small amount of over-crawling.
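
As a minimal sketch of what "set forceFetch when submitting" means, assuming the standard Heritrix 3 CrawlURI flags (how the resulting CrawlURI is then scheduled is left to the surrounding launch-handling code):

```java
import org.archive.modules.CrawlURI;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

/*
 * Hedged sketch: mark a submitted seed with forceFetch so the frontier's
 * already-included (uriUniqFilter) check does not drop it, even if the URL
 * was discovered and enqueued earlier via a long hop path.
 */
public class ForceSeedSketch {
    public static CrawlURI seedWithForceFetch(String url) throws Exception {
        UURI uuri = UURIFactory.getInstance(url);
        CrawlURI curi = new CrawlURI(uuri);
        curi.setSeed(true);        // treat as a seed launch
        curi.setForceFetch(true);  // bypass the "already seen" filter on scheduling
        return curi;
    }
}
```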

anjackson commented 5 years ago

Actually, as of 2.4.8, using forceFetch is a bad idea because it bypasses the RecentlySeen filter (i.e. it really does force a fetch). I'm modifying the RecentlySeen filter to make this optional (for our crawls, it doesn't make sense to force a fetch if the URL has been seen recently enough). So 2.4.9 should be good for using forceFetch to ensure URLs get enqueued into the frontier, even if the crawl turns out to be unnecessary later.
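
A hedged sketch of the intended 2.4.9 behaviour described above, not the actual RecentlySeen filter code: forceFetch should still get the URI enqueued past the uniqueness filter, but whether it also overrides the "seen recently enough" check becomes configurable.

```java
/*
 * Illustrative decision logic only; names and defaults are assumptions.
 */
public class RecentlySeenSketch {

    /** If false, forceFetch enqueues the URI but still defers to the recency check. */
    private boolean forceFetchOverridesRecentlySeen = false;

    public boolean shouldFetch(boolean forceFetch, long secondsSinceLastSeen,
            long recrawlIntervalSeconds) {
        if (forceFetch && forceFetchOverridesRecentlySeen) {
            return true; // pre-2.4.9 behaviour: really force the fetch
        }
        // Otherwise, only fetch if the last capture is older than the allowed interval:
        return secondsSinceLastSeen >= recrawlIntervalSeconds;
    }
}
```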

anjackson commented 5 years ago

Verified as now working fine.