ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Cope better with partially-failed Web Render events #33

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

Occasionally, due to edge cases or load peaks, the Heritrix engine can think the web rendering failed on the first pass, but in fact it just didn't complete cleanly or within the time-out.

Because it appeared to fail, H3 cannot extract the links to enqueue them. Instead, it defers the URL and retries the download, the intention being to fall back to FetchHTTP. But in these specific cases the RecentlySeenDecideRule spots that the URL has already been captured (because warcprox recorded it as such during the successful part of the download), so it is discarded (-5000) rather than re-crawled.
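Roughly, the gate behaves like this. This is a simplified, hypothetical sketch in the shape of a Heritrix 3 DecideRule, not the actual RecentlySeenDecideRule (whose lookup is backed by the shared capture store that warcprox updates); it just illustrates why the retry comes back rejected:

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.DecideRule;

/**
 * Simplified illustration only. If the URI already looks captured (because
 * warcprox logged it during the partially-successful render), the retry is
 * REJECTed rather than re-fetched, so its outlinks are never extracted.
 */
public class SimplifiedRecentlySeenRule extends DecideRule {

    private static final long serialVersionUID = 1L;

    @Override
    protected DecideResult innerDecide(CrawlURI curi) {
        if (wasRecentlyCaptured(curi.getURI())) {
            return DecideResult.REJECT; // logged as -5000 in the crawl log
        }
        return DecideResult.NONE;
    }

    private boolean wasRecentlyCaptured(String uri) {
        // Placeholder for the real lookup against the shared capture record.
        return false;
    }
}
```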

Unfortunately, this means that the outlinks are never captured. If no other process happens across those links, we don't get anything else from the site.

😞

It's not clear there's a huge amount we can do about this within Heritrix itself. We've coupled the processes together like this on purpose, to avoid sites getting crawled multiple times by multiple crawlers, so decoupling them isn't really an option.

One possibility would be to add to or extend our warcprox modules so they can post links to a queue. But this means putting quite a bit of logic into warcprox that doesn't really belong there.

It's somewhat related to #28, in that we want to make sure we extract links from the rendered DOM, which we can't get hold of very easily.

Another idea would be a post-crawl QA/checking process that scans what has happened to seeds, looks for outlinks at the same time, and posts them to Kafka for download.
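As a rough sketch of that idea, the QA/checking process could end with a small Kafka producer step along these lines. The topic name, message shape and broker address here are assumptions for illustration, not necessarily what the existing crawl feed uses:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OutlinkReposter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String seed = "https://www.bl.uk/";
            // Outlinks the QA scan found missing for this seed (example data).
            String[] missedOutlinks = { "https://www.bl.uk/collections" };
            for (String outlink : missedOutlinks) {
                // Minimal JSON message; the real crawl-feed schema may differ.
                String message = String.format(
                        "{\"url\":\"%s\",\"parentUrl\":\"%s\",\"hop\":\"L\"}",
                        outlink, seed);
                producer.send(new ProducerRecord<>("uris.tocrawl", seed, message));
            }
        }
    }
}
```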

I wonder whether it would make sense to crawl a 'pretend' website that makes it easier to pipe these things through Heritrix. e.g. we hit a seed https://www.bl.uk/, and after we deal with it, we enqueue a 'pretend' URL that gives us access to the onreadydom or, failing that, the original response, e.g. http://internal.check.service/get?url=https://www.bl.uk. This would be enqueued and extracted as normal by Heritrix, using the onreadydom if the render worked, or the normal response if it failed for some reason, with link extraction proceeding as usual. This would give Web Render link extraction a second chance, and also give it a possibility of picking up URLs we missed (srcset etc.).
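As a concrete sketch of the 'pretend' URL shape, something like this (the host internal.check.service and the get?url= form come from the example above; the encoding and parsing details are my assumption, and need Java 10+ for the Charset overloads):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SyntheticCheckUri {

    /** Build the synthetic check URL for a seed. */
    static String toCheckUri(String seed) {
        return "http://internal.check.service/get?url="
                + URLEncoder.encode(seed, StandardCharsets.UTF_8);
    }

    /** Recover the original URL on the service side. */
    static String toOriginalUri(String checkUri) {
        String encoded = checkUri.substring(checkUri.indexOf("url=") + 4);
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String check = toCheckUri("https://www.bl.uk/");
        // Heritrix would enqueue 'check' after dealing with the seed; the
        // internal service would serve the onreadydom if available, or the
        // original response otherwise, and extraction proceeds as normal.
        System.out.println(check);
        System.out.println(toOriginalUri(check));
    }
}
```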

The main drawback is that this would 'pollute' the logs and WARCs with content that doesn't mean quite the same thing as the rest. However, the WARC pollution could be limited by adding an annotation that is configured to prevent those records being written. The entries in the crawl log are probably fine, and would act as an indicator of what had been done.
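One way to wire that up would be a DecideRule that rejects anything carrying the annotation, used to gate the WARC writer (assuming the writer processor can be configured with such a rule, or an equivalent check added). A minimal sketch, with a placeholder annotation name:

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.DecideRule;

/**
 * Illustrative only: REJECT any CrawlURI carrying the blocking annotation,
 * so synthetic check fetches stay out of the WARCs while their crawl-log
 * entries remain as a record of what was done.
 */
public class AnnotationBlocksWriteRule extends DecideRule {

    private static final long serialVersionUID = 1L;

    // Placeholder name, not an existing ukwa-heritrix annotation.
    private String blockingAnnotation = "syntheticCheckUri";

    public void setBlockingAnnotation(String blockingAnnotation) {
        this.blockingAnnotation = blockingAnnotation;
    }

    @Override
    protected DecideResult innerDecide(CrawlURI curi) {
        if (curi.getAnnotations().contains(blockingAnnotation)) {
            return DecideResult.REJECT;
        }
        return DecideResult.NONE;
    }
}
```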

Reminder: marking WebRendered URLs as -5002 prevented enqueueing, so we had to add processErrorOutlinks=true. Marking WebRendered URLs with the real success status code had a different problem: they were written to WARCs (which we could block instead) and, I think, duplicate records were sent to OutbackCDX. I believe that was it.

anjackson commented 5 years ago

The problem of webrender-api capacity could be dealt with simply by reducing the backlog so that the web rendering fast-fails on overload rather than delaying. But that's only one failure mode.

anjackson commented 5 years ago

n.b. we should probably add monitoring and alerts for

anjackson commented 5 years ago

For #34 I added an annotation that can be used to skip the disposition chain if needed. This means we could go back to emitting the true status code, stop processing error outlinks, and add the annotation instead so that OutbackCDX is not updated. This same mechanism could be used to avoid depositing any 'synthetic' URLs.
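A minimal sketch of how that combination might look. The annotation name and helper class are placeholders for illustration, not the actual #34 code:

```java
import org.archive.modules.CrawlURI;

public class WebRenderDispositionHelper {

    // Placeholder annotation; the real one comes from the #34 change.
    public static final String SKIP_DISPOSITION_ANNOTATION = "skipDisposition";

    /**
     * After a successful web render: report the true fetch status (so no
     * processErrorOutlinks workaround is needed) but tag the URI so the
     * disposition chain is skipped and OutbackCDX is not updated again.
     */
    public static void markRendered(CrawlURI curi, int trueStatusCode) {
        curi.setFetchStatus(trueStatusCode); // e.g. 200 rather than -5002
        curi.getAnnotations().add(SKIP_DISPOSITION_ANNOTATION);
    }
}
```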

anjackson commented 5 years ago

This now seems to work acceptably well.