Closed by anjackson 5 years ago
The problem of `webrender-api` capacity could be dealt with simply by reducing the backlog, so that web rendering fast-fails on overload rather than delaying. But that's only one failure mode. n.b. we should probably add monitoring and alerts for `webrender-api` capacity.
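As a rough illustration of the fast-fail idea (a minimal sketch, not the actual `webrender-api` implementation; the queue size and function names are made up), a small bounded backlog rejects new render requests immediately once it is full, instead of letting them pile up and time out later:

```python
import queue

# Hypothetical bounded backlog for render requests: a small maxsize makes
# overload visible immediately instead of silently queueing work.
render_backlog = queue.Queue(maxsize=2)

def submit_render(url):
    """Return True if accepted, False if the backlog is full (fast-fail)."""
    try:
        render_backlog.put_nowait(url)
        return True
    except queue.Full:
        return False
```

With `maxsize=2`, the third concurrent submission fails fast rather than waiting, which is the behaviour we would want under a load peak.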
For #34 I added an annotation that can be used to skip the disposition chain if needed. This means we could go back to emitting the true status code, not processing error outlinks, and add the annotation instead so that OutbackCDX is not updated. This same mechanism could be used to avoid depositing any 'synthetic' URLs.
This now seems to work acceptably well.
Occasionally, due to edge cases or load peaks, the Heritrix engine can think the web rendering failed on the first pass, when in fact it just didn't complete cleanly or within the time-out. Because it appeared to fail, H3 cannot extract the links to enqueue them; instead, it defers and then retries the download. This was intended to retry using `FetchHTTP` instead, but in these specific cases the `RecentlySeenDecideRule` spots that the URL has been captured (because `warcprox` recorded it as such during the successful part of the download). It is therefore discarded (`-5000`) rather than re-crawled. Unfortunately, this means that the outlinks are never captured. If no other process happens across those links, we don't get anything else from the site.
😞
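To make the failure mode concrete, here is a minimal sketch (hypothetical Python, with made-up names; the set stands in for whatever store `RecentlySeenDecideRule` actually consults) of how a timed-out render plus a recorded capture leads to the retry being discarded:

```python
# Hypothetical sketch of the retry-vs-recently-seen interaction described
# above; `recently_seen` stands in for the state RecentlySeenDecideRule checks.
recently_seen = set()

def render_pass(url):
    """Simulate a render that warcprox records but that times out for Heritrix."""
    recently_seen.add(url)          # warcprox saw the capture...
    return None                     # ...but no outlinks come back in time

def retry_pass(url):
    """On retry, the recently-seen check discards the URL (-5000)."""
    if url in recently_seen:
        return -5000                # discarded: outlinks never extracted
    return 200                      # would otherwise be fetched via FetchHTTP

outlinks = render_pass("https://www.bl.uk/")
status = retry_pass("https://www.bl.uk/")
```

The first pass yields no outlinks, and the retry is rejected before it can fetch anything, so the links are lost on both passes.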
It's not clear there's a huge amount we can do about this within Heritrix itself. We've coupled the processes together like this on purpose, to avoid sites getting crawled multiple times by multiple crawlers, so decoupling them isn't really an option.
One possibility would be to add/extend our `warcprox` modules so they can post links to a queue. But this involves putting quite a bit of logic into `warcprox` that doesn't really belong there.

It's somewhat related to #28, in that we want to make sure we extract links from the rendered DOM, which we can't get hold of very easily.
Another idea would be a post-crawl QA/checking process that scans what has happened to seeds, looks for outlinks at the same time, and posts them to Kafka for download.
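That checking pass could be sketched roughly as follows (a hypothetical outline, not real Heritrix or warcprox code: the log-entry shape and the `send` callback are assumptions, and in a real deployment `send` would wrap a Kafka producer's `send()`):

```python
import json

def qa_outlink_pass(crawl_log_entries, send):
    """Post outlinks of failed seeds for download.

    `crawl_log_entries` is an assumed shape: dicts with 'url', 'status' and
    'outlinks'. `send` is a callback taking one serialized message; in
    practice it would publish to a Kafka topic.
    """
    posted = []
    for entry in crawl_log_entries:
        if entry["status"] < 0:     # render failed / was discarded
            for link in entry.get("outlinks", []):
                send(json.dumps({"url": link}))
                posted.append(link)
    return posted
```

The key design point is that this runs outside the crawl itself, so it can recover outlinks even when the Heritrix-side retry was discarded.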
I wonder whether it would make sense to crawl a 'pretend' website that makes it easier to pipe these things through Heritrix. e.g. we hit a seed https://www.bl.uk/, and after we deal with it, we enqueue a 'pretend' URL that gives us access to the `onreadydom` or, failing that, the original response, e.g. `http://internal.check.service/get?url=https://www.bl.uk`. This would be enqueued and extracted as normal by Heritrix, using the `onreadydom` if that worked, or the normal response if that failed for some reason. Link extraction would proceed as normal. This would give Web Render link extraction a second chance, and also a possibility of picking up URLs we missed (`srcset` etc.).

The main drawback is this would 'pollute' the logs and WARCs with content that didn't really mean what the rest of it means. However, the WARC pollution could be limited by adding an annotation that would be configured to prevent the records being written. The entries in the crawl log are probably fine, and would act as an indicator of what had been done.
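Wrapping and unwrapping the 'pretend' URLs could look something like this (hypothetical helpers; `internal.check.service` is just the placeholder host from the example above, and I've added percent-encoding of the seed, which the example leaves implicit, so that query strings in seeds round-trip safely):

```python
from urllib.parse import quote, urlparse, parse_qs

# Placeholder host/path from the example above; not a real service.
CHECK_PREFIX = "http://internal.check.service/get?url="

def to_check_url(seed):
    """Wrap a seed so Heritrix can enqueue a second-chance extraction pass."""
    return CHECK_PREFIX + quote(seed, safe="")

def from_check_url(check_url):
    """Recover the original seed from a 'pretend' URL."""
    return parse_qs(urlparse(check_url).query)["url"][0]
```

Encoding the seed means even seeds containing `?` or `&` survive the round trip through the pretend URL.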
Reminder: if we mark WebRendered URLs as `-5002`, this prevented enqueueing, so we had to add `processErrorOutlinks=true`. If we marked WebRendered URLs with the true success status code instead, there was a different problem: the records were written to the WARCs (which we could perhaps block instead), and duplicate records were sent to OutbackCDX. I think the duplicates were the main issue.