ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Blocked re-crawl of robots.txt causing failure cascade for host #34

Closed: anjackson closed this issue 5 years ago

anjackson commented 5 years ago

See, for example:

2019-03-28T22:32:15.322Z -5000          - https://conservativehome.blogs.com/robots.txt LLLREELLELRLLLEPR http://conservativehome.blogs.com/robots.txt unknown #022 - - tid:87632:https://theconversation.com/brexit-an-escape-room-with-no-escape-109935/ - {"scopeDecision":"REJECT by rule #13 OutbackCDXRecentlySeenDecideRule"}

i.e. the robots.txt re-fetch is being blocked because we've seen it recently, and this leads to a cascade of -61 (robots.txt prerequisite failure) events for the rest of that host's URIs.

anjackson commented 5 years ago

So, the issue is the final DispositionProcessor in the DispositionChain, which sees the -5000: OUT_OF_SCOPE status and interprets it as a failed robots.txt download.

I don't think it actually makes sense for the DispositionChain to always be executed in full: for -5000: OUT_OF_SCOPE (and perhaps -5002: BLOCKED_BY_CUSTOM_PROCESSOR?) there is nothing to dispose of. So, we could add a processor that skips to the FINISH of the DispositionChain under those circumstances, along the lines of the sketch below.
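
Roughly the sort of thing I have in mind (just a sketch, assuming the standard Heritrix Processor/ProcessResult API and the FetchStatusCodes constants for -5000/-5002; the package and class names are illustrative):

```java
package uk.bl.wap.modules; // illustrative package name

import org.archive.modules.CrawlURI;
import org.archive.modules.ProcessResult;
import org.archive.modules.Processor;
import org.archive.modules.fetcher.FetchStatusCodes;

/**
 * Sketch of the proposed processor: placed at the start of the
 * DispositionChain, it jumps straight to the end of the chain for URIs
 * that were never fetched, so the final DispositionProcessor does not
 * misread -5000 as a failed robots.txt download.
 */
public class SkipDispositionProcessor extends Processor {

    @Override
    protected boolean shouldProcess(CrawlURI curi) {
        int status = curi.getFetchStatus();
        // -5000 OUT_OF_SCOPE, -5002 BLOCKED_BY_CUSTOM_PROCESSOR
        return status == FetchStatusCodes.S_OUT_OF_SCOPE
                || status == FetchStatusCodes.S_BLOCKED_BY_CUSTOM_PROCESSOR;
    }

    @Override
    protected ProcessResult innerProcessResult(CrawlURI curi) {
        // FINISH skips the remaining processors in this chain.
        return ProcessResult.FINISH;
    }

    @Override
    protected void innerProcess(CrawlURI curi) {
        // Not used; flow control is handled in innerProcessResult().
    }
}
```

In the crawler-beans configuration this would be wired in as the first processor in the DispositionChain's list, so everything downstream of it is skipped for these statuses.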

The crawl log itself is updated in the ToeThread, so that would still happen, but note that the Kafka crawl log would not be updated with these events (unless that Processor is moved up the chain so it runs before the proposed SkipDispositionProcessor).