Closed by anjackson 5 years ago
So, the issue is the final `DispositionProcessor` in the `DispositionChain`, which sees the `-5000: OUT_OF_SCOPE` status and interprets it as a failed robots.txt download.
I don't think it actually makes any sense that the `DispositionChain` is always executed, as for `-5000: OUT_OF_SCOPE` (and perhaps `-5002: BLOCKED_BY_CUSTOM_PROCESSOR`?) there is nothing to dispose. So, we could add a processor that skips to the `FINISH` of the `DispositionChain` under those circumstances.
The crawl log itself is updated in the `ToeThread`, so that would still happen, but note that the Kafka crawl log would not be updated with these events (unless that `Processor` is moved up the chain so it happens before the proposed `SkipDispositionProcessor`).
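To make the proposal concrete, here is a rough, self-contained sketch of the idea. It does not use the real Heritrix classes: `Uri`, `Processor`, `Result`, `RecordingProcessor` and `runChain` are simplified stand-ins I've made up for illustration, and only the two status codes are taken from the discussion above. It shows a skip processor ending the chain early for those statuses, and why the Kafka crawl log step would need to sit before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only: these are NOT the real Heritrix classes or API.
public class SkipDispositionSketch {

    // Status codes quoted in the discussion above.
    static final int S_OUT_OF_SCOPE = -5000;
    static final int S_BLOCKED_BY_CUSTOM_PROCESSOR = -5002;

    /** Outcome of one processor: carry on down the chain, or end it here. */
    enum Result { PROCEED, FINISH }

    /** Minimal stand-in for a URI travelling through the chain. */
    static final class Uri {
        final String url;
        final int fetchStatus;
        final List<String> visited = new ArrayList<>();

        Uri(String url, int fetchStatus) {
            this.url = url;
            this.fetchStatus = fetchStatus;
        }
    }

    /** Minimal stand-in for a chain processor. */
    interface Processor {
        String name();
        Result process(Uri uri);
    }

    /** Hypothetical processor: finish the chain when there is nothing to dispose. */
    static final class SkipDispositionProcessor implements Processor {
        public String name() { return "skipDisposition"; }

        public Result process(Uri uri) {
            boolean nothingToDispose = uri.fetchStatus == S_OUT_OF_SCOPE
                    || uri.fetchStatus == S_BLOCKED_BY_CUSTOM_PROCESSOR;
            return nothingToDispose ? Result.FINISH : Result.PROCEED;
        }
    }

    /** Placeholder for any other step; it does nothing and lets the chain continue. */
    static final class RecordingProcessor implements Processor {
        private final String name;
        RecordingProcessor(String name) { this.name = name; }
        public String name() { return name; }
        public Result process(Uri uri) { return Result.PROCEED; }
    }

    /** Runs processors in order, stopping as soon as one returns FINISH. */
    static List<String> runChain(List<Processor> chain, Uri uri) {
        for (Processor p : chain) {
            uri.visited.add(p.name());
            if (p.process(uri) == Result.FINISH) {
                break;
            }
        }
        return uri.visited;
    }

    /** Example ordering: the Kafka crawl log step sits before the skip. */
    static List<Processor> exampleChain() {
        return Arrays.asList(
                new RecordingProcessor("kafkaCrawlLog"),
                new SkipDispositionProcessor(),
                new RecordingProcessor("disposition"));
    }

    public static void main(String[] args) {
        // Out-of-scope URI: logged to Kafka, then the chain ends before disposition.
        System.out.println(runChain(exampleChain(),
                new Uri("http://example.org/robots.txt", S_OUT_OF_SCOPE)));
        // prints [kafkaCrawlLog, skipDisposition]

        // Normally fetched URI: the whole chain runs.
        System.out.println(runChain(exampleChain(),
                new Uri("http://example.org/", 200)));
        // prints [kafkaCrawlLog, skipDisposition, disposition]
    }
}
```

If the `kafkaCrawlLog` step were listed after the skip instead, it would never see the `-5000` or `-5002` events, which is the ordering concern raised above.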
See, for example, robots.txt getting blocked because we've seen it recently, which then leads to a cascade of `-61` events.