Closed: anjackson closed this issue 5 years ago
Short term, reverting that change is the best idea: https://github.com/ukwa/ukwa-heritrix/commit/95bff9d5cf6c1da70ab81dbe7ad5c3b3a2ab09cc
(This only means that any URLs that were guessed/inferred while processing an out-of-scope URL will get enqueued with an overly-conservative politeness delay, which is acceptable).
Not sure what to do. One option would be to split `DispositionProcessor` so that it's less of a 'do all the bits and bobs at the end' processor and is instead a sequence:

- `UpdateRobotsTxt`
- `SetCrawlDelay`
- `TallyConnectionErrors`
- `CheckForcedRetirement`

However, this latter option will break sheet-based overrides, which is fine for us, I guess.
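The split described above can be sketched roughly as follows. This is a hypothetical illustration only: the `Step` interface is invented for this sketch and is not the real Heritrix `Processor` API, and the four step names are the ones proposed in the comment above.

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: splitting the monolithic end-of-chain processor
 * into an ordered sequence of single-purpose steps. The interface and
 * wiring here are illustrative, not the real Heritrix Processor API.
 */
public class DispositionChain {

    /** One narrow piece of what DispositionProcessor does today. */
    interface Step {
        String name();
    }

    /** The proposed ordering from the comment above. */
    static final List<Step> STEPS = List.<Step>of(
            () -> "UpdateRobotsTxt",       // refresh cached robots.txt if this was a robots fetch
            () -> "SetCrawlDelay",         // apply politeness / crawl-delay to the frontier queue
            () -> "TallyConnectionErrors", // count errors toward queue retirement
            () -> "CheckForcedRetirement"  // retire the queue if configured limits were hit
    );

    /**
     * A real implementation would pass the CrawlURI through each step in
     * turn; here we just report the order the steps would fire in.
     */
    public static List<String> run() {
        return STEPS.stream().map(Step::name).collect(Collectors.toList());
    }
}
```

The advantage of a sequence like this is that individual steps could be reordered or skipped per-URI, which is exactly what the workaround below needs; the cost, as noted, is that sheet-based overrides targeting the single `DispositionProcessor` bean would break.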
Simplest modification to make for now is a new `DispositionProcessor` class that is the same as the core one, but that ignores `-5000` OUT OF SCOPE events when considering updating `robots.txt`.
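The guard that such a modified processor needs can be sketched like this. This is a minimal, self-contained illustration of the decision logic only: the class and method names are invented, and it does not subclass the real Heritrix `DispositionProcessor`. The `-5000` status code is Heritrix's OUT OF SCOPE disposition, as mentioned above.

```java
/**
 * Hypothetical sketch of the modified disposition logic: only allow a
 * fetch outcome to update the stored robots.txt for a host when the
 * outcome reflects a real server response. Names are illustrative.
 */
public class ScopeAwareDisposition {

    /** Heritrix's fetch status for out-of-scope URIs. */
    public static final int S_OUT_OF_SCOPE = -5000;

    /**
     * Returns true if this fetch outcome should be allowed to update the
     * cached robots.txt record. A -5000 OUT OF SCOPE disposition never
     * contacted the server, so it must not invalidate an existing,
     * perfectly valid robots.txt record for that host.
     */
    public static boolean shouldUpdateRobots(String uriPath, int fetchStatus) {
        if (fetchStatus == S_OUT_OF_SCOPE) {
            return false; // scope rejection, not a server response
        }
        // Only robots.txt fetches themselves carry robots information.
        return uriPath.endsWith("/robots.txt");
    }
}
```

With this guard in place, an out-of-scope `robots.txt` URI passes through the disposition step without touching the cached robots record, while genuine fetches (success or server error) update it as before.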
Implemented a modified disposition processor: https://github.com/ukwa/ukwa-heritrix/commit/bd497b605e7652fe15509a1b507dd85ab78690ac
I modified the crawler to skip to the DispositionProcessor rather than to the end of the disposition chain, so that it would get the crawl delay right.
However, this also means `-5000` robots.txt events invalidate the robots.txt records and lead to lots of `-61` events, which causes this issue to return. So, the crawler needs further modification to make this more reliable.