ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Ensure refusal of robots.txt recrawls does not invalidate cached robots.txt info #40

Closed: anjackson closed this issue 5 years ago

anjackson commented 5 years ago

I modified the crawler so that out-of-scope URIs skip to the DispositionProcessor rather than to the end of the disposition chain, so that the crawl delay would be set correctly.

However, this also means that -5000 (OUT OF SCOPE) robots.txt events invalidate the cached robots.txt records and lead to lots of -61 (robots.txt prerequisite failure) events, so this issue returns. The crawler therefore needs modifying to make this more reliable.
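For reference, these numeric codes correspond to constants in Heritrix3's `org.archive.modules.fetcher.FetchStatusCodes` interface, roughly as follows (values quoted from memory, worth double-checking against the source):

```java
// From org.archive.modules.fetcher.FetchStatusCodes (Heritrix3), as I recall:
public static final int S_ROBOTS_PREREQUISITE_FAILURE = -61;  // prerequisite robots.txt fetch failed
public static final int S_OUT_OF_SCOPE = -5000;               // URI ruled out of scope during processing
```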

anjackson commented 5 years ago

Short term, reverting that change is the best idea: https://github.com/ukwa/ukwa-heritrix/commit/95bff9d5cf6c1da70ab81dbe7ad5c3b3a2ab09cc

(This only means that any URLs that were guessed/inferred while processing an out-of-scope URL will get enqueued with an overly conservative politeness delay, which is acceptable.)

anjackson commented 5 years ago

Not sure what to do. One option would be to split DispositionProcessor so that it's less of a 'do all the bits and bobs at the end' processor and instead a sequence of smaller processors: UpdateRobotsTxt, SetCrawlDelay, TallyConnectionErrors, CheckForcedRetirement (one such step is sketched below). However, splitting it up like this would break sheet-based overrides, which is fine for us, I guess.
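For illustration, one step of such a split might look like this sketch. The class name, package, and wiring are hypothetical; the assumptions taken from Heritrix3 are that processors extend `org.archive.modules.Processor`, overriding `shouldProcess`/`innerProcess`, and that `CrawlServer` keeps the consecutive-connection-error tally:

```java
package uk.bl.wap.crawler.postprocessor; // hypothetical package

import static org.archive.modules.fetcher.FetchStatusCodes.S_CONNECT_FAILED;
import static org.archive.modules.fetcher.FetchStatusCodes.S_CONNECT_LOST;

import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;
import org.archive.modules.net.CrawlServer;
import org.archive.modules.net.ServerCache;

/**
 * Hypothetical standalone step from the proposed split: it only tallies
 * consecutive connection errors per server, leaving robots.txt updates,
 * crawl-delay setting and forced retirement to their own processors.
 */
public class TallyConnectionErrors extends Processor {

    protected ServerCache serverCache; // wired up in crawler-beans.cxml

    public void setServerCache(ServerCache serverCache) {
        this.serverCache = serverCache;
    }

    @Override
    protected boolean shouldProcess(CrawlURI curi) {
        String scheme = curi.getUURI().getScheme().toLowerCase();
        return scheme.equals("http") || scheme.equals("https");
    }

    @Override
    protected void innerProcess(CrawlURI curi) {
        CrawlServer server = serverCache.getServerFor(curi.getUURI());
        if (server == null) {
            return;
        }
        int status = curi.getFetchStatus();
        if (status == S_CONNECT_FAILED || status == S_CONNECT_LOST) {
            server.incrementConsecutiveConnectionErrors();
        } else if (status > 0) {
            server.resetConsecutiveConnectionErrors();
        }
    }
}
```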

anjackson commented 5 years ago

The simplest modification for now is to create a new DispositionProcessor class that is the same as the core one, but that ignores -5000 (OUT OF SCOPE) events when deciding whether to update the cached robots.txt record. A sketch of the idea is below.
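Roughly what that could look like, as a sketch (class name and package hypothetical; the real implementation is in the commit linked below). Overriding `shouldProcess` makes the processor skip out-of-scope robots.txt URIs entirely, so the cached robots.txt record is left alone; note that this variant also skips the rest of the disposition bookkeeping for those URIs, whereas the change described above keeps everything except the robots.txt update:

```java
package uk.bl.wap.crawler.postprocessor; // hypothetical package

import static org.archive.modules.fetcher.FetchStatusCodes.S_OUT_OF_SCOPE;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.postprocessor.DispositionProcessor;
import org.archive.modules.CrawlURI;

/**
 * Sketch: a DispositionProcessor that ignores -5000 (OUT OF SCOPE)
 * robots.txt events, so a refused robots.txt recrawl cannot invalidate
 * the cached robots.txt information for the server.
 */
public class RobotsSafeDispositionProcessor extends DispositionProcessor {

    @Override
    protected boolean shouldProcess(CrawlURI curi) {
        try {
            if (curi.getFetchStatus() == S_OUT_OF_SCOPE
                    && "/robots.txt".equals(curi.getUURI().getPath())) {
                // An out-of-scope robots.txt fetch carries no new robots
                // information: leave the cached record untouched.
                return false;
            }
        } catch (URIException e) {
            // Unparseable path: fall through to normal handling.
        }
        return super.shouldProcess(curi);
    }
}
```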

anjackson commented 5 years ago

Implemented a modified disposition processor: https://github.com/ukwa/ukwa-heritrix/commit/bd497b605e7652fe15509a1b507dd85ab78690ac