ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Spot GONE links and annotate them? #62

Open anjackson opened 3 years ago

anjackson commented 3 years ago

Given the OutbackCDX lookup, do we know the status code of the previous visit? If so, can we check whether a URL has changed status code, e.g. 200 -> 40x? When we notice such a change, we could add an annotation like GONE:403/MOVED:307/ERROR:500/OK:200 (or perhaps just StatusChangedFrom:200, as the new status is already held in the log line).
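
To make the idea concrete, here is a rough sketch of the annotation logic in Python. It is not Heritrix or OutbackCDX code: the function names are hypothetical, the mapping from status ranges to the GONE/MOVED/ERROR/OK labels is just one reading of the examples above, and it assumes the previous status code has already been looked up somehow.

```python
from typing import Optional


def status_class(status: int) -> str:
    """Map a status code to a coarse label (hypothetical mapping based on the examples above)."""
    if 200 <= status < 300:
        return "OK"
    if 300 <= status < 400:
        return "MOVED"
    if 400 <= status < 500:
        return "GONE"
    # Everything else, including 5xx and the crawler's negative fetch-status codes.
    return "ERROR"


def status_change_annotation(
    previous: Optional[int], current: int, short: bool = False
) -> Optional[str]:
    """Return an annotation if the status changed since the previous visit, else None.

    previous: status code from the earlier capture, or None if there was none.
    current:  status code of the visit just made.
    short:    if True, emit only the old status, since the new one is already in the log line.
    """
    if previous is None or previous == current:
        return None
    if short:
        return f"StatusChangedFrom:{previous}"
    return f"{status_class(current)}:{current}"
```

For example, `status_change_annotation(200, 403)` gives `GONE:403`, and with `short=True` it gives `StatusChangedFrom:200`. An unchanged status or a first visit gives no annotation.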

anjackson commented 3 years ago

This might be better as a post-crawl analysis, rather than loading more into the crawler, but it's worth considering.
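
As a sketch of what the post-crawl version could look like: assuming we can export capture records as (url, timestamp, status) tuples, something like the following would list every point where a URL's status code changed between consecutive captures. The record shape and the function name are assumptions for illustration, not an existing export format or API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (url, fixed-width timestamp string, status code)
Capture = Tuple[str, str, int]


def find_status_changes(captures: Iterable[Capture]) -> List[Tuple[str, str, int, int]]:
    """Return (url, timestamp, old_status, new_status) for each status change.

    Captures may arrive in any order; they are grouped by URL and sorted by
    timestamp, which works for fixed-width timestamp strings.
    """
    by_url: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for url, timestamp, status in captures:
        by_url[url].append((timestamp, status))

    changes = []
    for url in sorted(by_url):
        history = sorted(by_url[url])
        # Compare each capture with the one immediately before it.
        for (_, old), (timestamp, new) in zip(history, history[1:]):
            if old != new:
                changes.append((url, timestamp, old, new))
    return changes
```

This keeps the crawler itself unchanged, at the cost of only finding out about the change after the crawl has finished.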