anjackson closed this issue 1 year ago
On further investigation, this was not the root issue and is unlikely to happen in practice (as a URL that requires the pre-requisite would be needed to instigate the process, and those should be blocked).
The root cause was that adding `domain.com` to the exclusion list does not automatically block `www.domain.com`, i.e. blocked URLs are not fully canonicalized. The syntax `+domain.com` can be used to generate a `http://(com,domain,` block, which also covers `www.domain.com` and other subdomains.
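For illustration, a hedged sketch of how the two entry styles behave, per the description above; the comment syntax and the exact matching of the plain form are assumptions about the exclusion file, not verified against the ukwa-heritrix code:

```text
# Plain entry: blocks http://domain.com/... but, because blocked URLs are not
# fully canonicalized, http://www.domain.com/... still slips through.
domain.com

# '+' entry: expanded to the SURT prefix http://(com,domain, which matches
# domain.com and every subdomain (www.domain.com, sub.domain.com, ...).
+domain.com
```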
Unfortunately, this sequence of scope rules has an undesirable effect:
https://github.com/ukwa/ukwa-heritrix/blob/c54e4f75fe42c031366289ff3031a68361081920/jobs/frequent/crawler-beans.cxml#L150-L158
Specifically, if we're trying to stop the crawler from visiting a site, requests for pre-requisites still slip through. Hence folks get irritated with us for requesting robots.txt when all the crawler is going to do afterwards is throw out the rest of the crawl queue.
The order of these should be switched so that exclusions always have the final word.
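For example, the reordered sequence could look roughly like the sketch below. This is only an illustration using stock Heritrix decide-rule classes and an assumed exclusion-file path, not the actual ukwa-heritrix beans; the point is simply that the REJECT-ing exclusion rule sits after `PrerequisiteAcceptDecideRule`:

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- ... accept rules, hop limits, transclusion, etc. ... -->

      <!-- Prerequisites (dns:, robots.txt) are normally force-accepted here... -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule" />

      <!-- ...but the exclusion rule now comes last, so it has the final word
           and rejects prerequisites for blocked sites as well. -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="decision" value="REJECT" />
        <property name="seedsAsSurtPrefixes" value="false" />
        <!-- hypothetical path to the watched exclusion list -->
        <property name="surtsSourceFile" value="/jobs/frequent/exclude.surts" />
      </bean>
    </list>
  </property>
</bean>
```

Whatever the exact rule set, the key change is the relative order: prerequisite-accept first, exclusion-reject last.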