ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Ensure blocked sites are fully blocked. #85

Closed: anjackson closed this 1 year ago

anjackson commented 1 year ago

Unfortunately, this sequence of scope rules has an undesirable effect:

https://github.com/ukwa/ukwa-heritrix/blob/c54e4f75fe42c031366289ff3031a68361081920/jobs/frequent/crawler-beans.cxml#L150-L158

Specifically, if we're trying to stop the crawler from visiting a site, requests for pre-requisites still slip through. Hence folks get irritated with us requesting robots.txt when all the crawler is going to do is throw away the rest of the crawl queue.

The order of these should be switched so that exclusions always have the final word.
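
For context, a `DecideRuleSequence` applies its rules in order and the last rule to return a decision wins, so giving exclusions the final word amounts to moving the REJECT-ing exclusion rule below `PrerequisiteAcceptsDecideRule` (which ACCEPTs pre-requisites such as robots.txt and DNS lookups). A minimal sketch, with most rules and properties elided; see the linked crawler-beans.cxml for the real sequence:

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- ... the usual ACCEPT/REJECT scope rules ... -->

      <!-- ACCEPTs pre-requisites (robots.txt, DNS) of in-scope URIs: -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptsDecideRule" />

      <!-- Exclusions placed last, so their REJECT cannot be overridden
           by the pre-requisite rule above (configuration elided): -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="decision" value="REJECT" />
      </bean>
    </list>
  </property>
</bean>
```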

anjackson commented 1 year ago

On further investigation, this was not the root issue, and it is unlikely to happen in practice (a URL that requires the pre-requisite would be needed to instigate the process in the first place, and those URLs should themselves be blocked).

The root cause was that adding `domain.com` to the exclusion list does not automatically block `www.domain.com`, i.e. blocked URLs are not fully canonicalized. The `+domain.com` syntax can be used instead: it generates an open `http://(com,domain,` SURT prefix, which blocks the whole domain, subdomains included.
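
For illustration only, here is how such an entry could be fed to a standard `SurtPrefixedDecideRule` (the bean wiring below is a sketch based on the stock Heritrix3 profile, not necessarily how this job's exclusion list is actually wired up):

```xml
<!-- Sketch: a REJECT rule reading SURT-prefix exclusions inline. -->
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <property name="decision" value="REJECT" />
  <property name="surtsSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
# A plain 'domain.com' entry does not block www.domain.com.
# The '+' form is converted to the open SURT prefix
# http://(com,domain, which covers www.domain.com as well:
+domain.com
        </value>
      </property>
    </bean>
  </property>
</bean>
```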