ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Crawler obeying nofollow directive when instructed to ignore robots.txt #64

Closed anjackson closed 3 years ago

anjackson commented 3 years ago

When crawling a site, links marked nofollow were not followed even though the site was marked as ignoreRobots. It turns out that relying on calculateRobotsOnly, as the ignoreRobots sheet does, does not cover all cases. The HTML extractors all check the current Robots Policy to decide whether to follow links, and if the policy says not to, the links are never extracted in the first place. This happens whether or not calculateRobotsOnly is set, because that setting only affects how already-extracted URLs are handled.

The code has been edited to change the Robots Policy instead, although that setting had been commented out with no indication of why. I'm going to ask around about usage of calculateRobotsOnly.

https://github.com/ukwa/ukwa-heritrix/blob/dd1e4e161afb39fd579c52e8a0ef8c0bbd19a715/jobs/frequent/sheets.xml#L274-L283
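
As a rough illustration of the difference (a sketch, not a copy of the linked lines), a sheet along these lines overrides the policy rather than relying on calculateRobotsOnly. It assumes the stock Heritrix3 bean ids `metadata` and `preconditions`, and that the `ignore` policy also disables nofollow handling in the extractors; the actual sheet may differ.

```xml
<!-- Illustrative sketch only, not the contents of the linked sheets.xml.
     Assumes the stock Heritrix3 bean ids "metadata" (CrawlMetadata) and
     "preconditions" (PreconditionEnforcer). -->
<bean id="ignoreRobots" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <!-- Previous approach: robots rules are calculated but not enforced
           when fetching. Extractors still see the default policy, so
           nofollow links are never extracted. -->
      <!-- <entry key="preconditions.calculateRobotsOnly" value="true"/> -->

      <!-- Policy override: extractors consult this policy, so nofollow
           should no longer suppress link extraction where the sheet applies. -->
      <entry key="metadata.robotsPolicyName" value="ignore"/>
    </map>
  </property>
</bean>
```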

anjackson commented 3 years ago

Well, no feedback, so we'll push on with dd1e4e161afb39fd579c52e8a0ef8c0bbd19a715.