When crawling a site, links marked `nofollow` were not followed even though the site was marked as `ignoreRobots`. It turns out that the `calculateRobotsOnly` setting used in the `ignoreRobots` sheet does not cover all cases. The HTML extractors all consult the current robots policy to decide whether to follow links, and if the policy says no, they never extract the links in the first place. This happens whether or not `calculateRobotsOnly` is set, since that setting only governs the handling of already-extracted URLs.
The code has been edited to switch to changing the policy itself, although that approach had previously been commented out without any indication why. I'm going to ask around about how `calculateRobotsOnly` is used.
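For reference, a sheet overlay that changes the policy itself might look something like the following. This is a sketch only, assuming standard Heritrix 3 sheet syntax: the bean id and comments are illustrative and not taken from the linked sheets.xml, though `metadata.robotsPolicyName` is the usual Heritrix bean path for the robots policy and `ignore` one of its stock values.

```xml
<!-- Sketch: override the robots policy for URIs this sheet is associated
     with, instead of (or in addition to) setting calculateRobotsOnly.
     Bean id is illustrative. -->
<bean id="ignoreRobots" class="org.archive.crawler.spring.Sheet">
  <property name="map">
    <map>
      <!-- With the policy set to "ignore", the extractors should no longer
           suppress link extraction for nofollow links, because they consult
           this policy directly rather than calculateRobotsOnly. -->
      <entry key="metadata.robotsPolicyName" value="ignore"/>
    </map>
  </property>
</bean>
```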
https://github.com/ukwa/ukwa-heritrix/blob/dd1e4e161afb39fd579c52e8a0ef8c0bbd19a715/jobs/frequent/sheets.xml#L274-L283