internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Question on robots.txt #371

Closed by oschihin 2 years ago

oschihin commented 3 years ago

Websites / departments in my organisation usually have a robots.txt with the following simple entry:

User-agent: *
Disallow: /*?*
Sitemap: https://www.[domain].org/sitemaps/[domain].xml

I am not sure how to deal with it when crawling with Heritrix 3.4. I tend to set <property name="robotsPolicyName" value="ignore"/>, but wonder whether this a) is considered friendly and b) has negative side effects. So the question is:

ato commented 3 years ago

Heritrix does not currently support sitemaps (although there's a draft pull request adding it: #262) and does not support wildcards in Disallow lines (feature request #250). I haven't tested it, but I would guess the rule Disallow: /*?* will be interpreted as a plain prefix, matching only paths that actually start with the literal characters of the rule (/*?...). It will not match /index.html?foo.
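
To make the difference concrete, here is a minimal Python sketch contrasting the two interpretations. It is not Heritrix code: the function names literal_prefix_match and wildcard_match are made up for illustration, and the wildcard version only assumes the common convention that * matches any run of characters and a trailing $ anchors the end of the path.

```python
import re


def literal_prefix_match(rule: str, path: str) -> bool:
    """Hypothetical helper: treat the Disallow value as a plain prefix.

    This mirrors the guessed behaviour of a parser without wildcard support.
    """
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Hypothetical helper: treat '*' as 'any run of characters'.

    A trailing '$' anchors the rule to the end of the path; otherwise the
    rule only has to match from the start of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything except '*', which becomes the regex '.*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None


rule = "/*?*"
for path in ["/index.html?foo", "/index.html", "/*?*x"]:
    print(path, literal_prefix_match(rule, path), wildcard_match(rule, path))
```

Under the literal reading, only the odd path /*?*x is blocked, so URLs with query strings such as /index.html?foo would still be crawled. Under the wildcard reading, every path containing a ? is blocked, which is presumably what the site owners intended.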