Question on robots.txt

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Websites / departments in my organisation usually have a robots.txt with the following simple entry:

User-agent: *
Disallow: /*?*
Sitemap: https://www.[domain].org/sitemaps/[domain].xml

I am using Heritrix 3.4 to crawl and am not sure how to deal with this. I tend to set <property name="robotsPolicyName" value="ignore"/>, but I wonder whether this a) is considered friendly and b) has negative side effects. So my questions are:

How does Heritrix handle the Disallow statement above? In my interpretation, it excludes only URLs that contain a ? anywhere. But could Heritrix treat the pattern more greedily, i.e. disallow everything?
Does Heritrix take the Sitemap statement into account?
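
To make the interpretation in the first question concrete, here is a minimal sketch of wildcard matching as described in RFC 9309, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. It is not Heritrix's matcher and says nothing about what Heritrix actually does; the function names `rule_to_regex` and `is_disallowed` are hypothetical and exist only for this illustration.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a compiled regex.

    Assumes RFC 9309-style wildcards: '*' matches any sequence of
    characters, and a trailing '$' anchors the match at the end.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.compile(pattern)


def is_disallowed(path_and_query: str, rule: str) -> bool:
    """Return True if the rule matches from the start of the path."""
    return rule_to_regex(rule).match(path_and_query) is not None


if __name__ == "__main__":
    rule = "/*?*"  # the Disallow value from the robots.txt above
    for url_path in ["/about/team", "/search?q=heritrix", "/", "/a/b?x=1&y=2"]:
        print(url_path, is_disallowed(url_path, rule))
```

Under these assumptions, the rule blocks only paths that contain a `?`, and plain paths such as `/about/team` stay allowed. A crawler that does not implement wildcards and instead treats the rule as a literal path prefix would behave differently, which is the uncertainty behind the question.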

Question on robots.txt #371