internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Heritrix 3.3: robots.txt wildcard support? #353

Closed: wroth closed this issue 3 years ago

wroth commented 4 years ago

Does Heritrix 3.3 support wildcards in robots.txt Disallow directives? I urge that a clear "yes" or "no" answer be added to the documentation.

From my experimentation, it appears that it does not support wildcards. For example, with this rule in place:

Disallow: /*/output/

Heritrix still crawled URLs like /docview/5819152/FE3F6F718FE34D90PQ/5819152/5819152/Record/FE3F6F718FE34D90PQ/input/MathML

ato commented 3 years ago

Heritrix does not currently support the robots.txt wildcard extension. There is an open feature request for it at #250. I've updated the note to webmasters in the GitHub wiki and the old Confluence wiki to say so. Thanks!
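
For anyone wondering what the difference looks like in practice, here is a minimal sketch in Python. It is not Heritrix code. It assumes that a crawler without the wildcard extension treats each Disallow value as a literal path prefix, and that the extension means "*" matches any run of characters and a trailing "$" anchors the end of the path. The function names and the sample path are hypothetical, chosen only for illustration.

```python
import re


def prefix_disallows(rule: str, path: str) -> bool:
    """Literal prefix matching (assumed non-wildcard behavior):
    '*' is just an ordinary character in the rule."""
    return path.startswith(rule)


def wildcard_disallows(rule: str, path: str) -> bool:
    """Wildcard extension (assumed semantics): '*' matches any run of
    characters, and a trailing '$' anchors the match at the end of the path."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' in place of each '*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start, so the rule still acts as a prefix.
    return re.match(pattern, path) is not None


rule = "/*/output/"
path = "/docview/123/output/result"  # hypothetical path, not from a real crawl

print(prefix_disallows(rule, path))    # literal prefix check
print(wildcard_disallows(rule, path))  # wildcard-aware check
```

Under literal prefix matching, no real path starts with the characters "/*/output/", so a rule like this blocks nothing. Until wildcard support lands, only rules written as plain path prefixes can be expected to take effect.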