Heritrix does not currently support the robots.txt wildcard extension. There is an open feature request for it at #250. I've updated the note to webmasters in both the GitHub wiki and the old Confluence wiki to say so. Thanks!
Does Heritrix 3.3 support wildcards in robots.txt Disallow directives? I urge that a clear "yes" or "no" answer be added to the documentation.
From my experimentation, it appears that it does not support wildcards. For example, with the rule Disallow: /*/output/ in place, Heritrix
still crawled URLs like /docview/5819152/FE3F6F718FE34D90PQ/5819152/5819152/Record/FE3F6F718FE34D90PQ/input/MathML
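
To illustrate the difference being discussed, here is a minimal Python sketch that contrasts literal prefix matching with wildcard matching for a Disallow rule. It is not Heritrix code. The function names `prefix_match` and `wildcard_match` and the example path are hypothetical, and it assumes that a crawler without the wildcard extension treats the rule as a literal path prefix, and that the extension uses `*` for any run of characters and a trailing `$` to anchor the end of the path.

```python
import re


def prefix_match(rule: str, path: str) -> bool:
    """Literal prefix semantics (assumed behavior without the wildcard extension)."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Wildcard semantics: '*' matches any characters, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' in place of each '*'.
    regex = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a prefix rule.
    return re.match(regex, path) is not None


rule = "/*/output/"
path = "/docview/123/output/MathML"  # hypothetical path, not taken from the report

print(prefix_match(rule, path))    # False: the path does not literally start with "/*/output/"
print(wildcard_match(rule, path))  # True: "*" absorbs "docview/123"
```

Under the prefix interpretation the rule would only block paths that begin with the literal characters `/*/output/`, which would be consistent with the rule appearing to have no effect on a crawl.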