tasfe / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Catch & report invalid robots.txt rules that include domain name in the URL path #20

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Specifically, catch cases where people put http://<domain> or <domain> as part 
of the path in a rule.

There's the question of what to do in that case when the domain matches the 
domain used to fetch the robots.txt file. I think we should try to honor the 
intent of the rule, which means treating the rule as if the author of the file 
hadn't messed up the syntax.
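
To make the proposal concrete, here is a minimal sketch in Java of how such a check might look. It is not the crawler-commons implementation: the class `RulePathCheck`, its `check` method and the `Result` type are hypothetical names made up for illustration. It assumes the caller passes the bare host name (no port) that the robots.txt file was fetched from. A scheme-less value is only treated as a domain when it matches that host, since something like `index.html` cannot otherwise be told apart from a host name. What to do with a rule naming a different domain is not decided here, so the sketch only reports it and returns no path.

```java
import java.util.Locale;

// Hypothetical helper, not part of crawler-commons.
final class RulePathCheck {

    /** Outcome of inspecting the value of one Allow/Disallow line. */
    static final class Result {
        final String path;       // path to match against, or null if no usable path was recovered
        final boolean reported;  // true if the value contained a domain and should be reported

        Result(String path, boolean reported) {
            this.path = path;
            this.reported = reported;
        }
    }

    private RulePathCheck() {}

    static Result check(String ruleValue, String robotsHost) {
        String value = ruleValue.trim();

        // Well-formed values start with '/' (or a wildcard); leave them alone.
        if (value.isEmpty() || value.startsWith("/") || value.startsWith("*")) {
            return new Result(value, false);
        }

        // Strip a leading http:// or https:// if present.
        String lower = value.toLowerCase(Locale.ROOT);
        String rest = value;
        if (lower.startsWith("http://")) {
            rest = value.substring("http://".length());
        } else if (lower.startsWith("https://")) {
            rest = value.substring("https://".length());
        }
        boolean hadScheme = rest.length() != value.length();

        // Split "<host>[:port]/<path>" into host and path.
        int slash = rest.indexOf('/');
        String host = slash < 0 ? rest : rest.substring(0, slash);
        // Assumption: a bare domain with no path is read as the site root.
        String path = slash < 0 ? "/" : rest.substring(slash);
        int colon = host.indexOf(':');
        if (colon >= 0) {
            host = host.substring(0, colon);
        }

        if (host.equalsIgnoreCase(robotsHost)) {
            // Same domain as the robots.txt file: honor the intent of the rule.
            return new Result(path, true);
        }
        if (hadScheme) {
            // A full URL for some other domain: report it, recover nothing.
            return new Result(null, true);
        }
        // No scheme and no host match: not treated as a domain in this sketch.
        return new Result(value, false);
    }
}
```

With this sketch, `Disallow: http://example.com/private/` in a file fetched from example.com would be reported and then handled as `Disallow: /private/`.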

Original issue reported on code.google.com by kkrugler...@transpac.com on 17 Mar 2013 at 6:36