mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Robots.txt parser is not working with Disallow: * #195

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl a host with robots.txt like
User-agent: *
Allow: /1.html
Disallow: *

2. assertTrue(RobotstxtServer.allows(host/1.html)) passes as expected,
but assertTrue(RobotstxtServer.allows(host/4.html)) also passes; allows() should return false for 4.html.

What is the expected output? What do you see instead?
With RobotstxtServer reading this robots.txt, I expect every address except 1.html
to be disallowed, but 4.html is still allowed.

What version of the product are you using?
3.4.0-SNAPSHOT

Please provide any additional information below.

If the robots.txt is more explicit:
User-agent: *
Allow: /1.html
Disallow: /4.html
I'm able to access 1.html but not 4.html, as expected.
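
Here is a minimal sketch of how I check this programmatically. It assumes the 3.x-style API where RobotstxtServer is built from a RobotstxtConfig and a PageFetcher and exposes an instance method allows(WebURL); "http://host" is a placeholder for a site serving the robots.txt above, and the exact signatures may differ between versions:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotstxtWildcardRepro {
    public static void main(String[] args) throws Exception {
        PageFetcher pageFetcher = new PageFetcher(new CrawlConfig());
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        WebURL page1 = new WebURL();
        page1.setURL("http://host/1.html"); // placeholder host with the robots.txt above
        WebURL page4 = new WebURL();
        page4.setURL("http://host/4.html");

        System.out.println("1.html allowed? " + robotstxtServer.allows(page1)); // true, as expected
        System.out.println("4.html allowed? " + robotstxtServer.allows(page4)); // true, but should be false
    }
}
```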

Original issue reported on code.google.com by acrocraw...@gmail.com on 22 Feb 2013 at 8:53

GoogleCodeExporter commented 9 years ago
A robots.txt like 
User-agent: *
Allow: /1.html
Disallow: /
works fine.

I'm not sure whether Disallow: * is simply an incorrect way of specifying pages that
are not allowed, but I would expect it to work, since Allow: * is a valid entry in
robots.txt.
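
For what it's worth, here is a guess at why Disallow: / works while Disallow: * does not. This is only an illustration, not crawler4j's actual code: if the Disallow value is matched as a literal path prefix, "*" is never a prefix of a real path, so the rule blocks nothing, whereas "/" is a prefix of every path.

```java
// Hypothetical illustration only, not crawler4j's implementation:
// treat the Disallow value as a literal path prefix.
public class DisallowPrefixDemo {
    static boolean blockedBy(String disallowValue, String path) {
        return path.startsWith(disallowValue);
    }

    public static void main(String[] args) {
        System.out.println(blockedBy("*", "/4.html")); // false: "*" taken literally matches nothing
        System.out.println(blockedBy("/", "/4.html")); // true:  "/" is a prefix of every path
    }
}
```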

Original comment by acrocraw...@gmail.com on 25 Feb 2013 at 8:28

GoogleCodeExporter commented 9 years ago
Disallow: * is not valid syntax; it should be Disallow: /
The same applies to Allow. See
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&from=35237&rd=1 for samples.

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Mar 2013 at 7:38