opensangja / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Incorrectly parsing robots.txt #117

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl www.springfieldclinic.com

What is the expected output? What do you see instead?
Most pages should be crawled. Instead, errors like the following are produced:
Did not crawl page 
http://www.springfieldclinic.com/PatientCareServices/Specialties due to Page 
[http://www.springfieldclinic.com/PatientCareServices/Specialties] not crawled, 
[Disallowed by robots.txt file], set IsRespectRobotsDotText=false in config 
file if you would like to ignore robots.txt files.

What version of the product are you using? On what operating system?
Version 1.1.1 on Windows Server 2008R2

Please provide any additional information below.
http://www.frobee.com/robots-txt-check shows that the robots.txt file on the 
SpringfieldClinic.com site is correct and our bot has access to the above path.

Original issue reported on code.google.com by Ricow...@gmail.com on 26 Sep 2013 at 11:03

GoogleCodeExporter commented 9 years ago
I was unable to reproduce your error. I ran the Abot.Demo console with only the 
following config changes in Abot.Demo/App.config...

-Did NOT change isRespectRobotsDotTextEnabled="false"
-Changed maxPagesToCrawl="10" to 50
-Changed minCrawlDelayPerDomainMilliSeconds="1000" to 0

And if you look at the attached log file, you can see that the page was crawled 
without issue. The robots.txt logic was ignored entirely, as expected, since 
isRespectRobotsDotTextEnabled was left at false.

[INFO ] - Page crawl complete, Status:[200] 
Url:[http://www.springfieldclinic.com/PatientCareServices/Specialties] 
Parent:[http://www.springfieldclinic.com/] - [Abot.Crawler.WebCrawler]

Original comment by sjdir...@gmail.com on 27 Sep 2013 at 12:35

Attachments:

GoogleCodeExporter commented 9 years ago
This report is about robots.txt being incorrectly parsed, so 
isRespectRobotsDotTextEnabled needs to be set to "true" to replicate the issue. 
 Sorry, I thought this was obvious or I would have specified it in the report.

We *cannot* ignore the robots.txt file, because we do not want to crawl folders 
and files that it disallows.

Original comment by Ricow...@gmail.com on 27 Sep 2013 at 12:42

GoogleCodeExporter commented 9 years ago
Even with isRespectRobotsDotTextEnabled set to true, I was still able to crawl 
the site you mentioned above, including the link you were being blocked on. 
See the attached log file.

However, I did notice that the current IRobotsDotText implementation disallowed 
all URLs that are external to the root URL. I just added/fixed this bug in 
another branch:

https://code.google.com/p/abot/issues/detail?id=118
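
To make that symptom concrete, here is a purely hypothetical C# sketch of the 
kind of host comparison that could produce it. This is not Abot's actual 
IRobotsDotText code, and the class and method names are invented for 
illustration only:

using System;

class ExternalUrlDisallowSketch
{
    // Hypothetical check: if robots handling treats every URL whose host differs
    // from the root URL's host as disallowed, external links get reported as
    // "Disallowed by robots.txt file" no matter what robots.txt actually says.
    static bool IsUrlAllowed(Uri pageUri, Uri rootUri)
    {
        return string.Equals(pageUri.Host, rootUri.Host, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        var root = new Uri("http://www.springfieldclinic.com/");

        // Same host as the root: allowed (prints True).
        Console.WriteLine(IsUrlAllowed(
            new Uri("http://www.springfieldclinic.com/PatientCareServices/Specialties"), root));

        // External host: reported as disallowed (prints False), even if that
        // site's own robots.txt would have permitted the crawl.
        Console.WriteLine(IsUrlAllowed(new Uri("http://www.example.com/page"), root));
    }
}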

Can you try to do the following?
-Use a fresh checkout of v1.1.1 source code.
-Make the following changes in Abot/Abot.Demo/App.config (a sketch of the edited sections follows these steps)
-Change isRespectRobotsDotTextEnabled="false" to true
-Change maxPagesToCrawl="10" to 50
-Change minCrawlDelayPerDomainMilliSeconds="1000" to 0
-Run the demo with Ctrl+F5 and enter http://www. 
-Check the log file at Abot.Demo/log.txt for an entry like the following...
[INFO ] - Page crawl complete, Status:[200] 
Url:[http://www.springfieldclinic.com/PatientCareServices/Specialties] 
Parent:[http://www.springfieldclinic.com/] - [Abot.Crawler.WebCrawler]
-Attach that log file if you are still seeing the issue
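
For reference, here is a rough sketch of what those App.config edits might end 
up looking like. The attribute names (isRespectRobotsDotTextEnabled, 
maxPagesToCrawl, minCrawlDelayPerDomainMilliSeconds) come from this thread; the 
surrounding <abot>/<crawlBehavior>/<politeness> element names and the omission 
of every other attribute are assumptions, so compare against the file that 
ships with the demo rather than copying this verbatim:

<abot>
  <!-- Sketch only: the other attributes in the shipped config are omitted
       here and should be left unchanged. -->
  <crawlBehavior maxPagesToCrawl="50" />
  <politeness isRespectRobotsDotTextEnabled="true"
              minCrawlDelayPerDomainMilliSeconds="0" />
</abot>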

Original comment by sjdir...@gmail.com on 27 Sep 2013 at 1:18

Attachments:

GoogleCodeExporter commented 9 years ago
That seems to have fixed our issue, thank you!

Original comment by Ricow...@gmail.com on 27 Sep 2013 at 2:55

GoogleCodeExporter commented 9 years ago
Are you saying the fix for https://code.google.com/p/abot/issues/detail?id=118 
fixed it? Or are you saying that when you did a fresh checkout of v1.1.1 (which 
does not have that fix) you were unable to repro?

Original comment by sjdir...@gmail.com on 27 Sep 2013 at 4:41

GoogleCodeExporter commented 9 years ago
A new compile from a fresh checkout of the 1.1.1 code base fixed the error.  
Using the pre-compiled dll of v1.1.1 caused the issue to show up again.  The 
problem appears to be only in the pre-compiled dll.

Incidentally, I was unable to load the 1.1.1 source code into either VS 2012 or 
2010 without Visual Studio crashing, until I removed the unit test projects 
from the solution file.  I don't know how that could be related, but either way 
I'm going to use my compiled version of 1.1.1 and consider the issue closed.

Original comment by Ricow...@gmail.com on 30 Sep 2013 at 12:23

GoogleCodeExporter commented 9 years ago
FYI, I just added a discovery about the solution crashes to
https://groups.google.com/forum/#!topic/abot-web-crawler/gEy6y3_GT6U

Original comment by sjdir...@gmail.com on 13 Nov 2013 at 10:33

GoogleCodeExporter commented 9 years ago
We have encountered a new site that cannot be crawled when 
isRespectRobotsDotTextEnabled is set to "true", despite the fact that their 
robots.txt does not disallow the root folder of their site.  From looking at 
the abot source, this appears to be a problem within the robots.dll itself.  Is 
there a newer version of this dll available?

The site in question is http://www.kendrick.org

When Abot.Demo is run with the App.config changes stated above, the log file 
simply states:
Crawl complete for site [http://www.hendricks.org/]: [00:00:02.5093376] - 
[Abot.Crawler.WebCrawler]

The console window, however, states this:
Disallowed: http://www.hendricks.org/ - Page [http://www.hendricks.org/] not 
crawled, [Disallowed by robots.txt file], set IsRespectRobotsDotText=false in 
config file if you would like to ignore robots.txt files.
[2013-12-16 11:31:11,895] [1] [INFO ] - Crawl complete for site 
[http://www.hendricks.org/]: [00:00:02.5093376] - [Abot.Crawler.WebCrawler]

Original comment by Ricow...@gmail.com on 16 Dec 2013 at 6:40

GoogleCodeExporter commented 9 years ago
Hi, I'm not having any issue crawling http://www.kendrick.org/. I also get a 
404 when looking for their robots.txt file at http://www.kendrick.org/robots.txt. 

I attached my log file and App.config file for the demo project. Can you still 
repro the problem? 

Original comment by sjdir...@gmail.com on 16 Dec 2013 at 10:12

Attachments:

GoogleCodeExporter commented 9 years ago
Sorry, I typoed the first instance of the URL. It's http://www.hendricks.org/

Original comment by Ricow...@gmail.com on 16 Dec 2013 at 10:30

GoogleCodeExporter commented 9 years ago
At a quick glance, it looks like the site's robots.txt file uses a query-string 
format in its Disallow rules, which I do not believe is supported by the 
robots.txt spec. If you believe it should be, please open an issue at 
https://github.com/sjdirect/nrobots. Also, if you feel like contributing back, 
please feel free to fix the issue and submit a pull request.
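
For illustration only (the actual hendricks.org robots.txt is not quoted 
anywhere in this thread, and these paths are made up), a query-string style 
rule of the kind described above might look something like the following; 
wildcard/query-string matching is an extension some crawlers honor rather than 
something the original robots.txt spec requires:

User-agent: *
Disallow: /*?sessionid=
Disallow: /search?q=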

Thanks!!

Original comment by sjdir...@gmail.com on 17 Dec 2013 at 7:55

GoogleCodeExporter commented 9 years ago
Excellent, thank you. I will alert the client about their non-standard 
robots.txt entry and follow up in the nrobots repository.

Original comment by Ricow...@gmail.com on 17 Dec 2013 at 2:17

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 30 Dec 2013 at 3:13