I was unable to reproduce your error. I ran the Abot.Demo console with only the
following config changes in Abot.Demo/App.config (sketched below)...
-Did NOT change isRespectRobotsDotTextEnabled="false"
-Changed maxPagesToCrawl="10" to 50
-Changed minCrawlDelayPerDomainMilliSeconds="1000" to 0
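For context, those three settings live in the abot section of Abot.Demo/App.config. Below is
a trimmed sketch, assuming the Abot 1.1.x section layout (maxPagesToCrawl under crawlBehavior,
the other two under politeness); every attribute not shown keeps the demo defaults.

    <abot>
      <!-- trimmed sketch: attributes not shown keep the Abot.Demo defaults -->
      <crawlBehavior maxPagesToCrawl="50" />
      <politeness isRespectRobotsDotTextEnabled="false"
                  minCrawlDelayPerDomainMilliSeconds="0" />
    </abot>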
If you look at the attached log file, you can see that the page in question was
crawled without issue. Robots logic was completely ignored, as expected, since
isRespectRobotsDotTextEnabled was left set to false.
[INFO ] - Page crawl complete, Status:[200]
Url:[http://www.springfieldclinic.com/PatientCareServices/Specialties]
Parent:[http://www.springfieldclinic.com/] - [Abot.Crawler.WebCrawler]
Original comment by sjdir...@gmail.com
on 27 Sep 2013 at 12:35
Attachments:
This report is about robots.txt being incorrectly parsed, so
isRespectRobotsDotTextEnabled needs to be set to "true" to replicate the issue.
Sorry, I thought this was obvious or I would have specified it in the report.
We *cannot* ignore the robots.txt file, because we do not want to crawl folders
and files that it disallows.
Original comment by Ricow...@gmail.com
on 27 Sep 2013 at 12:42
Even with isRespectRobotsDotTextEnabled set to true, I was still able to crawl
the site you mentioned above, including the link you were being blocked on. See
the attached log file.
However, I did notice that the current IRobotsDotText implementation disallowed
all URLs that are external to the root URL. I just logged and fixed that bug in
another branch:
https://code.google.com/p/abot/issues/detail?id=118
Can you try to do the following?
-Use a fresh checkout of v1.1.1 source code.
-Make the following changes in the Abot/Abot.Demo/App.config
-Change isRespectRobotsDotTextEnabled="false" to true
-Change maxPagesToCrawl="10" to 50
-Change minCrawlDelayPerDomainMilliSeconds="1000" to 0
-Run the demo with Ctrl+F5 and enter http://www.
-Check the log file at Abot.Demo/log.txt for an entry like the following...
[INFO ] - Page crawl complete, Status:[200]
Url:[http://www.springfieldclinic.com/PatientCareServices/Specialties]
Parent:[http://www.springfieldclinic.com/] - [Abot.Crawler.WebCrawler]
-Attach that log file if you are still seeing the issue
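If editing App.config is awkward, the same repro can be wired up in a few lines of C#. This
is only a minimal sketch, assuming the Abot 1.1.x API (CrawlConfiguration in Abot.Poco,
PoliteWebCrawler in Abot.Crawler, and the synchronous PageCrawlCompleted/PageCrawlDisallowed
events); the single-argument PoliteWebCrawler constructor and the exact property names are
assumptions, so adjust to whatever your build exposes.

    using System;
    using Abot.Crawler;
    using Abot.Poco;

    class RobotsRepro
    {
        static void Main()
        {
            // Mirror the App.config changes listed above (property names assumed
            // to match the config attribute names).
            var config = new CrawlConfiguration
            {
                IsRespectRobotsDotTextEnabled = true,
                MaxPagesToCrawl = 50,
                MinCrawlDelayPerDomainMilliSeconds = 0
            };

            // Assumed overload; older builds may require the longer constructor
            // that also takes the pluggable components (pass null for defaults).
            var crawler = new PoliteWebCrawler(config);

            // Print every completed page so the Specialties url is easy to spot.
            crawler.PageCrawlCompleted += (sender, e) =>
                Console.WriteLine("Crawled [{0}] status [{1}]",
                    e.CrawledPage.Uri,
                    e.CrawledPage.HttpWebResponse == null
                        ? "?"
                        : ((int)e.CrawledPage.HttpWebResponse.StatusCode).ToString());

            // Print pages that were blocked, e.g. "Disallowed by robots.txt file".
            crawler.PageCrawlDisallowed += (sender, e) =>
                Console.WriteLine("Disallowed [{0}]: {1}",
                    e.PageToCrawl.Uri, e.DisallowedReason);

            crawler.Crawl(new Uri("http://www.springfieldclinic.com/"));
        }
    }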
Original comment by sjdir...@gmail.com
on 27 Sep 2013 at 1:18
Attachments:
That seems to have fixed our issue, thank you!
Original comment by Ricow...@gmail.com
on 27 Sep 2013 at 2:55
Are you saying that the fix tracked in
https://code.google.com/p/abot/issues/detail?id=118 fixed it? Or are you saying
that when you did a fresh checkout of v1.1.1 (which does not have that fix) you
were unable to repro?
Original comment by sjdir...@gmail.com
on 27 Sep 2013 at 4:41
A new build from a fresh checkout of the 1.1.1 code base fixed the error. Using
the pre-compiled DLL of v1.1.1 caused the issue to show up again, so the problem
appears to be confined to the pre-compiled DLL.
Incidentally, I was unable to load the 1.1.1 source code into either VS 2012 or
2010 without Visual Studio crashing until I removed the unit test projects from
the solution file. I don't know how that could be related, but either way I'm
going to use my own build of 1.1.1 and consider the issue closed.
Original comment by Ricow...@gmail.com
on 30 Sep 2013 at 12:23
FYI, I just added a finding about the solution crashes to
https://groups.google.com/forum/#!topic/abot-web-crawler/gEy6y3_GT6U
Original comment by sjdir...@gmail.com
on 13 Nov 2013 at 10:33
We have encountered a new site that cannot be crawled when
isRespectRobotsDotTextEnabled is set to "true", even though its robots.txt does
not disallow the root folder of the site. From looking at the Abot source, this
appears to be a problem within the robots.dll itself. Is there a newer version
of that DLL available?
The site in question is http://www.kendrick.org
When Abot.Demo is run with the App.config changes stated above, the log file
simply states:
Crawl complete for site [http://www.hendricks.org/]: [00:00:02.5093376] -
[Abot.Crawler.WebCrawler]
The console window, however, states this:
Disallowed: http://www.hendricks.org/ - Page [http://www.hendricks.org/] not
crawled, [Disallowed by robots.txt file], set IsRespectRobotsDotText=false in
config file if you would like to ignore robots.txt files.
[2013-12-16 11:31:11,895] [1] [INFO ] - Crawl complete for site
[http://www.hendricks.org/]: [00:00:02.5093376] - [Abot.Crawler.WebCrawler]
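To narrow down whether the block comes from the robots.txt parsing (the robots
DLL) rather than from the crawl loop itself, the robots handling can be probed
directly. A minimal sketch, assuming Abot 1.1.x exposes RobotsDotTextFinder,
PageRequester, and IRobotsDotText.IsUrlAllowed roughly as described in this
thread (class and method names may differ between versions):

    using System;
    using Abot.Core;
    using Abot.Poco;

    class RobotsProbe
    {
        static void Main()
        {
            var config = new CrawlConfiguration();

            // Assumed signatures: RobotsDotTextFinder(IPageRequester) and
            // IRobotsDotText.IsUrlAllowed(string url, string userAgentString).
            var finder = new RobotsDotTextFinder(new PageRequester(config));
            var robots = finder.Find(new Uri("http://www.hendricks.org/"));

            if (robots == null)
            {
                Console.WriteLine("No robots.txt found; nothing should be disallowed.");
                return;
            }

            // If this prints False for the root url, the parser (not the crawler)
            // is the component treating the whole site as disallowed.
            Console.WriteLine(robots.IsUrlAllowed(
                "http://www.hendricks.org/", config.UserAgentString));
        }
    }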
Original comment by Ricow...@gmail.com
on 16 Dec 2013 at 6:40
Hi, I'm not having any issue crawling http://www.kendrick.org/. I also get a
404 when looking for their robots.txt file at http://www.kendrick.org/robots.txt.
I attached my log file and App.config file for the demo project. Can you still
repro the problem?
Original comment by sjdir...@gmail.com
on 16 Dec 2013 at 10:12
Attachments:
Sorry, I typoed the first instance of the URL. It's http://www.hendricks.org/
Original comment by Ricow...@gmail.com
on 16 Dec 2013 at 10:30
At a quick glance, it looks like the site's robots.txt file is using a
querystring-style rule to disallow content, which I do not believe is supported
by the robots.txt spec. If you believe it should be, please add an issue to
https://github.com/sjdirect/nrobots. Also, if you feel like contributing back,
please feel free to fix the issue and submit a pull request.
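For readers following along, here is a hypothetical illustration of the kind of
entry being described (not the site's actual file): path-prefix Disallow rules
are part of the original robots.txt spec, while wildcard/querystring matching is
a later extension that a strict parser such as nrobots may not honor.

    # Hypothetical robots.txt, for illustration only
    User-agent: *
    # Path-prefix rule from the original spec -- handled by essentially every parser:
    Disallow: /private/
    # Querystring/wildcard-style rule -- an extension beyond the original spec,
    # so a strict parser may ignore or mishandle it:
    Disallow: /*?sessionid=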
Thanks!!
Original comment by sjdir...@gmail.com
on 17 Dec 2013 at 7:55
Excellent, thank you. I will alert the client about their non-standard
robots.txt entry and follow up in the nrobots GitHub repo.
Original comment by Ricow...@gmail.com
on 17 Dec 2013 at 2:17
Original comment by sjdir...@gmail.com
on 30 Dec 2013 at 3:13
Original issue reported on code.google.com by
Ricow...@gmail.com
on 26 Sep 2013 at 11:03