jungjonghun / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawler ignores Crawl-delay from the host's robots.txt #58

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Find a website where robots.txt has something similar to
User-agent: *
Crawl-delay: 80
2. Run the crawler with a parser

What is the expected output? What do you see instead?
Expected: The crawler thread works on a different host or waits until allowed 
to fetch a page from the delayed host.
Instead: No delay is recognized or set to HostDirectives.

What version of the product are you using? On what operating system?
2.6.1, Fedora 14

Please provide any additional information below.
The only delay the crawler observes is the politeness delay set for the fetcher in the 
crawler4j.properties file.

Original issue reported on code.google.com by janne.pa...@documill.com on 20 Jul 2011 at 7:10

GoogleCodeExporter commented 9 years ago
I made a patch for this issue.

It parses the delay in the same manner as the other settings are parsed from 
robots.txt. HostDirectives stores the delay in milliseconds. HostDirectives also 
no longer stores the time of the previous access; instead it stores the access 
time it last handed out.
When asked for the next access time, it checks whether the current time is past 
the previously given access time plus the delay. If it is, it returns the current 
time and stores it as the new previously given access time. If not, it adds the 
delay to the previously given access time, stores that, and returns it.
RobotstxtServer then calculates how long WebCrawler should sleep and returns the 
value in milliseconds; if the value is greater than 0, WebCrawler sleeps for that 
long before fetching the next Page.
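In rough terms, the patched HostDirectives behaves like the sketch below. This is 
only an illustration of the description above; the field and method names are my 
own, not the exact code in the attached patch.

public class HostDirectives {

    // Crawl-delay from robots.txt, converted to milliseconds (0 = no delay).
    private long crawlDelayMillis = 0;

    // The access time most recently handed out, not the time of the last real fetch.
    private long lastGivenAccessTime = 0;

    public synchronized void setCrawlDelay(long seconds) {
        this.crawlDelayMillis = seconds * 1000;
    }

    // Returns the epoch-millisecond timestamp at which the caller may fetch
    // the next page from this host.
    public synchronized long getNextAccessTime() {
        long now = System.currentTimeMillis();
        if (now >= lastGivenAccessTime + crawlDelayMillis) {
            // Enough time has passed, so the caller may fetch right away.
            lastGivenAccessTime = now;
        } else {
            // Otherwise queue the caller one delay behind the last given slot.
            lastGivenAccessTime += crawlDelayMillis;
        }
        return lastGivenAccessTime;
    }
}

RobotstxtServer would then hand WebCrawler something like
Math.max(0, directives.getNextAccessTime() - System.currentTimeMillis()),
and WebCrawler sleeps that many milliseconds before fetching.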

This isn't the most elegant solution, as I'm not exactly sure where you wanted 
the call to be made from, but it works well.

The major remaining issue with this solution is performance. If multiple threads 
all try to access a host whose crawl delay is set to over a minute, they will all 
sit waiting for a long time instead of moving on to check URLs from other hosts.

One solution could be to keep a separate WorkQueues object for each host and 
cycle through them on each request (see the sketch below). Another could be to 
have the crawler cycle through its current list in the hope that it contains a 
link to a different host.
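A rough sketch of the first idea, purely hypothetical and not part of crawler4j: 
keep a queue of URLs per host and rotate through the hosts on each request, so 
that one host with a long Crawl-delay does not stall every thread.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class PerHostWorkQueues {

    private final Map<String, Queue<String>> queuesByHost = new HashMap<String, Queue<String>>();
    private final Deque<String> hostOrder = new ArrayDeque<String>();

    public synchronized void add(String host, String url) {
        Queue<String> queue = queuesByHost.get(host);
        if (queue == null) {
            queue = new ArrayDeque<String>();
            queuesByHost.put(host, queue);
            hostOrder.addLast(host);
        }
        queue.add(url);
    }

    // Returns the next URL, rotating to a different host on each call,
    // or null when every queue is empty.
    public synchronized String next() {
        for (int i = 0; i < hostOrder.size(); i++) {
            String host = hostOrder.pollFirst();
            hostOrder.addLast(host);           // rotate this host to the back
            String url = queuesByHost.get(host).poll();
            if (url != null) {
                return url;
            }
        }
        return null;
    }
}

A thread blocked by a long delay on one host could then pick up work for another 
host instead of sleeping.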

Original comment by janne.pa...@documill.com on 20 Jul 2011 at 10:09

Attachments:

GoogleCodeExporter commented 9 years ago
Ah, the previous version of HostDirectives.java I attached in the comment above 
was missing a line. Here's a fixed, properly working copy.

Original comment by janne.pa...@documill.com on 20 Jul 2011 at 12:23

Attachments:

GoogleCodeExporter commented 9 years ago
Is there any chance the Crawl-delay feature will be handled in the near future?

Original comment by marcing...@gmail.com on 14 Apr 2014 at 10:14

GoogleCodeExporter commented 9 years ago
https://code.google.com/r/marcingosk-crawler4j/source/list

Original comment by avrah...@gmail.com on 23 Sep 2014 at 1:59