avelican opened this issue 7 months ago
Update: added `download_delay = 4` to `mirror_spider.py`. This seems to make auto_throttle unnecessary (?), so I disabled it.
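For reference, a minimal sketch of how that per-spider delay can be set (the spider class name is assumed; `download_delay` as a spider attribute is standard Scrapy and overrides the global `DOWNLOAD_DELAY` setting):

```python
# mirror_spider.py (sketch; MirrorSpider is an assumed class name)
import scrapy


class MirrorSpider(scrapy.Spider):
    name = "mirror"

    # Seconds to wait between requests. With CONCURRENT_REQUESTS = 1
    # this targets roughly 15 requests/minute.
    download_delay = 4

    def parse(self, response):
        # ... existing page-saving logic ...
        pass
```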
But now I get very low pages/minute: avg 6/min and sometimes only 2-3 per minute! (Edit: just got 1/min...)
I assume this is because Wayback is sometimes very slow? I haven't measured it. Still, 30 s/req seems very fishy.
Edit: the total request count is about double the number of saved pages, which brings the average to >12/min, close to the ideal of 15. Not sure where all those extra requests are coming from; I will need to run this in debug mode.
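One way to track down the extra requests, sketched below with standard Scrapy stats keys (the spider name "mirror" is an assumption), is to raise the log level and compare the downloader counters against the number of scraped items after a run:

```python
# debug_run.py (sketch): run the crawl with verbose logging and dump the
# counters that usually explain "extra" requests (retries, redirects, robots.txt).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set("LOG_LEVEL", "DEBUG")          # log every request, retry and redirect

process = CrawlerProcess(settings)
crawler = process.create_crawler("mirror")  # assumed spider name
process.crawl(crawler)
process.start()                             # blocks until the crawl finishes

stats = crawler.stats.get_stats()
print("requests sent:", stats.get("downloader/request_count"))
print("items scraped:", stats.get("item_scraped_count"))
print("retries:      ", stats.get("retry/count"))
print("301 redirects:", stats.get("downloader/response_status_count/301"))
```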
I am running into the same issue. I need to scrape 150k pages to rebuild a website, but I am constantly hitting the rate limit on the Internet Archive's servers.
Many moons ago, the Internet Archive added some rate limiting that also seems to affect the Wayback Machine.
(See the discussion on a similar project here: https://github.com/buren/wayback_archiver/issues/32)
The scraper scrapes too fast and gets IP-banned by the Wayback Machine for 5 minutes.
As a result, all the remaining URLs in the pipeline fail repeatedly, so Scrapy gives up on all of them and says "we're done!"
I see two issues here: the scraper requests pages faster than Wayback allows (which triggers the ban), and once it is banned, the failed URLs are never retried, so the crawl finishes incomplete. A retry-settings sketch for the second point is below.
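For the retry side, a minimal settings sketch (assuming the project does not already customise Scrapy's built-in RetryMiddleware; the ban status codes are guesses and should be confirmed from the logs):

```python
# Retry-related settings (sketch). Keeps temporarily banned URLs in the queue
# instead of letting Scrapy discard them after two failures.
RETRY_ENABLED = True
RETRY_TIMES = 10                  # default is 2; a 5-minute ban needs more headroom
RETRY_HTTP_CODES = [429, 403, 500, 502, 503, 504, 522, 524, 408]  # 429/403 are assumed ban codes
```

Retries alone won't wait out a full 5-minute ban unless the per-request delay is long enough, so this complements rather than replaces the throttling fix in the TODO below.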
TODO:
Seems to be using Scrapy's AutoThrottle, so the fix may be as simple as updating the start delay and default concurrency in `__main__.py`.
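A sketch of what those settings could look like (standard Scrapy setting names; the values are guesses aimed at roughly 15 requests/minute, not the project's actual defaults):

```python
# __main__.py (sketch): throttle-related settings aimed at ~15 requests/minute.
settings = {
    "CONCURRENT_REQUESTS": 1,               # one request in flight at a time
    "DOWNLOAD_DELAY": 4,                    # 60 s / 15 req = 4 s between requests
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 4,          # initial delay AutoThrottle starts from
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0, # average concurrency to aim for
    "AUTOTHROTTLE_MAX_DELAY": 60,           # back off hard when responses get slow
}
```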
This doesn't seem to be sufficient to limit the rate to 15/minute, though: I am mostly getting >15/min with these settings (and as high as 29 sometimes). But Wayback did not complain, so it seems the real limit is higher than that.
More work needed. May report back later.
Edit: the AutoThrottle docs say `AUTOTHROTTLE_TARGET_CONCURRENCY` represents the average, not the maximum, which means that if Wayback has a hard limit of X req/sec, setting X as the target would by definition lead to exceeding that limit about 50% of the time.
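Sketching the arithmetic behind that (the 15 req/min cap is an assumption, not a documented Wayback limit): only a fixed worst-case delay guarantees a hard cap, since an average-based target will be exceeded roughly half the time.

```python
# Back-of-the-envelope: delay needed for a hard requests-per-minute cap.
hard_cap_per_minute = 15
min_delay = 60 / hard_cap_per_minute   # 4.0 s between requests

# Scrapy's RANDOMIZE_DOWNLOAD_DELAY (on by default) waits 0.5x-1.5x DOWNLOAD_DELAY,
# so the worst case (0.5x) is what must respect the cap:
download_delay = 2 * min_delay         # 8 s nominal -> never below 4 s between requests
print(min_delay, download_delay)
```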