sangaline / wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
http://sangaline.com/post/wayback-machine-scraper/
ISC License

Error 429 + Scraper gives up #19

Open avelican opened 7 months ago

avelican commented 7 months ago

Many moons ago, the Internet Archive added some rate limiting that also seems to affect the Wayback Machine.

( See discussion on similar project here https://github.com/buren/wayback_archiver/issues/32 )

The scraper makes requests too quickly and gets its IP banned by the Wayback Machine for about 5 minutes.

As a result, all the remaining URLs in the pipeline fail repeatedly, Scrapy gives up on each of them, and the crawl closes as if it had finished successfully:

...
2023-11-09 22:09:57 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://web.archive.org/cdx/search/cdx?url=www.example.com/blog/stuff&output=json&fl=timestamp,original,statuscode,digest> (failed 3 times): 429 Unknown Status
2023-11-09 22:09:57 [scrapy.core.engine] INFO: Closing spider (finished)
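The "failed 3 times" here is just Scrapy's default retry budget (RETRY_TIMES defaults to 2 retries after the first attempt). A minimal sketch of loosening it, assuming the settings dict in __main__.py is the right place to put it:

'RETRY_ENABLED': True,
'RETRY_TIMES': 10,  # default is 2 retries, i.e. the "failed 3 times" above
'RETRY_HTTP_CODES': [429, 500, 502, 503, 504, 522, 524, 408],  # keep 429 retryable

On its own this only retries more often; pairing it with the pause/backoff ideas below is what would actually help.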

I see two issues here:

  1. Add a global rate limit (I don't think the concurrency flag covers this?).
  1.b. If we get a 429, increase the delay? Ideally this should not occur, since the limit appears to be constant, although https://web.archive.org/429.html suggests the error can also occur randomly when Wayback is getting a lot of traffic from other people. A 429 also seems to mean the IP has been banned for 5 minutes, so we should just pause the scraper for that long (making any requests during the ban may extend it). See the middleware sketch after this list.
  2. (Possibly unnecessary if the previous points are handled.) Increase the retry limit from 3 to something much higher? Again, if we approach scraping with a backoff, retries would eventually succeed instead of exhausting the retry budget during the ban.
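A minimal sketch of the pause-on-429 idea, built on Scrapy's stock RetryMiddleware hooks; the class name and the 5-minute figure are assumptions, not something the Wayback Machine documents:

import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class WaybackPauseOn429Middleware(RetryMiddleware):
    # Hypothetical middleware; 5 minutes is the assumed length of the IP ban.
    PAUSE_SECONDS = 5 * 60

    def process_response(self, request, response, spider):
        if response.status == 429:
            # Stop scheduling new downloads, wait out the (assumed) ban window,
            # then resume and re-queue this request. time.sleep blocks the whole
            # crawl, which is exactly the point here.
            spider.crawler.engine.pause()
            time.sleep(self.PAUSE_SECONDS)
            spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

It would need to be enabled via DOWNLOADER_MIDDLEWARES (ideally in place of the default RetryMiddleware) to take effect.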

TODO:

  1. Find out exactly what the rate limit is: it may be 5 per minute or 15 per minute (a 12 s or 4 s delay, respectively). They seem to have changed it several times, and I'm not sure there are official numbers. https://archive.org/details/toomanyrequests_20191110 says it's 15; it only mentions submitting URLs, but it appears to cover retrievals too.
  2. Find out if this project already does rate limiting. Edit: sort of, but not entirely sufficient for this use case (e.g. no 5-minute backoff on 429, and AutoThrottle does not guarantee fewer than X requests per minute).

It seems to be using Scrapy's AutoThrottle, so the fix may be as simple as updating the start delay and default concurrency in __main__.py:

'AUTOTHROTTLE_START_DELAY': 4, # aiming for 15 per minute

and

parser.add_argument('-c', '--concurrency', default=1.0, help=(

This doesn't seem to be sufficient to limit to 15/minute though, as I am getting mostly >15/min with these settings (and as high as 29 sometimes). But Wayback did not complain, so it seems the limit is higher than that.

More work needed. May report back later.

Edit: the AutoThrottle docs say AUTOTHROTTLE_TARGET_CONCURRENCY represents an average, not a maximum. That means that if Wayback has a hard limit of X req/sec, setting X as the target would by definition exceed the limit roughly half the time.
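A sketch of settings that enforce a hard floor between requests instead of an average; the setting names are standard Scrapy, but placing them in __main__.py's settings dict and the 4-second value (the 15/minute guess above) are assumptions:

'AUTOTHROTTLE_ENABLED': False,      # AutoThrottle only targets an average
'DOWNLOAD_DELAY': 4,                # ~15 requests/minute at most
'RANDOMIZE_DOWNLOAD_DELAY': False,  # keep the delay fixed instead of 0.5x-1.5x
'CONCURRENT_REQUESTS': 1,           # one in-flight request so the delay is effectively global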

avelican commented 7 months ago

Update: added download_delay = 4 to mirror_spider.py. This seems to make AutoThrottle unnecessary (?), so I disabled it. But now I get very low pages/minute: an average of 6/min and sometimes only 2-3 per minute! (Edit: just got 1/min...) I assume this is because Wayback is sometimes very slow, but I haven't measured it. Still, 30 s/request seems very fishy.

Edit: the total request count is about double the number of saved pages, which brings the average to >12 requests/minute, close to the target of 15. Not sure where all those extra requests are coming from; will need to run this in debug mode.
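One low-effort way to see what the extra requests are, before a full debug run, is to dump the stats Scrapy already collects when the spider closes. A sketch, assuming this can be added to the spider class in mirror_spider.py:

def closed(self, reason):
    # Log downloader, retry, and item counters to see where the
    # extra requests come from (retries, redirects, etc.).
    stats = self.crawler.stats.get_stats()
    for key in sorted(stats):
        if key.startswith(('downloader/', 'retry/', 'item_scraped_count')):
            self.logger.info('%s: %s', key, stats[key])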

JamesEBall commented 7 months ago

I am running into the same issue. I need to scrape 150k pages to rebuild a website, but I constantly hit the rate limit on Archive's servers.