overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

User rate limit changed to captures per minute #20

Closed AgostinoSturaro closed 1 year ago

AgostinoSturaro commented 1 year ago

Quote from the Save Page Now changelog

2023-01-22 The user rate limit mechanism changed from counting concurrent captures to limiting captures per minute. Anonymous users can do 4 captures per minute and authenticated users can do 12 captures per minute

It's "captures per minute", measured over the last 60 seconds; see page 11 of the API doc:

By “concurrent captures”, we mean captures performed in the last 60 sec.

However, it is now much easier to hit this limit. Can you take this into account? Thank you.
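One way a script could take the quoted limit into account is a client-side sliding-window throttle. The following is only a minimal sketch: `LIMIT` and `WINDOW` mirror the changelog figures, while the function names `record_capture` and `seconds_until_slot` are hypothetical and not part of the existing scripts.

```bash
#!/usr/bin/env bash
# Sketch of a client-side sliding-window throttle (hypothetical helper names).

LIMIT=4        # captures per window: 4 anonymous, 12 authenticated (per the changelog)
WINDOW=60      # window length in seconds
timestamps=()  # epoch seconds of submitted captures, in chronological order

# Remember that a capture was submitted at epoch time $1.
record_capture() {
    timestamps+=("$1")
}

# Print how many seconds to wait before another capture may be sent at epoch time $1.
seconds_until_slot() {
    local now=$1 t recent=()
    # Collect the submissions that still fall inside the window.
    for t in "${timestamps[@]}"; do
        if (( now - t < WINDOW )); then
            recent+=("$t")
        fi
    done
    local count=${#recent[@]}
    if (( count < LIMIT )); then
        echo 0
    else
        # Enough of the oldest entries must expire to bring the count below LIMIT.
        echo $(( recent[count - LIMIT] + WINDOW - now ))
    fi
}
```

A caller would run `sleep "$(seconds_until_slot "$(date +%s)")"` before each submission and `record_capture "$(date +%s)"` right after it. Whether the server counts from submission time or from capture start is not stated in the quoted text, so the timing here is an assumption.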

overcast07 commented 1 year ago

If this has actually been implemented (I didn't even notice until you pointed it out), I think it's being done by giving the user the "overloaded" status message and delaying the relevant captures so that (without authentication) about 4 to 5 captures are processed each minute, or about 270 captures per hour. You can definitely still make more than 4 requests per minute without authentication, and they just get delayed by the server automatically, although there is also still a limit on concurrent captures that can kick in.

I think it would make the most sense to take this into account by properly handling the edge case where you have to wait more than 10 minutes for the capture to start, i.e. #19.

AgostinoSturaro commented 1 year ago

The "overloaded" message is a bit different. If I recall correctly, the user-limit message says something about "you" having reached your limit of something. Then there's yet another message, about the archive having received too many requests for a specific website, like github.com.

So, it's 3 different things:

  1. overloaded in general
  2. too many requests about a website
  3. too many requests by the user

For the general overloading, there's an API, mentioned in the changelog

2022-04-05 New API endpoint http://web.archive.org/save/status/system to notify applications if SPN is overloaded.
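A script could poll that endpoint before submitting captures. The sketch below uses the URL from the changelog entry; the response format is an assumption (a JSON body containing `"status":"ok"` when the service is healthy), and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: check the SPN system status endpoint before submitting captures.

SPN_STATUS_URL="http://web.archive.org/save/status/system"

# Fetch the raw response body from the status endpoint.
fetch_spn_status() {
    curl --silent --show-error --max-time 30 "$SPN_STATUS_URL"
}

# Succeed if the body passed as $1 looks healthy.
# ASSUMPTION: a healthy response is JSON containing "status":"ok";
# the real format should be confirmed against the API doc.
status_looks_ok() {
    local re='"status"[[:space:]]*:[[:space:]]*"ok"'
    [[ $1 =~ $re ]]
}
```

For example, `status_looks_ok "$(fetch_spn_status)" || sleep 60` would pause for a minute whenever the response does not match the assumed healthy form.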

For the specific website, see page 9 of the API doc. I don't know the error code.

Artificial delays for multiple concurrent captures on the same host. When we run more than 20 concurrent captures on the same host, we introduce an artificial delay on subsequent captures to avoid overloading the target and blocking SPN2. The delay algorithm is: When concurrent_capture_number > 20 for the same host, delay concurrent_capture_number/5 sec. For example: if concurrent_capture_number = 50, delay a new capture by 50/5 = 10 sec.
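The quoted delay rule can be written out as a small function. This is a sketch of the formula only; `host_delay_seconds` is a hypothetical name, and the use of integer division for counts that are not multiples of 5 is an assumption, since the doc does not say how the result is rounded.

```bash
#!/usr/bin/env bash
# Sketch of the per-host artificial delay described in the API doc.

# Print the delay in seconds for a host that has $1 concurrent captures.
host_delay_seconds() {
    local n=$1
    if (( n > 20 )); then
        # Integer division; the rounding behaviour is an assumption.
        echo $(( n / 5 ))
    else
        echo 0
    fi
}

host_delay_seconds 50   # the doc's own example: 50/5 = 10
```

With exactly 20 concurrent captures the function prints 0, because the rule applies only when the count is greater than 20.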

For the capture limit, it's measured over the last 60 seconds, as stated on page 11 of the API doc

By “concurrent captures”, we mean captures performed in the last 60 sec.