As an aside, they now seem to have reduced the number of parallel capture jobs from 8 to 6, so 429
errors galore.
curl -s -X GET -H 'Accept: application/json' -H 'Authorization: LOW S3accesskey:secret' https://web.archive.org/save/status/user
I'm not sure if this is a normal situation that the script needs to handle. Jason Scott has stated on Twitter,
Internet Archive is under attack. Someone is posting tens of thousands of posts, comments and other vectors and is crushing the system. We're working on it, sorry for the inconvenience.
While it would be possible to wait for 3 hours for all of the requests to be processed, I don't think the script would be particularly usable in this situation anyway.
Currently the script limits the number of child processes that can be spawned, due both to inherent device limitations and to the server-side limits. For the script to make sense under these conditions, either you would be limited to about 60 captures every 4 hours or something ridiculous like that, or you would have to store job IDs and their corresponding wait times in text files so that the API can be checked hours later, when the job actually starts.
It would be theoretically possible to implement a solution where individual job IDs are stored in separate text files, but I would rather wait and see what happens, since I would really hope that this situation is temporary. If everyone's already waiting for 4 hours, making it easier to submit even more URLs to the backlog would just make the situation worse.
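For illustration, the text-file approach would look something like this; the file name, field layout and function names are hypothetical, not anything that exists in the script:

```bash
# Hypothetical sketch only; nothing here is taken from spn.sh itself.
pending_file="pending-jobs.txt"

# When a capture is delayed, record the job ID and the earliest time to check it.
record_delayed_job() {
  local job_id="$1" delay_seconds="$2"
  echo "$job_id $(( $(date +%s) + delay_seconds ))" >> "$pending_file"
}

# Hours later, check only the jobs whose wait time has elapsed.
# (Authentication headers are omitted for brevity.)
check_pending_jobs() {
  local now
  now=$(date +%s)
  while read -r job_id check_after; do
    if (( now >= check_after )); then
      curl -s "https://web.archive.org/save/status/$job_id"
      echo
    fi
  done < "$pending_file"
}
```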
I've checked just now and the situation appears to have gone back to normal for the time being, since it didn't tell me the service was overloaded and the captures went through immediately.
@overcast07 I still see a similar message when I try to capture frequently-visited websites, like GitHub and the Microsoft doc pages. Sometimes the capture gets delayed by just 10 minutes, sometimes more. I'd say that if you get a message saying the capture is delayed, just don't retry; it will only make things worse. Could you do that?
One of the problems with SPN, especially when using it without authentication, is that the script has to get information by pattern matching the HTML source code of the response. Unfortunately, while there was previously support for handling delays, the format of the error message was changed without me realizing, and the matching doesn't work any more. So, before anything else gets fixed, I need to update that part of the script so that it correctly processes the amount of time to wait.
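As a rough sketch of what that pattern matching involves (this is not the actual code in the script, and the exact request and pattern may differ):

```bash
# Rough sketch, not the actual spn.sh code: look for the delay message in
# the HTML that an unauthenticated capture request returns.
html=$(curl -s "https://web.archive.org/save/https://example.com/")
delay_message=$(grep -o 'The capture will start in[^<]*' <<< "$html" | head -n 1)
if [[ -n "$delay_message" ]]; then
  echo "Server reported a delay: $delay_message"
fi
```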
Furthermore, because the capture rate is now limited server-side, it no longer makes sense for the -p parameter to default to a low value like 8, since if the website being archived isn't extremely fragile then what the server does will be enough to avoid issues. That part of the script will also have to be changed.
It can be done; it's not the same as the overload issue. There are actually 3 different issues.
I know it can be done, but I need to know if the format of the message is similar to the format of the other messages, which are
The capture will start in ~[n] seconds because we are doing too many captures of [domain] right now. You may close your browser window and the page will still be saved.
The capture will start in ~[m] hours, [n] minutes because our service is currently overloaded. You may close your browser window and the page will still be saved.
Matching for "capture will start in ..." works for both of these cases, but I haven't been able to get the other type of error message yet. I'm adding a sleep
command so that the script actually waits for however long the delay is before starting the captures, which it previously didn't do.
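In outline, the delay handling would be something along these lines (a sketch, not the final code):

```bash
# Sketch of converting the "~[m] hours, [n] minutes" / "~[n] seconds"
# wording into a number of seconds; the actual implementation may differ.
delay_to_seconds() {
  local message="$1" hours=0 minutes=0 seconds=0
  [[ "$message" =~ ([0-9]+)\ hour ]] && hours="${BASH_REMATCH[1]}"
  [[ "$message" =~ ([0-9]+)\ minute ]] && minutes="${BASH_REMATCH[1]}"
  [[ "$message" =~ ([0-9]+)\ second ]] && seconds="${BASH_REMATCH[1]}"
  echo $(( hours * 3600 + minutes * 60 + seconds ))
}

wait_time=$(delay_to_seconds "The capture will start in ~3 hours, 40 minutes because our service is currently overloaded.")
sleep "$wait_time"
```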
I've just managed to get this variant. I wonder how this happened lol
The capture will start in ~ because our service is currently overloaded. You may close your browser window and the page will still be saved.
From the https://web.archive.org/save page, as an anonymous user, I got this one.
Sorry
You have already reached the limit of active Save Page Now sessions. Please wait for a minute and then try again.
That's perfectly normal and not a problem, and has always been handled in the script. The script will just retry the submitted URL until it works. Even before the new limit was introduced, if you set the value of -p
high enough it would show up within a minute or two.
lock.txt exists specifically in order to handle this particular message. The file is created by one child process (by which I mean an instance of the capture() function), and other child processes don't attempt URL submissions once the file is created (a file is necessary because otherwise the child processes can't pass data to each other). Once the first child process submits its URL successfully, the file is removed, and only then are the remaining URLs retried. The script also stops creating new child processes until the file is removed.
Cool. Does it wait for the specified amount of time, or does it just retry in n seconds?
It waits for 2 seconds between requests. The request itself takes several seconds, so the effect is that it's retried about every 5 seconds, but it depends on the latency of your internet connection.
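In simplified form, the behaviour is roughly this (the real capture() function is more involved, so treat this only as an approximation):

```bash
# Approximate illustration of the lock.txt behaviour; not the actual code.
submit_with_lock() {
  local url="$1" owns_lock=0
  while true; do
    # Children that don't hold the lock wait while another child does.
    while [[ -f lock.txt && $owns_lock -eq 0 ]]; do sleep 2; done

    response=$(curl -s "https://web.archive.org/save/$url")
    if grep -q 'reached the limit of active Save Page Now sessions' <<< "$response"; then
      # Create the lock so other children stop submitting, then keep retrying.
      touch lock.txt
      owns_lock=1
      sleep 2
    else
      # Success: remove the lock so the remaining URLs can be retried.
      (( owns_lock )) && rm -f lock.txt
      break
    fi
  done
}
```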
You have already reached the limit of active Save Page Now sessions. Please wait for a minute and then try again.
This doesn't literally mean that you have to wait for exactly one minute; if they meant that they would probably have written "wait for 1 minute". I think the intended meaning is "wait for a short time", since (I think) how it works is that once your number of concurrent active processes is below a certain number, it will let you start submitting more URLs. So it can take as little as 5 or 10 seconds, but it can also take more than 1 minute sometimes (mainly if you're archiving lots of very large files).
I'm adding a sleep command so that the script actually waits for however long the delay is before starting the captures, which it previously didn't do.
This has been implemented now. I've been testing it myself and it appears to work correctly.
In the edge case that was originally reported, the new default behavior would be to submit 30 URLs, spend 3 hours and 40 minutes waiting, and then check the SPN status API for the results. I don't think this would be ideal behavior in that situation, but at least it would (probably) work. Hopefully that sort of thing remains extremely rare.
If you can keep a list of:
You could use it to avoid making the script hit the download limit.
The script does not need to handle the limit of captures in the last 60 seconds by itself, as far as I can tell. SPN does not prevent you from submitting new URLs until there are multiple URLs actually being crawled by IA, so it kind of sorts itself out. I'm probably not going to adjust the script's behavior in this context.
how many downloads are still being done
The script already keeps track of how many capture jobs are active. However, this is not done in the same way that SPN keeps track of them, because the script also counts URLs that haven't been submitted yet; URLs that have been submitted but which are not being crawled yet; and URLs that have finished being crawled that the script doesn't know about yet. As such, there is no reason for the script's default number of maximum parallel capture jobs to correspond to the server-side limit, because we can't calculate it directly.
We could use https://web.archive.org/save/status/user to directly get the server's count of the number of active processes, but there is a limit to the overall rate of web requests that the script can send to web.archive.org, and the script is usually operating close to that limit because of the need to check the status API frequently. If you go over about 1 request per second, the site serves you a 429 error and you can't send any web requests for a minute. The script can already tell if the limit has been reached by submitting new URLs, which it has to do anyway in order to work, so it would be counterproductive to also use a completely separate way of checking if the limit has been reached.
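Just to illustrate the 429 behaviour (this is not how the script itself is structured):

```bash
# Illustration only: send a request and back off for about a minute if
# web.archive.org answers with HTTP 429 (Too Many Requests).
spn_request() {
  local url="$1" status
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [[ "$status" == "429" ]]; then
    echo "Rate limited; waiting 60 seconds before the next request" >&2
    sleep 60
  fi
  echo "$status"
}
```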
As far as I can tell, there is no time penalty associated with getting the "You have already reached the limit of active Save Page Now sessions" error message. It would not be meaningfully more efficient to switch to a different method, even if we didn't need to worry about getting 429 errors.
I hope this explanation makes sense. I'm not trying to avoid making changes to the script, but just trying to sort out the logic of whether or not it would actually be beneficial to make particular changes to the script.
how many downloads have been completed in the last 60 seconds
This isn't possible to calculate directly, because the script doesn't actually know when each capture job was completed. By default, if -p
is set to 30, then the script also waits 30 seconds in between checking the status API for each URL, so if it gets the response that a capture was completed, then it could have been at any time in the past 30 seconds. We can't increase the frequency of the polling of the status API because we need to avoid 429 errors.
In the final API response from the status API for each URL, we do get the time it took to complete the job (e.g. "duration_sec":12.8
) and the time that the main URL was captured (e.g. "timestamp":"20230226190000"
), but I don't think this would take into account how long it took for the first URL to load, so it wouldn't be quite exact either.
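If someone did want to estimate completion times from those two fields, it would look roughly like this (this assumes jq and GNU date are available, which the script itself doesn't require, and that $job_id holds a job ID returned by SPN):

```bash
# Rough estimate only: reconstruct when a job finished from "timestamp"
# (capture time of the main URL) plus "duration_sec".
status_json=$(curl -s "https://web.archive.org/save/status/$job_id")
timestamp=$(jq -r '.timestamp' <<< "$status_json")    # e.g. 20230226190000
duration=$(jq -r '.duration_sec' <<< "$status_json")  # e.g. 12.8

# Convert the 14-digit timestamp to epoch seconds (GNU date syntax).
start_epoch=$(date -u -d "${timestamp:0:8} ${timestamp:8:2}:${timestamp:10:2}:${timestamp:12:2}" +%s)
finish_epoch=$(( start_epoch + ${duration%.*} ))
echo "Job finished at approximately $(date -u -d "@$finish_epoch")"
```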
The explanation makes sense overall. However, there are a couple of things I don't understand.
The API says when the capture was completed, so it would be possible to compute a time difference and figure out how many downloads were completed within the last n seconds. Add the number of downloads we are still waiting on, and you have the number needed to tell whether we are within the limit.
However, I don't understand this part
if -p is set to 30, then the script also waits 30 seconds in between checking the status API for each URL
If 30 downloads all start together, and they all complete within a minute, don't we hit the limit of 12 downloads per minute? I'm surprised it works.
If 30 downloads all start together, and they all complete within a minute, don't we hit the limit of 12 downloads per minute?
The first ~4 are started together, but then after that the script waits 2.5 seconds between starting captures. It's specifically 2.5 seconds because of the need to avoid 429 errors. I haven't changed this behavior recently.
The limit of 12 captures per minute doesn't seem to be something that we need to be concerned about in the context of the script's code, because the server seems to take care of it. The delays that are introduced appear to effectively reduce the rate to 12 captures per minute even if you submit URLs more frequently than that.
The messages are actually a little bit misleading. If it says
The capture will start in ~1 minute because our service is currently overloaded.
this actually just means that you've hit the limit of 12 or 4 URLs per minute. In other words, it's saying you're the one overloading the service. Of course, any individual user can't make much of a dent in their overall traffic, since there are millions of URLs being captured per day, so it's not entirely true.
The difference in the length of the delays appears to be solely dependent on whether or not you're authenticated, and thereby whether your limit is 12 or 4.
Thanks for the explanation.
If needed, the field status_ext does have a clearer error code; see pages 5 and 6 of the API doc:
error:user-session-limit User has reached the limit of concurrent active capture sessions.
By the way, looking at the error codes, some are not recoverable, for example:
error:invalid-url-syntax Target URL syntax is not valid.
error:invalid-host-resolution Couldn’t resolve the target host.
You might want to treat them separately.
These error codes can only be produced when initially submitting the URLs, and you can't get them at all if you aren't authenticated, because they aren't included in the HTML. However, whether or not you're authenticated, you get the plain-English error messages, so those strings can be checked for in both cases.
The script only checks error codes when dealing with the status API endpoint, which always returns JSON. You can't get these three errors from that API endpoint, since if you're checking it in the first place then the relevant URL has already been submitted successfully.
The script only needs to check whether the error is one that is known to be recoverable. If the error isn't one of them then the capture job is immediately considered to be failed, so it doesn't need to have a list of all of the non-recoverable error strings.
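In other words, the check boils down to something like this (the list of recoverable codes here is only an example, not the script's actual list, and jq is used here purely for brevity):

```bash
# Sketch of "only retry known-recoverable errors"; anything else fails.
is_recoverable_error() {
  case "$1" in
    error:user-session-limit)  # ...plus any other codes known to be transient
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# $job_id is a job ID previously returned by SPN.
status_json=$(curl -s "https://web.archive.org/save/status/$job_id")
status_ext=$(jq -r '.status_ext // empty' <<< "$status_json")
if [[ -n "$status_ext" ]] && ! is_recoverable_error "$status_ext"; then
  echo "Capture job failed with non-recoverable error: $status_ext"
fi
```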
error:invalid-url-syntax
It's pretty much impossible to get this error from the API anyway. If you do try to archive invalid URLs for some reason, then the server gives you a 400 Bad Request
error instead of a normal JSON or HTML response.
I have a text file containing 1000s of URLs.
I begin archiving them using spn.sh -p 10 -nq -a 'S3KEY:S3PASS' -d 'skip_first_archive=1&capture_outlinks=1&capture_screenshot=1&if_not_archived_within=10000d&delay_wb_availability=1' list.txt
For the first few URLs, I get
The capture will start in ~4 seconds
The capture will start in ~5 seconds
The capture will start in ~6 seconds
And then out of nowhere,
The capture will start in ~10 hours 40 minutes
This capture is now going to eat one -p job slot for the next ~10 hours.
And then back again to
The capture will start in ~5 seconds
The capture will start in ~6 seconds
As spn.sh works its way down the list, more and more of these super long capture jobs slowly end up saturating all 10 available -p slots, eventually stalling the archival process completely. All 10 parallel jobs are now sleeping.
Proposal: could you please add an additional flag that lets the user have the job fail immediately (without any retries) if the "capture will start in [time]" message returned by the Wayback Machine exceeds a certain time specified by the user in Nh Nm Ns format?
I would rather spn.sh skip URLs on the list and move on to the rest of the list, than sleep for 10 hours. I could get back to the links I didn't capture later.
Thanks!
Do you mean that the delayed captures should be treated as invalid or as failed?
If the capture is going to start in 10 hours then it will still start eventually, so to me it would make sense to mark them as failed, so that the script eventually ends up recording the capture without too much disruption to other captures.
I think it would make sense for the default maximum wait time to be an hour, under the assumption that if the amount of time is less than an hour then there would probably be a good reason for it. To be consistent with the other parts of the script the time would probably be stored in seconds.
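As a rough sketch of what that could look like (the variable names, the default, and the file name below are all hypothetical; none of this exists in spn.sh yet):

```bash
# Hypothetical sketch of a maximum-wait option.
max_wait=3600   # default maximum wait in seconds (one hour)

handle_delay() {
  local url="$1" delay_seconds="$2"
  if (( delay_seconds > max_wait )); then
    # Treat the capture as invalid and move on to the next URL.
    echo "$url" >> invalid-delayed.txt
    return 1
  fi
  sleep "$delay_seconds"
}
```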
I have faced this delay several times when manually requesting a capture from their UI. This often happens when snapshotting GitHub. It also seems like the pending capture still counts towards your max concurrent capture limit. If you want to implement this, please do not make it a default.
If I do,
spn.sh -fail_completely_and_drop_the_capture_job_altogether_if_capture_will_start_in_delay_message_exceeds '2h' -nq -a 'S3KEY:S3PASS' -d 'skip_first_archive=1&capture_outlinks=1&capture_screenshot=1&if_not_archived_within=10000d&delay_wb_availability=1' hxxps://example.com/URL
and if I hypothetically get
The capture will start in ~2 hours, 0 minutes 1 second because our service is currently overloaded. You may close your browser window and the page will still be saved.
then spn.sh fails and drops the capture job immediately, because 2h 0m 1s exceeds my specified time of 2h.
spn.sh does not track the job.
spn.sh does not sleep and wait.
spn.sh does not retry even once.
It just fails completely.
For that particular URL.
spn.sh does not even need to record the attempted capture. I'll later check if the URL capture was successful using CDX.
If however, I hypothetically instead get the message
The capture will start in ~1 hours, 59 minutes 59 seconds because our service is currently overloaded. You may close your browser window and the page will still be saved.
then continue to track the job in the background, wait for the snapshot link, and do everything as normal, since 1h 59m 59s is below my specified time of 2h.
I don't want it to be the default either @AgostinoSturaro . I want it to be an "opt-in" flag that a user manually specifies.
Hopefully a flag with a better name than -fail_completely_and_drop_the_capture_job_altogether_if_capture_will_start_in_delay_message_exceeds
As I mentioned in the other thread - I cancel the job, close the terminal, open a new terminal, archive literally the same link after a minute or so and it archives normally in under ~6-10 seconds.
I need this flag specifically to save links that are endangered. Because some URLs that I capture may not even exist in ~5-10 hours. So when the wayback machine finally gets around to actually capturing the URL, it will have attempted to capture a URL that's already 404ed.
I need this so that after spn.sh drops the capture job for that URL completely, and closes after reaching the end of my list of URLs, I'll deduplicate my list of URLs against my logged captures, and then I'll re-run spn.sh and redo whatever's still left to be captured on the list. Rinse and repeat until the list is done.
Hopefully this explanation is better. Let me know if you have any questions.
Thanks for your help @overcast07 . And for the script of course .
Do you mean that the delayed captures should be treated as invalid or as failed?
Treat as invalid if time in the "capture will start in" message exceeds user specified time in a new flag.
I don't know why I didn't use the word "invalid" in my previous comment, even though you asked a very simple question. Sorry for the long-windedness of my previous comment @overcast07 .
The wayback machine is currently overloaded.
Getting a
The capture will start in ~3 hours, 40 minutes because our service is currently overloaded. You may close your browser window and the page will still be saved.
message. So captures will not trigger until ~4 hours from the time of capture.
Unfortunately, spn.sh just keeps trying and pausing and trying and pausing for quite a long time (until Job Failed), waiting for a captured URL that won't come, without moving on to the next URLs on the list. Could you add a way for the script to not wait when this particular message is generated upon save by the Wayback Machine? So instead of saying Job Completed, it would say Wayback Machine overloaded. Job scheduled to be completed in ~3 hours, 42 minutes or something, while the script for that job sleeps in the background for the aforementioned time and moves on to the next URL? You probably know of a better way to resolve this error.
Thanks for the script @overcast07 . It's simply excellent!