overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Is spn.sh broken at the moment? Did SPN API get changed in some way? #32

Open barkoder opened 10 months ago

barkoder commented 10 months ago

Some URLs like https://i.ytimg.com/vi/VIDEOID/maxresdefault.jpg get saved via spn.sh

Some others like https://www.youtube.com/watch?v=VIDEOID say the capture will start in 10 hours!!! or 15 hours!!! Because they're overloaded apparently.

But when I try the same URLs in a regular web browser via https://web.archive.org/save, the capture starts normally. The estimated start time is less than a minute.
So they're "overloaded" by 10-15 hours only for curl users.

I don't understand why the capture time is so long when it's done through curl.

Other times I keep getting a Request Failed error, but when I check the SPN user status using curl -s -H 'Accept: application/json' -H 'Authorization: LOW S3KEY:S3PASS' https://web.archive.org/save/status/user , the "daily_captures" count keeps ticking up for every Request Failed.
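The check described above can be scripted. This is only a minimal sketch, assuming the status response is a JSON object containing an integer "daily_captures" field (the field name is taken from the comment; the rest of the response layout is not confirmed here). The names json_int_field, spn_user_status, and SPN_AUTH are hypothetical and introduced purely for illustration.

```shell
#!/usr/bin/env bash

# Extract an integer field (e.g. "daily_captures") from a JSON string.
# Pure-bash regex sketch; a tool such as jq would be more robust.
json_int_field() {
    local field="$1" json="$2"
    local re="\"${field}\"[[:space:]]*:[[:space:]]*([0-9]+)"
    if [[ $json =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Fetch the user status body with the same curl call quoted in the comment.
# SPN_AUTH is an assumed variable holding "S3KEY:S3PASS".
spn_user_status() {
    curl -s -H 'Accept: application/json' \
         -H "Authorization: LOW ${SPN_AUTH}" \
         https://web.archive.org/save/status/user
}
```

Running `json_int_field daily_captures "$(spn_user_status)"` before and after a failing submission and comparing the two numbers would show whether a Request Failed still counts against the daily quota.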

I tried exporting the headers and cookies manually (-b cookiesfile.txt) and merging them into my .curlrc, and it seemed to work for a very brief moment (5-10 minutes). I got a Job submitted and Job completed for a few captures, but it stopped working later and went back to giving me Request Failed again.

What is going on? Is the Internet Archive de-prioritizing curl users? Or is something broken on my computer?

List of things I've tried.

Please help. I don't know what to do. The issue has been there for the past 2 days. curl version - 8.4.0

EDIT: Please confirm if you're also having this issue.

Thanks!

brandongalbraith commented 10 months ago

Some others like https://www.youtube.com/watch?v=VIDEOID say the capture will start in 10 hours!!! or 15 hours!!! Because they're overloaded apparently.

Have you tried https://github.com/bibanon/tubeup for multimedia artifacts?

barkoder commented 10 months ago

Have you tried https://github.com/bibanon/tubeup for multimedia artifacts?

I'm not archiving multimedia.

I'm only archiving the html page of a youtube video. I only want a record of a video having existed at some point on youtube.

Whether the Wayback Machine later chooses to archive the video itself (located on the googlevideo CDN) is up to them.

Also want to mention that's not the only example of captures taking 10-15 hours. It's happening on reddit threads as well.

brandongalbraith commented 10 months ago

@barkoder Thanks for that context. I believe Wayback does have per-site queues to prevent hitting sites too hard, which could be why you're seeing the latency you describe between submitting an archival request and when it is performed.

More information at https://archive.org/developers/tasks.html

barkoder commented 10 months ago

@brandongalbraith I am aware they have those queues. If that is the case, why then are the captures successful through the WebUI in under a minute?

This issue (a capture time of ~10-15 hours) occurs even if I'm only archiving a single URL through spn.sh with -p 1.

Also are you using spn.sh? Are you able to reproduce this issue?

brandongalbraith commented 10 months ago

I use https://github.com/palewire/savepagenow dockerized and my archive requests mostly immediately return the Wayback snapshot url lately (although within the last week I have seen a spike in unknown error events where my workers had to re-queue the tasks for re-attempt).

barkoder commented 10 months ago

Well that confirms it. spn.sh is indeed broken. I tried https://github.com/palewire/savepagenow and it's returning the snapshots.

Unfortunately spn.sh is significantly more powerful due to its easy job parallelization and lower memory usage, thanks to curl.

Still thanks for the alternative @brandongalbraith , even if only to validate that spn.sh is broken.

Help! Please! @overcast07 @otuva

barkoder commented 10 months ago

Okay I think I just figured it out. Using -d 'force_get=1' gets the capture time down to under a minute.

From the API documentation

force_get=1 - Force the use of a simple HTTP GET request to capture the target URL. By default SPN2 does a HTTP HEAD on the target URL to decide whether to use a headless browser or a simple HTTP GET request. force_get overrides this behavior.

This method is probably suboptimal for saving javascript-heavy pages.
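For anyone calling the API directly rather than through spn.sh, a request with this option might be assembled as follows. This is a sketch under assumptions: the `url` form field and the POST to https://web.archive.org/save reflect my reading of the SPN2 API, and spn_capture_args, SPN_ARGS, and SPN_AUTH are hypothetical names, not part of spn.sh.

```shell
#!/usr/bin/env bash

# Build the curl arguments for one capture request into the SPN_ARGS array.
# $1 = target URL, $2 = 1 to add force_get=1 (default 0).
# SPN_AUTH is an assumed variable holding "S3KEY:S3PASS".
spn_capture_args() {
    local url="$1" force_get="${2:-0}"
    SPN_ARGS=(
        -s
        -H 'Accept: application/json'
        -H "Authorization: LOW ${SPN_AUTH:-S3KEY:S3PASS}"
        --data-urlencode "url=${url}"   # curl sends a POST when data is given
    )
    if [[ $force_get == 1 ]]; then
        SPN_ARGS+=(-d 'force_get=1')    # skip the HEAD probe, use a plain GET
    fi
    SPN_ARGS+=(https://web.archive.org/save)
}
```

Usage would look like `spn_capture_args 'https://www.youtube.com/watch?v=VIDEOID' 1; curl "${SPN_ARGS[@]}"`. Keeping the arguments in an array avoids quoting problems with URLs that contain `?` or `&`.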

I think the wayback machine is having problems allocating headless browsers for capture jobs at the moment. Hence the ~10-15 hour delay. At least that's my guess.

But for now at least, I appear to be able to use spn.sh again. Let's see how long this lasts. I may update this ticket later.

overcast07 commented 10 months ago

If the Internet Archive decides to throttle requests, that's not really something that we can do much about, and it's probably for a good reason. It's also usually temporary; at time of writing the wait time has already decreased to around 6 hours.

However, in this case, the throttling is apparently applied only to POST requests which do not have the header Referer: https://web.archive.org/save. As a temporary measure, you could avoid the throttling by using -c "-H 'Referer: https://web.archive.org/save'", but I'm not sure whether it would be appropriate to hardcode this into the script.
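One way to keep the workaround opt-in instead of hardcoded is a small wrapper that adds the -c option only when the user asks for it. This is purely illustrative: spn_sh_opts, SPN_OPTS, and the SPN_USE_REFERER variable are hypothetical names, and urls.txt is a placeholder input file.

```shell
#!/usr/bin/env bash

# Build spn.sh options into the SPN_OPTS array.
# The Referer header is added only when SPN_USE_REFERER=1 is set by the user.
spn_sh_opts() {
    SPN_OPTS=(-p 1)   # one parallel job, as in the report above
    if [[ ${SPN_USE_REFERER:-0} == 1 ]]; then
        # -c passes extra arguments through to curl (quoted as in the comment)
        SPN_OPTS+=(-c "-H 'Referer: https://web.archive.org/save'")
    fi
}

# Example (not executed here):
#   SPN_USE_REFERER=1
#   spn_sh_opts
#   ./spn.sh "${SPN_OPTS[@]}" urls.txt
```

Because the header is only sent on request, the default behavior stays unchanged if the Internet Archive intends the throttling to apply to scripted clients.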

For us, other possible options include:

otuva commented 10 months ago

Tried a couple of videos. They all said ~25 seconds and completed in around 2 minutes each.

barkoder commented 9 months ago

Now those long capture duration messages are appearing seemingly randomly.

The capture will start in ~4 seconds
The capture will start in ~5 seconds
The capture will start in ~6 seconds
The capture will start in ~10 hours 40 minutes
The capture will start in ~5 seconds
The capture will start in ~6 seconds

It's weird. I cancel the job, archive the same link after a couple minutes and it archives in under ~6-10 seconds.
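The manual cancel-and-retry step could be automated by parsing the estimated delay out of the message. A rough sketch, assuming the message always follows the "The capture will start in ~N hours N minutes" or "~N seconds" wording shown above; capture_delay_seconds, capture_is_stalled, and the 300-second default threshold are hypothetical choices, not spn.sh behavior.

```shell
#!/usr/bin/env bash

# Convert a "The capture will start in ~..." message to a delay in seconds.
capture_delay_seconds() {
    local msg="$1" total=0
    local re_h='([0-9]+) hours?'
    local re_m='([0-9]+) minutes?'
    local re_s='([0-9]+) seconds?'
    # 10# forces base 10 so values like "08" are not read as octal.
    if [[ $msg =~ $re_h ]]; then total=$(( total + 10#${BASH_REMATCH[1]} * 3600 )); fi
    if [[ $msg =~ $re_m ]]; then total=$(( total + 10#${BASH_REMATCH[1]} * 60 )); fi
    if [[ $msg =~ $re_s ]]; then total=$(( total + 10#${BASH_REMATCH[1]} )); fi
    printf '%s\n' "$total"
}

# Succeed (exit 0) when the estimated delay exceeds the limit in seconds.
# $1 = message, $2 = limit (assumed default: 300).
capture_is_stalled() {
    local limit="${2:-300}"
    (( $(capture_delay_seconds "$1") > limit ))
}
```

A batch script could call `capture_is_stalled "$message"` after each submission and push the URL back onto the queue for a later attempt instead of waiting hours on a single job.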

This is a problem as it's stalling the archival of long lists of URLs. See my latest comment on #19 .