Open barkoder opened 10 months ago
Have you tried https://github.com/bibanon/tubeup for multimedia artifacts?
I'm not archiving multimedia.
I'm only archiving the HTML page of a YouTube video. I only want a record that a video existed at some point on YouTube.
Whether the Wayback Machine later chooses to archive the video itself (hosted on the googlevideo CDN) is up to them.
Also, I want to mention that's not the only example of captures taking 10-15 hours. It's happening on Reddit threads as well.
@barkoder Thanks for that context. I believe Wayback does have per-site queues to prevent hitting sites too hard, which could be why you're seeing the latency you describe between submitting an archival request and when it is performed.
More information at https://archive.org/developers/tasks.html
@brandongalbraith I am aware they have those queues. If that is the case, why are captures through the web UI successful in under a minute?
This issue (of ~10-15 hour capture times) occurs even if I'm only archiving a single URL through spn.sh with -p 1.
Also are you using spn.sh? Are you able to reproduce this issue?
I use https://github.com/palewire/savepagenow dockerized and my archive requests mostly immediately return the Wayback snapshot url lately (although within the last week I have seen a spike in unknown error events where my workers had to re-queue the tasks for re-attempt).
Well, that confirms it: spn.sh is indeed broken. I tried https://github.com/palewire/savepagenow and it's returning the snapshots.
Unfortunately spn.sh is significantly more powerful due to its easy job parallelization and lower memory usage, thanks to curl.
Still, thanks for the alternative @brandongalbraith, even if only to validate that spn.sh is broken.
Help! Please! @overcast07 @otuva
Okay, I think I just figured it out. Using `-d 'force_get=1'` gets the capture time down to under a minute.
From the API documentation:

> `force_get=1` - Force the use of a simple HTTP GET request to capture the target URL. By default SPN2 does an HTTP HEAD on the target URL to decide whether to use a headless browser or a simple HTTP GET request. `force_get` overrides this behavior.
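For reference, a minimal sketch of the form-encoded POST body that a capture request with this flag would carry (the field names follow the API documentation quoted above; everything else, including the example URL, is a placeholder):

```python
from urllib.parse import urlencode

def build_spn2_payload(url, force_get=True):
    """Build the form-encoded body for a Save Page Now 2 capture request.

    Field names follow the SPN2 API docs quoted above; the example URL
    in the usage below is a placeholder.
    """
    fields = {"url": url}
    if force_get:
        # Skip the HTTP HEAD probe and force a plain GET capture,
        # bypassing the headless-browser path entirely.
        fields["force_get"] = "1"
    return urlencode(fields)

payload = build_spn2_payload("https://www.youtube.com/watch?v=VIDEOID")
print(payload)
```

This is the body spn.sh effectively sends when you pass `-d 'force_get=1'`.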
This method is probably suboptimal for saving javascript-heavy pages.
I think the wayback machine is having problems allocating headless browsers for capture jobs at the moment. Hence the ~10-15 hour delay. At least that's my guess.
But for now, at least, I appear to be able to use spn.sh again. Let's see how long this lasts; I may update this ticket later.
If the Internet Archive decides to throttle requests, that's not really something that we can do much about, and it's probably for a good reason. It's also usually temporary; at time of writing the wait time has already decreased to around 6 hours.
However, in this case, the throttling is apparently applied only to POST requests which do not have the header `Referer: https://web.archive.org/save`. As a temporary measure, you could avoid the throttling by using `-c "-H 'Referer: https://web.archive.org/save'"`, but I'm not sure whether it would be appropriate to hardcode this into the script.
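The Referer workaround above amounts to adding one header to the capture request. A sketch of the full header set such a POST would need (the endpoint and header names come from this thread; `S3KEY`/`S3PASS` are placeholder credentials):

```python
def build_capture_headers(s3_key, s3_secret,
                          referer="https://web.archive.org/save"):
    """Headers for a Save Page Now capture POST.

    Requests carrying this Referer are reportedly not throttled;
    the keys are placeholders for archive.org S3-style API keys.
    """
    return {
        "Accept": "application/json",
        "Authorization": f"LOW {s3_key}:{s3_secret}",
        "Referer": referer,
    }

headers = build_capture_headers("S3KEY", "S3PASS")
```

Passing the equivalent `-H 'Referer: ...'` through spn.sh's `-c` option, as described above, achieves the same thing without modifying the script.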
For us, other possible options include:
Tried a couple of videos; they all said ~25 seconds and completed in around 2 minutes each.
Now those long capture duration messages are appearing seemingly randomly.
The capture will start in ~4 seconds
The capture will start in ~5 seconds
The capture will start in ~6 seconds
The capture will start in ~10 hours 40 minutes
The capture will start in ~5 seconds
The capture will start in ~6 seconds
It's weird. I cancel the job, archive the same link after a couple of minutes, and it archives in ~6-10 seconds.
This is a problem as it's stalling the archival of long lists of URLs. See my latest comment on #19 .
Some URLs like https://i.ytimg.com/vi/VIDEOID/maxresdefault.jpg get saved via spn.sh
Some others, like https://www.youtube.com/watch?v=VIDEOID, say the capture will start in 10 hours, or even 15 hours, apparently because they're overloaded.
But when I try the same URLs in a regular web browser via https://web.archive.org/save, it starts normally: the "capture will start in" time is less than a minute. So they're "overloaded" by 10-15 hours only for curl users.
I don't understand why the capture time is so long when it's done through curl.
Other times I keep getting a `Request Failed` error, but when I check the SPN user status using `curl -s -H 'Accept: application/json' -H 'Authorization: LOW S3KEY:S3PASS' https://web.archive.org/save/status/user`, the `"daily_captures"` count keeps ticking up for every `Request Failed`.

I tried exporting the headers and cookies manually with `-b cookiesfile.txt` and merged them into my `.curlrc`, and it seemed to work for a very brief moment (5-10 minutes). I got a `Job submitted` and `Job completed` for a few captures, but it stopped working later and went back to giving me `Request Failed` again.

What is going on? Is the Internet Archive de-prioritizing curl users? Or is something broken on my computer?
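The user-status check mentioned above can also be scripted to watch that counter. A sketch, assuming the `/save/status/user` response is JSON containing a `daily_captures` field as described in this thread (the rest of the sample response shape is illustrative, not real API output):

```python
import json

def extract_daily_captures(status_json):
    """Pull the daily_captures counter out of a /save/status/user response.

    The field name comes from the comment above; the rest of the
    response shape is an assumption.
    """
    status = json.loads(status_json)
    return int(status.get("daily_captures", 0))

# Illustrative response body only, not real API output:
sample = '{"daily_captures": 123, "available": 4}'
print(extract_daily_captures(sample))  # 123
```

Comparing this counter before and after a `Request Failed` would confirm whether the failed requests are, as suspected, still being counted against the daily quota.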
List of things I've tried:

- `.curlrc`. Issue still there.
- `-a` and the issue is still there.

Please help. I don't know what to do. The issue has been there for the past 2 days. curl version: 8.4.0
EDIT: Please confirm if you're also having this issue.
Thanks!