overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Very good script! Thanks for your work! There is one question about Log #9

Closed by fireindark707 1 year ago

fireindark707 commented 2 years ago

Many thanks for this script; I think it contributes a lot to web archiving. I am using it now, and it works very well most of the time. However, I recently noticed a possible issue: some URLs are listed in success.log, but when I look them up directly on the Internet Archive site, it reports that they are not indexed. I'm not sure why. Many other links are captured successfully. I'll keep testing to determine what's going on (latency on the IA side?).
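One way to run this kind of test is to ask the Wayback CDX API whether any capture of a URL has been indexed yet. The following is a minimal sketch, not part of the script in this repository: the function names are hypothetical, and it assumes that an empty response body from the CDX endpoint means no capture is indexed.

```shell
#!/usr/bin/env bash
# Sketch: check whether a URL has at least one indexed capture in the Wayback CDX API.
# Function names are illustrative, not taken from the repository.

cdx_endpoint='https://web.archive.org/cdx/search/cdx'

# Read a CDX response on stdin; succeed if it contains any non-blank line.
# Assumption: an empty body means no capture is indexed.
has_capture() {
  grep -q '[^[:space:]]'
}

# Query the CDX API for at most one capture of "$1" and print a status line.
# A network failure also yields an empty body, so it is reported as "missing".
check_url() {
  local url=$1
  if curl -sG --max-time 30 "$cdx_endpoint" \
       --data-urlencode "url=$url" \
       --data-urlencode 'limit=1' | has_capture; then
    printf 'indexed\t%s\n' "$url"
  else
    printf 'missing\t%s\n' "$url"
  fi
}
```

Given a file with one original URL per line (here a hypothetical urls.txt), the check could be run as `while IFS= read -r u; do check_url "$u"; done < urls.txt`. Repeating it some hours later would show whether the "missing" entries are only delayed.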

overcast07 commented 2 years ago

Sometimes it takes 12 to 24 hours for a URL to appear in the Wayback Machine and in the Wayback CDX API. This is normal and expected behavior, although it is not documented. The SPN2 API also has a flag, delay_wb_availability=1, that disables immediate indexing, which would let you intentionally reproduce the delay you've been experiencing. It can be enabled in the script with the option -d 'delay_wb_availability=1'.

The SPN2 API documentation describes this flag as follows: "The capture becomes available in the Wayback Machine after ~12 hours instead of immediately. This option helps reduce the load on our systems. All API responses remain exactly the same when using this option."
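As a rough illustration of passing that option, a small wrapper might look like the sketch below. The script path ./spn.sh is an assumption (adjust it to wherever the script lives), run_delayed is a hypothetical helper name, and all other arguments are passed through unchanged.

```shell
#!/usr/bin/env bash
# Sketch: run the script with delayed indexing requested.

# Path to the script; "./spn.sh" is an assumed name and location.
SPN_SCRIPT=${SPN_SCRIPT:-./spn.sh}

# Call the script with -d 'delay_wb_availability=1', followed by whatever
# arguments the caller supplies (for example a URL list).
run_delayed() {
  "$SPN_SCRIPT" -d 'delay_wb_availability=1' "$@"
}
```

For example, `run_delayed urls.txt` would run the script on a hypothetical urls.txt with the flag set. Per the description quoted above, the responses should look the same as without the flag; only the time at which captures become visible changes.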

fireindark707 commented 2 years ago

Thank you for your reply. Where does the SPN2 API documentation come from? I couldn't find it on the IA website.

overcast07 commented 2 years ago

It isn't linked on the IA website. It was first mentioned on Twitter in 2019 by a staff member.