overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

URLs skipped due to if_not_archived_within option treated as failures #33

Open. AnyOldName3 opened this issue 4 months ago.

AnyOldName3 commented 4 months ago

I'm trying to archive the missing parts of the website for an open-source library so that it can be replaced by a new site without any information being lost to history. Since I know that much of the site is already on the Wayback Machine and hasn't been changed recently, I've used the if_not_archived_within option with a large value to avoid unnecessarily archiving another copy of pages that don't need it.
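
For reference, if_not_archived_within is a parameter of the underlying Save Page Now 2 API, so the capture requests the script submits amount to something like the following sketch (placeholder credentials and URL, and an arbitrary example value):

```python
# Rough sketch of an SPN2 capture request with if_not_archived_within set.
# The credentials, URL, and "90d" value are placeholders, not the script's
# actual defaults.
import requests

response = requests.post(
    "https://web.archive.org/save",
    headers={
        "Accept": "application/json",
        # S3-style API keys from the Internet Archive account settings
        "Authorization": "LOW MYACCESSKEY:MYSECRET",
    },
    data={
        "url": "https://example.org/some/page",
        # Skip the capture if a snapshot from the last ~90 days already exists
        "if_not_archived_within": "90d",
    },
    timeout=60,
)
# A job_id if a capture is queued, or a message like the one quoted below
# if the capture is skipped
print(response.json())
```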

spn.sh is treating the responses it gets back in this case, e.g. "The same snapshot had been made n hours, n minutes ago. You can make new capture of this URL after n hours.", as failures, but as far as I'm concerned this is a success: the page is in the archive.

This isn't a massive problem on its own, since I could just let the script run through my URL list. However, I've had a series of failures and have had to use the resume feature, and each time I do, the script decides it must attempt all of these URLs again, which wastes a lot of time and increases the chances that another failure will knock it out partway through again.

I don't really mind whether it's behind a flag or the default behaviour, but it would be preferable if the "same snapshot" response were treated as a success.

overcast07 commented 4 months ago

I would say it's still worth it to make a capture to confirm that a page hasn't been changed recently. We probably aren't making the slightest dent in the server hosting costs of the Internet Archive, given that they archive several terabytes of content every day just from SPN URL submissions.*

I would need to investigate further to determine what the issue actually is. Oddly, I think if if_not_archived_within is unset and the server returns "The same snapshot had been made n minutes ago", it does get counted as a successful capture.

* Direct SPN captures and outlink captures accounted for about 1,100 TB and 500 TB of data respectively in 2023, assuming that each item in the linked collections contains about 10 GB of WARC files.

AnyOldName3 commented 4 months ago

> I would say it's still worth it to make a capture to confirm that a page hasn't been changed recently.

In my specific case, this would be a waste: the CMS adds a hit counter to each page, so every page would always be marked as having changed. Also, most (but not all) of the site was already subject to a frequent crawl anyway, and I was just aiming for the bits that had been missed.

In the end (i.e. a few minutes ago, despite originally starting weeks ago), I just wrote a Python script that grabs the success lists and the relevant lines from the invalid logs in the ~/spn-data subdirectories, combines them into a final success list, and uses that to filter the input URL list (see the sketch below). That let me kick off the script again with a smaller list of targets, and it's already put a dent in them.
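
A minimal sketch of that filtering step is below. The file names under ~/spn-data and the log line format are assumptions, so adjust the globs and the "same snapshot" check to whatever your copy of spn.sh actually writes, and it assumes the original input list is in urls.txt.

```python
# Combine earlier runs' results into one "already archived" set, then filter
# the input URL list. File names under ~/spn-data are guesses; adjust them
# to match the logs your version of spn.sh produces.
from pathlib import Path

data_dir = Path.home() / "spn-data"
archived = set()

# URLs captured successfully in any earlier run
for success_file in data_dir.glob("*/success*"):
    archived.update(
        line.strip() for line in success_file.read_text().splitlines() if line.strip()
    )

# URLs rejected only because a recent-enough snapshot already existed
for invalid_file in data_dir.glob("*/invalid*"):
    for line in invalid_file.read_text().splitlines():
        if "same snapshot" in line.lower():
            archived.add(line.split()[0])  # assumes the URL is the first field

# Keep only the URLs that still need to be captured
urls = [u.strip() for u in Path("urls.txt").read_text().splitlines() if u.strip()]
remaining = [u for u in urls if u not in archived]
Path("urls-remaining.txt").write_text("\n".join(remaining) + "\n")
print(f"{len(remaining)} of {len(urls)} URLs left to capture")
```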

overcast07 commented 4 months ago

Yeah, I usually do something like that myself if I have to remove URLs that resulted in failed captures from the URL list, although it's not ideal, since the script theoretically isn't supposed to require much technical competence to use.