overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Print the Wayback Machine saved page URL #12

Open ghost opened 2 years ago

ghost commented 2 years ago

When saving a page from the web interface, WM first redirects to https://web.archive.org/save/, then to (for instance) `https://web.archive.org/web/2021/https://github.com/overcast07/wayback-machine-spn-scripts/`
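For illustration, a URL of that shape can be assembled from the page URL alone. This is only a sketch: `wayback_url` is a hypothetical helper, not part of the scripts, and it assumes the usual Wayback behaviour of resolving a partial timestamp such as `2021` to a nearby capture, with `*` giving the capture listing instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wayback-machine-spn-scripts).
# Prints a Wayback Machine URL for a page.
#   $1 = original page URL
#   $2 = timestamp or timestamp prefix, e.g. 2021 (defaults to *)
wayback_url() {
  local url="$1"
  local timestamp="${2:-*}"
  printf 'https://web.archive.org/web/%s/%s\n' "$timestamp" "$url"
}
```

Calling `wayback_url 'https://github.com/overcast07/wayback-machine-spn-scripts/' 2021` prints the URL quoted above.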

Is it possible to print the URL using the script? This page can be used to inspect how many snapshots have been saved — I'd consider it as useful statistical information especially when saving outlinks (an archive.org account required, I don't know if the page can be accessed via API):

Sign in to use extra features: "Save outlinks", "Save screen shot" and "My web archive".

overcast07 commented 2 years ago

I'm not quite sure what you want to do here. The output text of the script isn't clickable, so you'd still have to copy and paste the URL into the browser anyway. Would it be sufficient (assuming you want to look at the data for a small number of individual URLs) to keep a tab open at https://web.archive.org/web/*/ and then paste the URLs that are being archived into the browser one by one?

The exact number of previous snapshots isn't always a necessary metric – if the saved content is exactly the same as it was the last time, then it's archived as a warc/revisit which just records that the content of the URL was the same as the last time it was visited, so it's not just a duplicate file being stored (if the software can prevent it from happening). (I don't think SPN users really need to worry about wasted storage space anyway unless the scale is in, like, the hundreds of thousands of URLs, something large enough to make a dent in the overall numbers.) If your capture is the first of that URL it'll be recorded in success-json.log (search for first-archive), but I never wrote code for the script to print it to the console.
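As a rough sketch of that search: the helper below counts the lines of a log file that mention a first archive. `count_first_archives` is a hypothetical name, the path to success-json.log has to be supplied by the caller, and the pattern accepts both `first-archive` and `first_archive` because the exact spelling inside the log is an assumption here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wayback-machine-spn-scripts).
# Counts lines in a log file that mention a first archive.
#   $1 = path to the log file, e.g. the script's success-json.log
# Assumption: the marker appears as "first-archive" or "first_archive".
count_first_archives() {
  local log_file="$1"
  # grep -c prints the number of matching lines (0 if there are none)
  grep -c -E 'first[-_]archive' -- "$log_file"
}
```

If the log also records captures that were not a first archive under the same key, the pattern would need to be narrowed to match only the positive value.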

Depending on your use case it may be easier for you to query the Wayback CDX API (example, documentation), which would allow you to view results for a URL prefix rather than just a single URL, and filter the results by HTTP response code or file type and so on.
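A minimal sketch of such a query, assuming the public CDX endpoint at `https://web.archive.org/cdx/search/cdx` with its `url`, `matchType`, `filter` and `limit` parameters. `cdx_query_url` is a hypothetical helper, and it does not percent-encode the prefix, so it only suits simple prefixes without `&` or `#`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wayback-machine-spn-scripts).
# Prints a CDX API query URL listing captures under a URL prefix.
#   $1 = URL prefix, e.g. example.com/blog/
#   $2 = HTTP status code to keep (defaults to 200)
#   $3 = maximum number of rows (defaults to 50)
# Note: the prefix is inserted as-is, without percent-encoding.
cdx_query_url() {
  local prefix="$1"
  local status="${2:-200}"
  local limit="${3:-50}"
  printf 'https://web.archive.org/cdx/search/cdx?url=%s&matchType=prefix&filter=statuscode:%s&limit=%s\n' \
    "$prefix" "$status" "$limit"
}
```

The result can be passed to curl, for example `curl -s "$(cdx_query_url 'example.com/blog/' 200 10)"`. Each row of the default plain-text output describes one capture (URL key, timestamp, original URL, MIME type, status code, digest, length).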

The outlinks function of the website is not currently something that the script has proper support for, and it's not the same thing as the -o and -x flags in the script. You can request it with -d (I don't remember the exact command at the moment) if -a is used, but the script won't show you any of the data for the captures of the outlinks. I never got around to coding support for it.

ghost commented 2 years ago

> The output text of the script isn't clickable, so you'd still have to copy and paste the URL into the browser anyway.

It can be copied quite easily with tmux, for instance.

Thanks for the reply. I was mainly interested in the "outlinks results" page, which can be quite satisfying to view when you've managed to create the first archive for dozens of blog posts.

ghost commented 1 year ago

May 4, 2023: Addition of -w flag; addition of "first archive" check;

The 'first archive' feature mostly completes the request; it covers what I was mainly after.