overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License
104 stars, 9 forks

Feature: ability to only save pages that haven't been archived yet #30

Open exurd opened 1 year ago

exurd commented 1 year ago

There should be an option that checks whether a URL has already been archived in the Wayback Machine before submitting it. For example, if you have a bunch of text files and only want to send requests for URLs with no archived page (i.e. the first archive of a page), this setting would help.

Another thing to consider when adding this is how far back it should check. Maybe the option could take a timestamp as its argument?
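
As a rough illustration of what such a check could look like, the sketch below queries the Wayback Machine Availability API (https://archive.org/wayback/available) and compares the returned snapshot's timestamp against a cutoff. The function names are hypothetical and not part of spn.sh. The parsing assumes the API's usual JSON layout with a 14-digit timestamp field, and the snapshot returned when no timestamp is requested is assumed to be the most recent one.

```shell
# Sketch only: none of these functions exist in spn.sh.

# Print the Availability API's JSON response for one URL.
fetch_availability() {
    curl -s -G --data-urlencode "url=$1" "https://archive.org/wayback/available"
}

# Print the 14-digit timestamp (YYYYMMDDhhmmss) of the snapshot in a
# response, or nothing if the response lists no snapshot.
# $1 = JSON response
snapshot_timestamp() {
    printf '%s\n' "$1" |
        sed -n 's/.*"timestamp": *"\([0-9]\{14\}\)".*/\1/p' |
        head -n 1
}

# Succeed if the response contains a snapshot taken at or after the cutoff.
# $1 = JSON response, $2 = cutoff as YYYYMMDDhhmmss
archived_since() {
    ts=$(snapshot_timestamp "$1")
    [ -n "$ts" ] && [ "$ts" -ge "$2" ]
}

# Example (performs a network request):
#   archived_since "$(fetch_availability 'https://example.com/')" 20200101000000 \
#       && echo "already archived since 2020"
```

Keeping the network call separate from the parsing means the cutoff logic can be tried on a saved response without contacting archive.org.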

TheTechRobo commented 1 year ago

Just chiming in that I think this would be really slow.

exurd commented 1 year ago

If I wanted to do this on the text files I have myself, I would currently need to do this:

  1. Turn the text file into a Google Sheet
  2. Put it into the "Batch process Google Sheets using archive.org services" app with the "Check if URLs are archived in the Wayback Machine" feature
  3. Export from Google Sheets to a CSV
  4. Convert that CSV to a usable text file (with the URLs only)
  5. Finally, send it to spn.sh

It takes around half an hour (or more) to process each of those files by hand. With multiple files, this gets tedious pretty quickly; merging and un-merging them would add two more steps to an already long process (and I wouldn't even know how to separate them again after they get processed).

If the script could do this, it would not only make this workflow obsolete, but it would also be quicker, since it wouldn't need to process every URL at once.
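
The manual workflow above could, in principle, be collapsed into a small filter. The following is a minimal sketch, assuming the Wayback Machine Availability API and its `"available": true` field; the helper names and the file names in the usage note are hypothetical, and nothing here is part of spn.sh.

```shell
# Sketch only: these helper names are hypothetical and not part of spn.sh.

# Print the Availability API's JSON response for one URL.
fetch_availability() {
    curl -s -G --data-urlencode "url=$1" "https://archive.org/wayback/available"
}

# Read URLs from stdin (one per line) and print only those for which
# the API reports no available snapshot. A failed lookup is treated as
# "not archived", so the URL is kept rather than silently dropped.
filter_unarchived() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        if ! fetch_availability "$url" </dev/null |
                grep -Eq '"available": *true'; then
            printf '%s\n' "$url"
        fi
    done
}
```

Something like `filter_unarchived < urls.txt > unarchived.txt` would then produce a list that can be passed to spn.sh. Note that this makes one request per URL, so it will be slow for large lists, as pointed out earlier in the thread.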

AgostinoSturaro commented 1 year ago

> Another thing to consider when adding this is how far back it should check. Maybe the option could take a timestamp as its argument?

This should already be possible. Use the -d flag to set the if_not_archived_within capture option, plus the -n option to avoid saving error pages. See the SPN 2 Public API doc for the available options. Something like spn.sh -d 'if_not_archived_within=5y' -n should only save working pages that have not been saved within the last 5 years. Try it out. There are even options for the outlinks.
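
For several URL lists, the suggested invocation could be wrapped in a loop. This is only a sketch: `submit_lists` and the `SPN_CMD` variable are hypothetical names introduced here, the `5y` value is taken from the comment above, and the accepted formats for if_not_archived_within are defined by the SPN 2 Public API doc rather than by this example.

```shell
# Sketch only: submit_lists and SPN_CMD are hypothetical, not part of spn.sh.

# Run spn.sh once per URL list, asking Save Page Now to skip URLs whose
# latest capture is newer than the given window.
# $1 = window (e.g. 5y); remaining arguments = text files of URLs
submit_lists() {
    window=$1
    shift
    for file in "$@"; do
        # SPN_CMD lets the command be swapped out for a dry run.
        "${SPN_CMD:-spn.sh}" -d "if_not_archived_within=$window" -n "$file"
    done
}

# Example:
#   submit_lists 5y list1.txt list2.txt
```

Because the check happens on the Save Page Now side, this avoids the separate lookup step, and each file stays separate so nothing needs to be merged or un-merged.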

NoodlesStamps commented 1 year ago

"The capture will start in ~ seconds because we are doing a lot of captures of ~ ~ right now" When this message appears, it seems that the page still gets archived again, producing a duplicate capture.