Open exurd opened 1 year ago
Just chiming in: I think doing this by hand would be really slow.
If I wanted to do this on the text files I have myself, I would currently need to do this:
It takes around half an hour (or more) to hand-process each of those files. With multiple files this would get tedious pretty quickly; merging and un-merging them would add two more steps to an already long process (and I wouldn't even know how to separate them again after they were processed).
If the script could do this itself, it would not only make that manual method obsolete, it would also be quicker, since it wouldn't need to process every URL at once.
Another thing to consider when adding this is how far back it should check. Maybe the option could work by passing the option together with a timestamp?
This should be already possible.
Use the -d flag to set the if_not_archived_within capture option, plus the -n option to avoid saving error pages. See the SPN2 Public API doc for the available options.
Something like spn.sh -d 'if_not_archived_within=5y' -n should save only working pages that haven't been captured within the last 5 years.
Try it out. There are even options for the outlinks.
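For anyone curious how the -d payload relates to the API: the value ultimately ends up in the POST body of an SPN2 capture request. Here is a minimal sketch of what such a request could look like. The https://web.archive.org/save endpoint and the if_not_archived_within parameter come from the public API doc; the helper name and the exact way spn.sh assembles its request are assumptions (authentication headers are omitted).

```shell
#!/bin/sh
# Hypothetical helper (not part of spn.sh): print the curl command for an
# SPN2 capture request without executing it, so the mapping from the -d
# payload to the API's POST body is visible.
build_spn_request() {
  url=$1
  opts=$2   # e.g. 'if_not_archived_within=5y'
  printf 'curl -s -X POST -H "Accept: application/json" --data-urlencode "url=%s" --data "%s" https://web.archive.org/save\n' \
    "$url" "$opts"
}

# Print (but do not run) the request for one URL:
build_spn_request "https://example.com" "if_not_archived_within=5y"
```

This only prints the command; actually submitting captures through the API requires an Internet Archive account and S3-style API keys.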
"The capture will start in ~ seconds because we are doing a lot of captures of ~ ~ right now" When this message appears, it seems that the archive will be duplicated.
There should be an option to check the Wayback Machine for whether a URL has already been archived. For example, if you have a bunch of text files and only want to send requests for URLs with no archived page at all (i.e. the first capture of a page), this setting would help.
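A pre-flight check like this could be built on the Wayback Machine's public Availability API, which reports the closest existing snapshot for a URL. A minimal sketch, assuming curl is installed and accepting a crude grep-based check of the JSON instead of a real parser; the function names and urls.txt are placeholders, not part of spn.sh:

```shell
#!/bin/sh
# Sketch: decide whether a URL has ever been captured, using the public
# Availability API (https://archive.org/wayback/available?url=...).
# The API returns {"archived_snapshots": {}} when nothing is archived,
# and an "archived_snapshots.closest" object when a snapshot exists.

has_snapshot() {
  # Reads an Availability API JSON response on stdin; succeeds (exit 0)
  # if the response contains a "closest" snapshot entry.
  grep -q '"closest"'
}

is_archived() {
  # Queries the API for one URL and reports whether any capture exists.
  curl -s "https://archive.org/wayback/available?url=$1" | has_snapshot
}

# Example filtering loop (assumes a urls.txt, one URL per line; commented
# out so the sketch stays side-effect free):
#   while IFS= read -r url; do
#     is_archived "$url" || printf '%s\n' "$url"
#   done < urls.txt > unarchived.txt
```

A real implementation would want a proper JSON parser (e.g. jq) and some rate limiting, but this shows the basic shape of a "skip already-archived URLs" pass.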