
Bash scripts for Wayback Machine Save Page Now

spn.sh is a Bash script that asks the Internet Archive Wayback Machine's Save Page Now (SPN) service to save live web pages.

Note: This project is not affiliated with the Internet Archive. The Internet Archive periodically makes server-side changes to the SPN service, so the script's behavior can become outdated quickly. Older revisions of the script are not supported and may not work.

Introduction

Features

Motivation

Several alternatives to this script exist, including the Python program savepagenow by pastpages and the wayback-gsheets interface on archive.org. However, this script has some functional advantages over other methods of sending captures to the Wayback Machine.

The information on this page is focused on the script itself. More information about Save Page Now can be found at the draft SPN2 public API documentation by Vangelis Banos.

spn.sh

Installation

Download the script and make it executable with the command chmod a+x spn.sh. The script can be run directly and does not need to be compiled to a binary executable beforehand.
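For example, the script can be fetched and made executable from the command line (the raw file URL and branch name below are assumptions; adjust them to match the repository):

curl -LO https://raw.githubusercontent.com/overcast07/wayback-machine-spn-scripts/main/spn.sh   # download spn.sh
chmod a+x spn.sh   # mark it executable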

Arch Linux

On Arch Linux, the script is also available as an AUR package, which can be installed with your favorite AUR helper:

yay -S wayback-spn-script-git

Dependencies

This script is written in Bash and has been tested with the preinstalled shell utilities on macOS 10.14, macOS 12, and Windows 10 (WSL Ubuntu). As far as possible, utilities are invoked in ways that behave consistently across their GNU and BSD implementations. (The use of sed -E in particular may be a problem for older versions of GNU sed, but otherwise there should be no major compatibility issues.)
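As a quick sanity check, you can confirm that your sed implementation supports -E (this one-liner is only an illustration; any extended regex would do):

echo abc | sed -E 's/(a|b)/X/g'   # prints XXc if -E is supported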

Operation

The only required input (unless resuming a previous session) is the first argument, which can be a file name or URL. If the file doesn't exist or if there's more than one argument, then the input is treated as a set of URLs. A sub-subfolder of ~/spn-data is created when the script starts, and logs and data are stored in the new folder. Some information is also sent to the standard output. All dates and times are in UTC.

The main list of URLs is stored in memory. Periodically, URLs for failed capture jobs and outlinks are added to the list. When there are no more URLs, the script terminates.

The script can be terminated from the command prompt with Ctrl+C (or by other methods like the task manager or the kill command). If this is done, no more capture jobs will be started, although active capture jobs may continue to run for a few minutes.
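For example, to stop the script from another terminal, a sketch using pkill (which sends the signal to every process whose command line matches the pattern, so adjust it if other matching processes are running):

pkill -INT -f spn.sh   # sends SIGINT, the same signal as Ctrl+C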

The script may sometimes produce no output to the console or to the log files for an extended period. This can occur if Save Page Now introduces a delay for captures of a specific domain, though the delay is typically a few minutes at most. If you're on Windows, first make sure the output isn't simply paused by the console itself (clicking inside a PowerShell window selects text, which can suspend output until the selection is cleared).

Using the -q flag (log fewer JSON responses) is recommended to save disk space during typical usage. -n (don't save error pages) is also recommended unless it is important to archive particular error pages.
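For example, a typical invocation might combine both flags:

spn.sh -q -n urls.txt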

Usage examples

Basic usage

Submit URLs from the command line.

spn.sh https://example.com/page1/ https://example.com/page2/

If this doesn't work, try specifying the file path of the script. For example, if you move the script into your home folder:

~/spn.sh https://example.com/page1/ https://example.com/page2/

Submit URLs from a text file containing one URL per line.

spn.sh urls.txt

Run jobs in parallel

Keep at most 15 capture jobs active at the same time. (The server-side rate limit may come into effect before reaching this limit.)

spn.sh -p 15 urls.txt

Don't run capture jobs in parallel. Start no more than one capture every 60 seconds.

spn.sh -p 1 -w 60 urls.txt

Save outlinks

Save all outlinks, outlinks of outlinks, and so on. (The script continues until either there are no more URLs or the script is terminated by the user.)

spn.sh -o '' https://example.com/

Save outlinks matching either youtube or reddit, except those matching facebook.

spn.sh -o 'youtube|reddit' -x 'facebook' https://example.com/

Save outlinks to the subdomain fr.wikipedia.org.

spn.sh -o 'https?://fr\.wikipedia\.org(/|$)' https://example.com/

Flags

Options:
 -a auth        S3 API keys, in the form accesskey:secret
                (get account keys at https://archive.org/account/s3.php)

 -c args        pass additional arguments to curl

 -d data        capture request options, or other arbitrary POST data

 -f folder      use a custom location for the data folder
                (some files will be overwritten or deleted during the session)

 -i suffix      add a suffix to the name of the data folder
                (if -f is used, -i is ignored)

 -n             tell Save Page Now not to save errors into the Wayback Machine

 -o pattern     save detected capture outlinks matching regex (ERE) pattern

 -p N           run at most N capture jobs in parallel (default: 20)

 -q             discard JSON for completed jobs instead of writing to log file

 -r folder      resume with the remaining URLs of an aborted session
                (settings are not carried over, except for outlinks options)

 -s             use HTTPS for all captures and change HTTP input URLs to HTTPS

 -t N           wait at least N seconds before updating the main list of URLs
                with outlinks and failed capture jobs (default: 3600)

 -w N           wait at least N seconds after starting a capture job before
                starting another capture job (default: 2.5)

 -x pattern     save detected capture outlinks not matching regex (ERE) pattern
                (if -o is also used, outlinks are filtered using both regexes)

All flags should be placed before the arguments, but the flags themselves may be used in any order. If a string contains characters that need to be escaped in Bash, wrap the string in quotes, e.g. -x '&action=edit'.
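For example, several flags can be combined before the arguments (the access key and secret here are placeholders):

spn.sh -a accesskey:secret -p 10 -q -n urls.txt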

Data files

The .txt files in the data folder of the running script may be modified manually to affect the script's operation, except for old versions of those files, which have been renamed to include a Unix timestamp in the file name.
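For example, a URL could in principle be queued manually by appending it to one of those .txt files (the session folder and file name below are hypothetical; check the data folder of your session for the actual names used by your version of the script):

echo 'https://example.com/extra-page/' >> ~/spn-data/<session folder>/outlinks.txt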

The .log files in the data folder of the running script do not affect the script's operation. The files are not created until they receive data (which may include blank lines), and are updated continuously until the script finishes. If the script is aborted, the files may continue receiving data from capture jobs for a few minutes. Log files that contain JSON are not themselves valid JSON, but can be converted to valid JSON with a command such as:

sed -e 's/}$/},/g' <<< "[$(<filename.log)]" > filename.json
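The converted file can then be processed with standard JSON tools; for example, if jq is installed:

jq length filename.json   # count the logged JSON responses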

Additional usage examples

Outlinks

Save outlinks to all subdomains of example.org.

spn.sh -o 'https?://([^/]+\.)?example\.org(/|$)' https://example.com/

Save outlinks to example.org/files/ and all its subdirectories, except for links with the file extension .mp4.

spn.sh -o 'https?://(www\.)?example\.org/files/' -x '\.mp4(\?|$)' https://example.com/

Save outlinks matching YouTube video URLs.

spn.sh -o 'https?://(www\.|m\.)?youtube\.com/watch\?(.*\&)?v=[a-zA-Z0-9_-]{11}|https?://youtu\.be/[a-zA-Z0-9_-]{11}' https://example.com/

Save outlinks matching MediaFire file download URLs, and update the URL list as frequently as possible so that the outlinks can be captured before they expire.

spn.sh -t 0 -o 'https?://download[0-9]+\.mediafire\.com/' https://www.mediafire.com/file/a28veehw21gq6dc

Save subdirectories and files in an IPFS folder, visiting each file twice (replace the example folder URL with that of the folder to be archived).

spn.sh -o 'https?://cloudflare-ipfs\.com/ipfs/(QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn/.+|[a-zA-Z0-9]{46}\?filename=)' https://cloudflare-ipfs.com/ipfs/QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn

Changelog