overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Fetch additional data resulting from SPN2 capture_outlinks function #23

Open overcast07 opened 1 year ago

overcast07 commented 1 year ago

Ideally, this script should be able to fetch data for outlinks captured using the server-side SPN2 outlinks function.

Any implementation of this would run into a particular challenge: polling the status API endpoint for a large number of outlinks could cause the server to return 429 errors if the rate of requests is too high. The overall rate of requests would have to be controlled in some way, accounting for the additional requests made.
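As a rough illustration of how the overall request rate could be kept bounded, the polling interval could be scaled with the number of jobs being polled. This is only a sketch: `poll_interval` is a hypothetical helper, and the per-minute budget passed to it is an assumed placeholder, not a documented server limit.

```bash
# Hypothetical helper: seconds each pending job should wait between status
# polls so that all jobs together stay within a per-minute request budget.
# $1 = number of jobs being polled
# $2 = request budget per minute (assumed value, not a documented limit)
# $3 = minimum interval in seconds (optional, defaults to 1)
poll_interval() {
    local pending="$1" budget="$2" min_interval="${3:-1}"
    # N jobs polling every T seconds make 60*N/T requests per minute,
    # so T must be at least 60*N/budget, rounded up.
    local interval=$(( (60 * pending + budget - 1) / budget ))
    if [ "$interval" -lt "$min_interval" ]; then
        interval="$min_interval"
    fi
    echo "$interval"
}
```

Each poller could then run something like `sleep "$(poll_interval "$pending_count" 15)"` between status checks, where `pending_count` and the budget of 15 are placeholders.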

One way to implement this would be to add a separate text file (spn2-outlinks.txt) to which outlink status IDs are added upon completion of the main capture job. A check for this file could be added at some point in the main while loops (the ones starting at lines 579, 609 and 680), and the child processes could be spawned from those loops. Importantly, this approach would allow the child processes to be immediate children of the main process, so they would be counted by jobs -p. The script would probably have to pause new job submissions while the child processes for the outlinks are spawned. A variable could be used to store remaining lines if the child processes for the outlinks are not spawned in one go.
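A minimal sketch of the file-based approach might look like the following. Everything except the `spn2-outlinks.txt` file name and the use of `jobs -p` is hypothetical: `poll_outlink` is a stand-in for the real status polling, `max_children` is an assumed cap, and `remaining_outlinks` is the variable holding IDs that could not be spawned in one go. There is still a small window in which a child could append to the file while it is being renamed, so this is not a complete solution.

```bash
outlinks_file="spn2-outlinks.txt"  # file name from the proposal above
max_children=3                     # hypothetical cap on concurrent children
remaining_outlinks=""              # backlog of IDs not yet spawned

# Stand-in for the real status polling of one outlink job (hypothetical).
poll_outlink() {
    sleep 1
    echo "$1" >> spn2-outlinks-done.txt
}

# Meant to be called from the main while loops of the parent process.
spawn_outlink_pollers() {
    # Move newly queued IDs into the in-memory backlog. Renaming first means
    # later appends by capture children go to a fresh file.
    if [[ -s "$outlinks_file" ]]; then
        mv "$outlinks_file" "$outlinks_file.processing"
        remaining_outlinks+="$(<"$outlinks_file.processing")"$'\n'
        rm -f "$outlinks_file.processing"
    fi

    local id rest=""
    while IFS= read -r id; do
        [[ -z "$id" ]] && continue
        # The children are immediate children of this shell, so jobs -p sees them.
        if (( $(jobs -pr | wc -l) < max_children )); then
            poll_outlink "$id" &
        else
            rest+="$id"$'\n'   # keep for the next pass through the loop
        fi
    done <<< "$remaining_outlinks"
    remaining_outlinks="$rest"
}
```

The main loop would also have to hold back new job submissions while `remaining_outlinks` is non-empty, which is not shown here.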

Alternatively, this could be done within each capture() child process immediately after the status API endpoint returns a successful capture and the list of outlinks. However, the extra polling would not be visible to the currently implemented check on the number of child processes (i.e. jobs -p), so the rate of requests in all parts of the script would have to be slowed down to account for it (unless the status API endpoint were polled only very infrequently for the outlinks).

We would have to decide whether failed outlink captures should be retried. Presumably, outlinks would not be collected for pages that were themselves captured as outlinks, so failed outlink captures would have to be listed separately from the main failed.txt list. An extra variable would have to be passed to the capture function to indicate whether or not to set capture_outlinks=1.
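To illustrate the extra variable and the separate failure list, here is a small sketch. The helper names `capture_args` and `record_failure` and the file name `failed-outlinks.txt` are hypothetical and are not part of the current script; only `failed.txt` and `capture_outlinks=1` come from the discussion above.

```bash
failed_file="failed.txt"                    # existing list mentioned above
failed_outlinks_file="failed-outlinks.txt"  # hypothetical name for the separate list

# Build the curl data arguments for a capture request in the array "args".
# $1 = URL to capture
# $2 = "1" to request server-side outlink capture, anything else to skip it
capture_args() {
    local url="$1" with_outlinks="${2:-0}"
    args=(--data-urlencode "url=${url}")
    if [[ "$with_outlinks" == "1" ]]; then
        args+=(-d "capture_outlinks=1")
    fi
}

# Record a failed capture in the appropriate list.
# $1 = URL, $2 = "outlink" if the URL was captured as an outlink of another page
record_failure() {
    if [[ "$2" == "outlink" ]]; then
        echo "$1" >> "$failed_outlinks_file"
    else
        echo "$1" >> "$failed_file"
    fi
}
```

A retry pass over the hypothetical `failed-outlinks.txt` would call `capture_args "$url" 0`, so that retried outlinks do not request outlinks of their own.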

This option would also need to interface appropriately with the -o, -x and -r options.

This idea was previously listed under "Future plans" in README.md, but I've removed that section since it was largely outdated.

overcast07 commented 1 year ago

The POST parameter job_id_outlinks would allow the data for all of the outlinks of a capture to be obtained at the same time. The rate limiting issue mentioned in the original post might not apply if this method is used. A list of pending captures would have to be stored, and the JSON would have to be parsed/split properly.
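A rough sketch of what that could look like is below. It rests on several assumptions that would need checking against the real API: that the status endpoint is `https://web.archive.org/save/status`, that the authorization header has the form shown, and that the response is a JSON array with one object per outlink containing string fields such as `job_id` and `status`. The function names are hypothetical. The splitting is done with awk so that no JSON tool is required.

```bash
# Request the status of all outlinks of one capture in a single call.
# $1 = job ID of the main capture, $2 = "accesskey:secret" (assumed format)
fetch_outlink_statuses() {
    curl -s -X POST \
        -H "Accept: application/json" \
        -H "Authorization: LOW $2" \
        -d "job_id_outlinks=$1" \
        "https://web.archive.org/save/status"
}

# Split a JSON array of objects (read from stdin) into one object per line.
# Tracks brace depth and string state, so braces or commas inside string
# values do not cause a split.
split_json_array() {
    awk '
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (in_str) {
                if (depth > 0) buf = buf c
                if (esc)            esc = 0
                else if (c == "\\") esc = 1
                else if (c == "\"") in_str = 0
                continue
            }
            if (c == "\"") { in_str = 1; if (depth > 0) buf = buf c; continue }
            if (c == "{") depth++
            if (depth > 0) buf = buf c
            if (c == "}") {
                depth--
                if (depth == 0) { print buf; buf = "" }
            }
        }
        # An object spanning several input lines is joined with spaces.
        if (depth > 0) buf = buf " "
    }'
}

# Print the value of one string field for each object line on stdin.
# Naive match: it takes the last occurrence of the key on the line.
job_field() {
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}
```

With these, `fetch_outlink_statuses "$job_id" "$auth" | split_json_array` would give one line per outlink, and lines whose `status` is still pending could be kept in the list of pending captures for a later check.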