Open overcast07 opened 1 year ago
The POST parameter `job_id_outlinks` would allow the data for all of the outlinks of a capture to be obtained at the same time. The rate limiting issue mentioned in the original post might not apply if this method is used. A list of pending captures would have to be stored, and the JSON would have to be parsed/split properly.
Ideally, this script should be able to fetch data for outlinks captured using the server-side SPN2 outlinks function.
Any implementation of this would run into a particular challenge: polling the status API endpoint for a large number of outlinks could cause the server to return 429 errors if the rate of requests is too high. The overall rate of requests would have to be controlled in some way, accounting for the additional requests made.
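One possible way to keep the overall request rate roughly constant is to scale each process's polling delay by the number of processes that are polling, and to back off after a 429. The sketch below only illustrates that arithmetic; `poll_delay` and `backoff_delay` are hypothetical helper names that do not exist in the script, and the numbers are placeholders rather than known limits of the status API.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the existing script.

# Print how many seconds one process should wait between its own status
# requests so that all pollers together send about one request every
# target_gap seconds.
poll_delay() {
  local target_gap="$1" pollers="$2"
  # Treat a zero or negative count as a single poller.
  if (( pollers < 1 )); then
    pollers=1
  fi
  echo $(( target_gap * pollers ))
}

# Print the next delay after a 429 response: double the current delay,
# but never exceed the cap.
backoff_delay() {
  local current="$1" cap="$2"
  local next=$(( current * 2 ))
  if (( next > cap )); then
    next="$cap"
  fi
  echo "$next"
}
```

With these helpers, a poller that knows there are `n` active pollers could run `sleep "$(poll_delay 2 "$n")"` between requests, and switch to `backoff_delay` whenever the server answers with a 429.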
One way to implement this would be to add a separate text file (`spn2-outlinks.txt`) to which outlink status IDs are added upon completion of the main capture job. A check for this file could be added at some point in the main while loops (the ones starting at lines 579, 609 and 680), and the child processes could be spawned from those loops. Importantly, this approach would allow the child processes to be immediate children of the main process, so they would be counted by `jobs -p`. The script would probably have to pause new job submissions while the child processes for the outlinks are spawned. A variable could be used to store remaining lines if the child processes for the outlinks are not spawned in one go.
Alternatively, this could be done within each `capture()` child process immediately after the status API endpoint returns a successful capture and the list of outlinks. However, this would not be visible to the currently implemented check on the number of child processes (i.e. `jobs -p`), and the rate of requests of all parts of the script would have to be slowed down to account for this (unless the status API endpoint was just checked really infrequently).
We would have to decide whether failed outlink captures should be retried. Presumably, the outlinks of these pages would not be collected, so they would have to be listed separately from the main `failed.txt` list. An extra variable would have to be passed to the `capture` function to indicate whether or not to set `capture_outlinks=1`.
This option would also need to interface appropriately with the `-o`, `-x` and `-r` options.
This idea was previously listed in the "Future plans" in README.md, but I've removed that section since it's basically outdated and no longer relevant.
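The queue-file approach described above (reading `spn2-outlinks.txt` from the main loop and spawning children there) might look roughly like the following. This is a sketch under assumptions, not the script's actual code: `check_outlink`, `spawn_outlink_jobs` and `pending_outlinks` are hypothetical names, and `check_outlink` just sleeps instead of contacting the status API.

```bash
#!/usr/bin/env bash
# Sketch only. check_outlink is a hypothetical stand-in for a function that
# would poll the status API for one outlink job ID; here it just sleeps.
check_outlink() {
  sleep 2
}

# IDs read from the queue file but not yet handed to a child process.
pending_outlinks=()

# Read any queued IDs, then start background children until max_jobs are
# running. The children are started from the calling shell, so they show up
# in its `jobs -p` output.
spawn_outlink_jobs() {
  local queue_file="$1" max_jobs="$2" line
  if [[ -f "$queue_file" ]]; then
    while IFS= read -r line; do
      if [[ -n "$line" ]]; then
        pending_outlinks+=("$line")
      fi
    done < "$queue_file"
    # Remove the file so IDs appended later are not read twice.
    rm -f "$queue_file"
  fi
  while (( ${#pending_outlinks[@]} > 0 )) && (( $(jobs -pr | wc -l) < max_jobs )); do
    check_outlink "${pending_outlinks[0]}" &
    # Drop the ID that was just started; the rest wait for the next call.
    pending_outlinks=("${pending_outlinks[@]:1}")
  done
}
```

Calling `spawn_outlink_jobs spn2-outlinks.txt "$limit"` once per iteration of a main loop would start as many children as the limit allows and keep the rest in `pending_outlinks` for the next pass. One caveat of this sketch: an ID appended by a capture process between the read and the `rm` would be lost, so a real implementation would probably need to move the file aside first or use a lock.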