overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License
104 stars 9 forks

Set a retry limit #26

Open krackers opened 1 year ago

krackers commented 1 year ago

For some reason Wayback seems to trip on certain URLs, and it always returns an error:

The target URL <url> returned a Bad Gateway error. (HTTP status=502).

This causes the job to keep retrying even though it gets the same error back every time. Ideally the script should stop retrying after 5 failures or so. Interestingly, this seems to be a Wayback Machine issue, since the URL does eventually appear in the archive.

overcast07 commented 1 year ago

In practice, the script relies on SPN returning the "this URL has already been captured 10 times today" error message to limit the number of retries for URLs that always return a theoretically recoverable error. There is technically already a limit in the script, but it only triggers when failed.txt is exactly the same as the previous copy (i.e. the same URLs failed in the exact same order), which evidently doesn't cover many cases.
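For illustration, that existing stop condition presumably boils down to a file comparison along these lines (a rough sketch; the file names and the exit behavior are assumptions, not the script's actual code):

```bash
# Rough sketch of the existing stop condition: only give up when the
# new failure list is byte-identical to the previous one (same URLs,
# same order). File names here are placeholders, not the script's own.
if cmp -s failed.txt failed-previous.txt; then
    echo "Same URLs failed in the same order; giving up." >&2
    exit 1
fi
cp failed.txt failed-previous.txt  # keep a copy for the next pass
```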

Currently the script does not explicitly track how many times each URL has failed, although that could be determined by counting the number of appearances of each URL in the failed*.txt files. I don't really understand awk code, but I was able to find a Stack Exchange answer that could potentially be adapted to this use case.
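For example, a standard awk frequency count over the logs would produce per-URL failure totals, something like this (a minimal sketch; MAX_RETRIES and the failed*.txt glob are assumptions about how it might be wired in):

```bash
# Count how many times each URL appears across the failed*.txt logs
# and print only the URLs that have failed at least MAX_RETRIES times.
MAX_RETRIES=5
awk -v max="$MAX_RETRIES" '
    { count[$0]++ }
    END { for (url in count) if (count[url] >= max + 0) print url }
' failed*.txt
```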

In some cases it is possible that a URL would only be successfully archived after 6 or more attempts, though this would only happen if the probability of the captures failing is particularly high. Since SPN already has a measure built in to prevent URLs from being endlessly retried, I would probably implement this as an optional flag that lets the user choose the maximum number of retries (i.e. with -m 1, URLs are added to max-retries.log after 1 retry).
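A sketch of how that flag might work, assuming getopts parsing and treating the file names (failed.txt, max-retries.log, retry-queue.txt) as placeholders rather than the script's actual ones:

```bash
# Parse the proposed -m flag (hypothetical; shown in isolation here).
max_retries=""
while getopts "m:" opt; do
    case $opt in
        m) max_retries=$OPTARG ;;
    esac
done

# For each URL that just failed, count its appearances across all of
# the failed*.txt logs; once it exceeds the limit, divert it to
# max-retries.log instead of requeueing it.
while read -r url; do
    count=$(cat failed*.txt 2>/dev/null | grep -cFx -- "$url")
    if [ -n "$max_retries" ] && [ "$count" -gt "$max_retries" ]; then
        echo "$url" >> max-retries.log
    else
        echo "$url" >> retry-queue.txt
    fi
done < failed.txt
```

Using grep -Fx keeps the match to whole, literal lines, so URLs containing regex metacharacters are still counted correctly.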