Closed by rebeccacremona 1 week ago
Codecov report: All modified and coverable lines are covered by tests :white_check_mark: Project coverage is 69.62%. Comparing base (d4747b8) to head (6df01a5). Report is 8 commits behind head on develop.
See LIL-2877.
Since we integrated with the Scoop API in 2023, we have sporadically seen `save_scoop_capture` fail with a `ProtocolError` when attempting to download the WARC/WACZ file from the API: either a `ConnectionResetError`, early on, or, more commonly recently, an `IncompleteRead`.

Sometimes there is a flurry of errors for a few minutes, and then it resolves; sometimes there is a standalone error. The incidents happen at different times of day and on different days. It might happen several days in a row, and then not again for weeks or months. The incidence picked up sharply in late September 2024.

From reading around, I believe this is due to transient network problems; I have not heard any suggestions for solutions other than "check your internet connection" or "try again."
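For concreteness, here is a hypothetical sketch of the failure path; the function name, URL, and use of `requests` are assumptions for illustration, not our actual code. Reading the WARC/WACZ body over a flaky connection is where urllib3 raises `ProtocolError`, wrapping either a `ConnectionResetError` or an `IncompleteRead`.

```python
# Hypothetical illustration only: names, URL, and the use of requests are
# placeholders, not our actual download code.
import requests
from urllib3.exceptions import ProtocolError

def download_artifact(artifact_url: str) -> bytes:
    # Auth details omitted for brevity.
    response = requests.get(artifact_url, stream=True, timeout=60)
    response.raise_for_status()
    # Reading the streamed body is where the transient failures surface:
    # urllib3 wraps a reset connection or a short read in ProtocolError, e.g.
    #   ProtocolError("Connection aborted.", ConnectionResetError(104, ...))
    #   ProtocolError("Connection broken: IncompleteRead(...)", IncompleteRead(...))
    return response.raw.read()

# Usage (URL is a placeholder):
#   data = download_artifact("https://scoop.example.org/api/capture/123/artifact")
# On a dropped or truncated transfer, this raises urllib3.exceptions.ProtocolError.
```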
So, this PR... tries again.
It reuses our standard utility for retries, which retries with an exponential backoff starting from a 100ms delay. I arbitrarily set the number of retries to 3. Though that doesn't end up introducing much of a delay, I think that's okay for a first pass: since each API call itself takes time, there is an additional built-in delay between attempts. A rough sketch of the intended behavior is below.
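This is not the actual utility, just a minimal sketch of the behavior described above, assuming we retry only on `ProtocolError`: up to 3 retries with exponential backoff starting at 100ms (waits of 0.1s, 0.2s, and 0.4s between attempts).

```python
# Minimal sketch of the retry behavior; not the project's real retry utility.
import time
from urllib3.exceptions import ProtocolError

def retry(func, exceptions=(ProtocolError,), retries=3, first_delay=0.1):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of retries: let the error propagate as before
            time.sleep(first_delay * (2 ** attempt))

# Hypothetical usage, wrapping the download sketched earlier:
#   warc_bytes = retry(lambda: download_artifact(artifact_url))
```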
I have not found a good way to simulate or reproduce the error locally, so it is merely a hypothesis that this will help.
If we decide to merge and deploy this, the follow-up would be: watch and see how things go for a few weeks. If we see occasional single failures (say, one every few days) or any longer incidents, we could consider bumping up the number of retries. If we don't see any longer incidents for several weeks (let's say, 2 months), then I would be convinced this mechanism is working, and not simply that the problem hasn't recurred.