Closed by rebeccacremona 1 week ago
Codecov report: All modified and coverable lines are covered by tests :white_check_mark: Project coverage is 69.62%. Comparing base (d4747b8) to head (6df01a5). Report is 8 commits behind head on develop.
See LIL-2877.
Since we integrated with the Scoop API in 2023, we have sporadically seen `save_scoop_capture` fail with a `ProtocolError` when attempting to download the WARC/WACZ file from the API: either a `ConnectionResetError`, early on, or, more commonly recently, an `IncompleteRead`.

Sometimes there is a flurry of errors for a few minutes, and then it resolves; sometimes there is a standalone error. The incidents happen at different times of day and on different days. It might happen several days in a row, and then not again for weeks or months. The incidence picked up sharply in late September 2024.

From reading around, I believe this is due to transient network problems; I have not heard any suggestions for solutions other than "check your internet connection" or "try again."
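For concreteness, here is a hypothetical sketch of the failure path; the function name, URL, and use of `requests` are assumptions for illustration, not our actual code. Reading the WARC/WACZ body over a flaky connection is where urllib3 raises `ProtocolError`, wrapping either a `ConnectionResetError` or an `IncompleteRead`.

```python
# Hypothetical illustration only: names, URL, and the use of requests are
# placeholders, not our actual download code.
import requests
from urllib3.exceptions import ProtocolError

def download_artifact(artifact_url: str) -> bytes:
    # Auth details omitted for brevity.
    response = requests.get(artifact_url, stream=True, timeout=60)
    response.raise_for_status()
    # Reading the streamed body is where the transient failures surface:
    # urllib3 wraps a reset connection or a short read in ProtocolError, e.g.
    #   ProtocolError("Connection aborted.", ConnectionResetError(104, ...))
    #   ProtocolError("Connection broken: IncompleteRead(...)", IncompleteRead(...))
    return response.raw.read()

# Usage (URL is a placeholder):
#   data = download_artifact("https://scoop.example.org/api/capture/123/artifact")
# On a dropped or truncated transfer, this raises urllib3.exceptions.ProtocolError.
```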
So, this PR... tries again.
It reuses our standard utility for retries, which retries with an exponential backoff starting from a 100ms delay. I arbitrarily set the number of retries to 3. Though that doesn't end up introducing much of a delay, I think that's okay for a first pass: since each API call itself takes time, there is an additional built-in delay between attempts. A rough sketch of the intended behavior is below.
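This is not the actual utility, just a minimal sketch of the behavior described above, assuming we retry only on `ProtocolError`: up to 3 retries with exponential backoff starting at 100ms (waits of 0.1s, 0.2s, and 0.4s between attempts).

```python
# Minimal sketch of the retry behavior; not the project's real retry utility.
import time
from urllib3.exceptions import ProtocolError

def retry(func, exceptions=(ProtocolError,), retries=3, first_delay=0.1):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of retries: let the error propagate as before
            time.sleep(first_delay * (2 ** attempt))

# Hypothetical usage, wrapping the download sketched earlier:
#   warc_bytes = retry(lambda: download_artifact(artifact_url))
```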
I have not found a good way to simulate or reproduce the error locally, so it is merely a hypothesis that this will help.
If we decide to merge and deploy this, the follow-up would be: watch and see how things go for a few weeks. If we see occasional single failures (say, one every few days) or any longer incidents, we could consider bumping up the number of retries. If we don't see any longer incidents for several weeks (let's say, 2 months), then I would be convinced this mechanism is working, and not simply that the problem hasn't recurred.