In the description of Part XI, I remarked, "We have not yet seen the system respond to the triggering of the global rate limit." Now we have :-) Last night, Internet Archive experienced some downtime and was swamped when they came back up.
Our system responded reasonably well to the outage: it refrained from enqueuing additional uploads, stayed well within our bucket- and API-specific limits, and, as IA's capacity returned to normal, recovered on its own.
This PR makes a number of small tweaks intended to make that process work even more smoothly.
It catches a previously uncaught exception, occasionally raised when IA is unresponsive mid-upload.
It avoids queuing any more confirmation... tasks if the read-only IA queue contains any items... A non-empty queue indicates that IA is responding more slowly than usual, so queuing more at that point would necessarily create duplicate tasks. (Since we check in on this every five minutes, and the tasks generally get buzzed through very quickly, skipping a round is not a big deal.)
It tweaks the retry behavior of confirmation... tasks: instead of retrying infinitely on ConnectionError, they now retry only up to settings.INTERNET_ARCHIVE_RETRY_FOR_CONFIRMATION_CONNECTION_ERROR times, to reduce churn during an outage... The check will run again in five minutes anyway. (We could do the same with upload attempts as well...... to be pondered.)
It tries to prevent the automatic retries of... something... that I saw going by in the logs, but that I couldn't trace to a particular line of code, by using a non-retrying HTTP adapter everywhere.
It also makes a number of small improvements to the /manage/stats page used for monitoring, including making the tabs sticky.
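For reviewers who want a picture of the queue check described above (skip queuing confirmation tasks while the read-only IA queue still holds items), here is a minimal sketch in plain Python. Every name in it (`queue_confirmation_tasks`, `readonly_queue_length`, `enqueue`) is hypothetical and chosen for illustration; the actual change presumably inspects the real task queue rather than taking callables.

```python
# Hypothetical sketch: none of these names come from the actual PR.

def queue_confirmation_tasks(pending_items, readonly_queue_length, enqueue):
    """Queue one confirmation task per pending item, unless the
    read-only IA queue still has work waiting in it.

    pending_items: iterable of items awaiting confirmation
    readonly_queue_length: callable returning the number of tasks
        currently waiting in the read-only IA queue
    enqueue: callable that queues a confirmation task for one item

    Returns the number of tasks queued on this run.
    """
    if readonly_queue_length() > 0:
        # A backlog suggests IA is responding slowly. Queuing more now
        # would duplicate tasks that are already waiting, so skip this
        # round and let the next periodic run try again.
        return 0

    queued = 0
    for item in pending_items:
        enqueue(item)
        queued += 1
    return queued
```

Because the check runs periodically, returning early costs at most one interval of delay, which is the trade-off the description accepts.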
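The bounded-retry change to the confirmation tasks can be illustrated with a small sketch as well. This is plain Python under stated assumptions, not the PR's code: `confirm_with_bounded_retries` and `MAX_CONNECTION_ERROR_RETRIES` are hypothetical names, the constant stands in for settings.INTERNET_ARCHIVE_RETRY_FOR_CONFIRMATION_CONNECTION_ERROR, and Python's built-in `ConnectionError` stands in for whichever exception class the real tasks catch. The real tasks likely rely on the task framework's own retry mechanism rather than a loop.

```python
# Hypothetical stand-in for
# settings.INTERNET_ARCHIVE_RETRY_FOR_CONFIRMATION_CONNECTION_ERROR
MAX_CONNECTION_ERROR_RETRIES = 3


def confirm_with_bounded_retries(check, max_retries=MAX_CONNECTION_ERROR_RETRIES):
    """Call check() until it succeeds, retrying on ConnectionError at
    most max_retries times (so at most max_retries + 1 attempts).

    Returns (result, attempts) on success, or (None, attempts) once the
    retry budget is spent. Giving up is acceptable here because the
    periodic check will come around again.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return check(), attempts
        except ConnectionError:
            if attempts > max_retries:
                # Retry budget exhausted: stop churning during an outage.
                return None, attempts
```

Any exception other than `ConnectionError` propagates immediately, so only connection failures consume the retry budget.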