The Docker client has a read timeout for console output. We set this to 5 minutes, which means that if a scraper doesn't output anything for 5 minutes the background worker throws an exception: `Docker::Error::TimeoutError: read timeout reached`.
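For context, the exception class comes from the Ruby docker-api gem, where the read timeout is one of the client options passed through to the underlying Excon connection. A minimal sketch of where the error appears (the container name and log-streaming call here are illustrative, not morph's actual code):

```ruby
require 'docker'

# docker-api passes read_timeout through to its Excon connection.
Docker.options = { read_timeout: 5 * 60 } # the current 5 minute limit

container = Docker::Container.get('some-scraper-container') # hypothetical name
begin
  # Streaming logs blocks between reads; if the scraper prints nothing
  # for longer than read_timeout, the gem raises the timeout error.
  container.streaming_logs(stdout: true, stderr: true, follow: true) do |_stream, chunk|
    print chunk
  end
rescue Docker::Error::TimeoutError => e
  puts e.message # => "read timeout reached"
end
```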
Normally this isn't a big deal: Sidekiq just retries, and the job finishes up as usual on one of those retries. However, it can be a problem for a long-running scraper because Sidekiq backs off its retries, i.e. the scraper's Docker run has finished and its container has stopped, but the backed-off background job won't retry for another few hours.
This means the job takes much longer than it needs to finish, and it also unnecessarily holds a queue slot while it waits to retry and finish up.
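To put numbers on the back-off: Sidekiq's documented default retry delay grows roughly as the fourth power of the retry count (plus some random jitter, ignored here), so after a handful of failures the wait between attempts is measured in hours:

```ruby
# Approximate Sidekiq default retry schedule: (count ** 4) + 15 seconds,
# ignoring the jitter Sidekiq adds on top.
(0..10).each do |count|
  delay = (count ** 4) + 15
  printf("retry %2d: ~%6d s (%.1f h)\n", count + 1, delay, delay / 3600.0)
end
# By the 10th retry the delay is already over 1.8 hours.
```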
Is there any reason not to set that timeout to 24 hours? That's the maximum amount of time a scraper can run, so doesn't it make sense not to time out within the possible duration of a run?
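Concretely, the proposal would be a one-line change to the client options (a sketch, assuming the same docker-api setup as above):

```ruby
# Proposed: make the read timeout at least as long as the maximum
# scraper run time, so a silent-but-healthy run never trips it.
Docker.options = { read_timeout: 24 * 60 * 60 } # 24 hours
```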