openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Scrapers with no console output can take an unnecessarily long time to finish running #1123

Open henare opened 7 years ago

henare commented 7 years ago

The Docker client has a read timeout for console output, which we set to 5 minutes. This means that if a scraper doesn't output anything for 5 minutes, the background worker throws an exception: Docker::Error::TimeoutError: read timeout reached.

Normally this isn't a big deal: Sidekiq just retries the job, and it finishes up as usual in one of those retries. However, it can be a problem for a long-running scraper, because Sidekiq backs off its retries. The scraper's Docker run may have finished and its container stopped, but the background job is backed off and won't retry for another few hours.

The effect is that the job takes much longer to finish than it needs to, and it unnecessarily occupies a queue slot while it waits to retry and finish up.
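To get a feel for how the back-off described above reaches "a few hours", here is a rough sketch of the arithmetic. It assumes Sidekiq's default retry delay is approximately `retry_count ** 4 + 15` seconds plus some random jitter; the jitter is left out here, and the exact formula may differ between Sidekiq versions. The helper names `base_retry_delay` and `format_duration` are made up for this illustration and are not part of Sidekiq or morph.

```ruby
# Approximate lower bound on Sidekiq's default retry delay, in seconds.
# Assumption: delay is roughly (retry_count ** 4) + 15 plus random jitter;
# the jitter term is ignored, so real delays would be somewhat longer.
def base_retry_delay(retry_count)
  (retry_count**4) + 15
end

# Render a number of seconds as hours and minutes (seconds are dropped).
def format_duration(seconds)
  hours, rest = seconds.divmod(3600)
  minutes = rest / 60
  format("%dh %02dm", hours, minutes)
end

# Print the approximate minimum wait before each of the first few retries.
(0..10).each do |n|
  puts "retry #{n}: waits at least #{format_duration(base_retry_delay(n))}"
end
```

Under that assumption, a scraper that stays silent long enough to hit the 5 minute read timeout around ten times would leave its job waiting well over two hours for the next retry, even if the container has already stopped.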

henare commented 7 years ago

Is there any reason not to set that timeout to 24 hours? That's the maximum amount of time a scraper can run, so doesn't it make sense not to time out for the possible duration of a run?
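As a minimal sketch of what that change might look like, the read timeout could be derived from the maximum run time rather than hard-coded. The constant and variable names below are hypothetical, and the commented-out line assumes the docker-api gem accepts connection options (including a `read_timeout` in seconds) through `Docker.options`; how morph actually configures its Docker client is not shown in this thread.

```ruby
# Maximum time a scraper is allowed to run: 24 hours, in seconds.
# (Hypothetical constant name, for illustration only.)
MAX_SCRAPER_RUN_TIME_SECONDS = 24 * 60 * 60

# Options intended for the Docker client. Using the maximum run time as the
# read timeout means a silent scraper would not trip the timeout mid-run.
docker_client_options = {
  read_timeout: MAX_SCRAPER_RUN_TIME_SECONDS
}

# Assumption: with the docker-api gem loaded, this would be applied with
# something like:
#   Docker.options = docker_client_options

puts "read timeout: #{docker_client_options[:read_timeout]} seconds"
```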