openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Docker pruning is busting image cache and possibly destroying containers in use #1124

Open henare opened 7 years ago

henare commented 7 years ago

In #1112 we started to run docker system prune --all --force every hour. This removes all images not in use, not just dangling ones.

That certainly sounds like it will remove images we want, such as those built from the buildpack base image (and the base image itself!). These images cache commonly used language and library combinations and significantly speed up scraper runs compared with compiling and installing everything from scratch on every run.

Busting this cache makes scrapers take longer to run for people, because the base language and libraries need to be downloaded, compiled, and installed more often. Another, lesser, problem is that repeating this work also consumes more server CPU, memory, and bandwidth.
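For comparison, the stock Docker CLI can prune less aggressively than --all. A minimal sketch, assuming we only want to reclaim space from untagged layers and leave the tagged buildpack base images alone (this is not what the cron job currently runs):

# Remove only dangling (untagged) image layers; tagged base images survive
docker image prune --force
# Optionally, also drop unused images created more than a week ago
docker image prune --all --force --filter "until=168h"

Note that neither of these touches containers, stopped or running.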


A more important issue is something @equivalentideas mentioned:

the regular docker prune -af is destroying all [stopped] containers

If this is the case it will be causing problems for scraper runs. The Docker documentation is unclear about whether this command actually removes stopped containers, but we're observing behaviour consistent with it doing so.
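One way to settle that is a quick experiment on a box we don't mind poking at. A sketch, using a throwaway container name (prune-test is made up for the test):

# Create a container that exits immediately, so it sits in the Exited state
docker run --name prune-test alpine:3.6 true
docker ps --all --filter "name=prune-test"
# Run the prune; container behaviour is the same with or without --all
docker system prune --force
# If the container is gone from this listing, prune does remove stopped containers
docker ps --all --filter "name=prune-test"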

We currently have around 20 Sidekiq jobs waiting to retry thanks to #1123. However, they have no corresponding containers, running or stopped. I think what might have happened is that their stopped containers were deleted while they were waiting to retry. If that's the case then each retry will restart the whole scraping job, create a new container, and presumably hit the same problem over and over, which means they can never finish.

Console output also supports this theory, e.g.:

Injecting configuration and compiling...
Injecting scraper and running...
Injecting configuration and compiling...
Injecting scraper and running...
Injecting configuration and compiling...
Injecting scraper and running...
Injecting configuration and compiling...
Injecting scraper and running...
Injecting configuration and compiling...
Injecting scraper and running...
henare commented 7 years ago

The task has been temporarily disabled and I'm going to assign @auxesis to look into this.

henare commented 7 years ago

We currently have around 20 Sidekiq jobs waiting to retry thanks to #1123. However, they have no corresponding containers, running or stopped. I think what might have happened is that their stopped containers were deleted while they were waiting to retry. If that's the case then each retry will restart the whole scraping job, create a new container, and presumably hit the same problem over and over, which means they can never finish.

I've checked today and there is a corresponding Docker container for every job currently in retry due to Docker::Error::TimeoutError: read timeout reached. With the prune task disabled the containers are sticking around, which indicates the theory above is correct.
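For anyone repeating this check, a sketch of how the stopped containers can be listed for matching against the Sidekiq retry queue by hand (the format string is just an example):

# List every stopped container with its name and status for cross-checking
docker ps --all --filter "status=exited" --format "{{.Names}}\t{{.Status}}"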