Open · henare opened 7 years ago
The task has been temporarily disabled and I'm going to assign @auxesis to look into this.
We currently have around 20 Sidekiq jobs waiting to retry thanks to #1123. However, they have no corresponding containers, running or stopped. I think what might have happened is that their stopped containers were deleted while they were waiting to retry. If that's the case, then when they retry they will restart the whole scraping job, creating a new container, and presumably hitting the same problem over and over. This means they can never finish.
I've checked today and there is no corresponding Docker container for any of the jobs currently in retry due to Docker::Error::TimeoutError: read timeout reached. So that indicates the above is correct.
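The orphan check described above can be sketched as a pure function: given the run IDs sitting in the Sidekiq retry set and the names of all containers, running or stopped (as you'd get from `docker ps --all`), list the retries that have no matching container. The `run-<id>` naming is a made-up convention for illustration, not morph's actual scheme, and the IDs are invented:

```ruby
# Sketch: which retry jobs have lost their container?
# In production the inputs would come from Sidekiq::RetrySet and
# `docker ps --all --format '{{.Names}}'`; here they are hard-coded.
def orphaned_retries(retry_run_ids, container_names)
  # Assumes each run's container is named "run-<id>" (an assumption,
  # not morph's real naming scheme).
  retry_run_ids.reject { |id| container_names.include?("run-#{id}") }
end

retry_ids  = [101, 102, 103]
containers = ["run-101", "run-103"]
puts orphaned_retries(retry_ids, containers).inspect  # => [102]
```

A retry that shows up here will rebuild its container from scratch on the next attempt, which is exactly the restart-loop behaviour described above.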
In #1112 we started to run
docker system prune --all --force
every hour. This removes all images not in use, not just dangling ones. That certainly sounds like it will remove images we want, such as those built from the buildpack base image (and the base image itself!). These images cache commonly used language and library combinations and significantly speed up scraper runs compared to compiling and installing everything from scratch on each run.
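The distinction matters because, per the Docker docs, `docker system prune` without `--all` removes only dangling images (untagged and unreferenced), while `--all` removes every image not used by a container. A toy model of that rule, using a made-up `Image` struct rather than any real Docker API:

```ruby
# Toy model of Docker's image-prune rule, for illustration only.
# `Image` is an invented struct, not part of the docker-api gem.
Image = Struct.new(:name, :in_use, :dangling)

# Without all: remove only dangling, unused images.
# With all: remove every image not used by a container.
def pruned(images, all: false)
  images.select { |i| !i.in_use && (all || i.dangling) }.map(&:name)
end

images = [
  Image.new("buildpack-base", false, false), # cached base image, tagged
  Image.new("old-layer",      false, true),  # dangling build leftover
  Image.new("running-app",    true,  false)  # in use by a container
]

puts pruned(images).inspect             # => ["old-layer"]
puts pruned(images, all: true).inspect  # => ["buildpack-base", "old-layer"]
```

Under this model the hourly `--all` prune deletes the tagged-but-idle buildpack base image too, which matches the cache busting we're seeing.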
Having this cache busted has the effect of making scrapers take longer to run for people as the base language and libraries will need to be downloaded, compiled, and installed more often. Another, lesser, problem is that it also means more server CPU, memory, and bandwidth is consumed doing these things repeatedly.
A more important issue is something @equivalentideas mentioned:
"if this is the case it will be causing problems for scraper runs". The Docker documentation is unclear about whether this command actually removes these images, but we're observing a bug that would be consistent with it.
Console output also supports this theory, e.g.: