Open henare opened 8 years ago
There are 100 failed jobs on the queue right now, most of them due to this error. So this definitely needs fixing.
This is what I've figured out so far. Every container where that error occurs is marked as having status "dead". When you go into the /var/lib/docker/containers directory there is no directory for that container.
So it looks like something is going wrong with the container and it's not getting cleaned up fully: the container data is gone, but docker still holds a reference to it.
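To check that observation across many containers at once, something like the following rough sketch might help. It assumes you already have the full (untruncated) IDs of the dead containers, for example from `docker ps --all --quiet --no-trunc --filter status=dead`, and that the data root is the default /var/lib/docker/containers. The function name `containers_missing_data` is made up for illustration and is not part of morph.io or docker.

```python
from pathlib import Path
from typing import Iterable, List


def containers_missing_data(
    container_ids: Iterable[str],
    root: str = "/var/lib/docker/containers",
) -> List[str]:
    """Return the IDs that have no directory under the docker containers root.

    Hypothetical helper: assumes each container's data lives in a directory
    named after its full ID, directly under `root`.
    """
    root_path = Path(root)
    return [cid for cid in container_ids if not (root_path / cid).is_dir()]


if __name__ == "__main__":
    import sys

    # Usage: pass full container IDs as arguments (reading the docker
    # directory usually needs root).
    for cid in containers_missing_data(sys.argv[1:]):
        print(f"no data directory for {cid}")
```

If every dead container shows up in the output, that matches what's described above: docker still knows about the container but its data is gone.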
The simple workaround for the time being is to remove the dead containers and rerun the jobs, which seems to clear things out.
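For reference, the "remove the dead containers" half of that workaround could be scripted roughly like this. It's only a sketch: it shells out to the docker CLI (`docker ps` with the `status=dead` filter, then `docker rm`), the helper names are invented for this example, and it doesn't rerun any jobs. It defaults to a dry run so you can see what it would remove first.

```python
import subprocess
from typing import Callable, List


def run_docker(args: List[str]) -> str:
    """Run a docker CLI command and return its stdout (raises on failure)."""
    result = subprocess.run(
        ["docker", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def dead_container_ids(run: Callable[[List[str]], str] = run_docker) -> List[str]:
    """List the IDs of containers that docker reports with status "dead"."""
    out = run(["ps", "--all", "--quiet", "--filter", "status=dead"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def remove_dead_containers(
    run: Callable[[List[str]], str] = run_docker, dry_run: bool = True
) -> List[str]:
    """Remove dead containers; with dry_run=True only report what was found."""
    ids = dead_container_ids(run)
    if not dry_run:
        for cid in ids:
            run(["rm", cid])
    return ids


if __name__ == "__main__":
    # Dry run by default: prints the dead containers without removing them.
    for cid in remove_dead_containers(dry_run=True):
        print(f"would remove dead container {cid}")
```

The `run` parameter is there just so the docker call can be swapped out when trying this without a docker daemon.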
We'll need a bit more of a clue as to why this is happening in the first place.
Yesterday I upgraded the docker server again. As far as I can tell the problem has gone away. Let's leave it for a few more days.
It's still happening occasionally :-(
This is a big contributing factor to #1098. When it happens to a job, the job takes up a slot, keeps retrying and never finishes. So if you get a few like this, the slots just start filling up.
It's happening at the point in the code where morph.io tries to attach to a stopped container and finish the run.
Backtrace
View full backtrace and more info at honeybadger.io