openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Backend throws Docker::Error::NotFoundError errors occasionally since recent upgrades #1034

Open henare opened 8 years ago

henare commented 8 years ago

It's happening at the point in code where morph.io is trying to attach to a stopped container and finish the run.

[Morph/production] Docker::Error::NotFoundError: open /var/lib/docker/containers/1bdb967ce0f3933068d77cf3d1b93411f29ec6067d556b30f555464c3778894d/1bdb967ce0f3933068d77cf3d1b93411f29ec6067d556b30f555464c3778894d-json.log: no such file or directory

Backtrace

line 118 of [PROJECT_ROOT]/lib/morph/docker_runner.rb: attach_to_run_and_finish
line 103 of [PROJECT_ROOT]/lib/morph/runner.rb: attach_to_run_and_finish
line 50 of [PROJECT_ROOT]/lib/morph/runner.rb: go

View full backtrace and more info at honeybadger.io

mlandauer commented 8 years ago

There are 100 failed jobs on the queue right now, most of them due to this error. So this definitely needs fixing.

mlandauer commented 8 years ago

This is what I've figured out so far. The containers where that error occurs are all marked as having status "dead". When you go into the /var/lib/docker/containers directory, there is no directory for that container. So it looks like something is going wrong with the container and it's not getting cleaned up fully: the container data is gone, but Docker still holds a reference to it.

The workaround for the time being is to remove the dead containers and rerun the jobs, which seems to clear things out.
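
The workaround above could be scripted roughly like this. This is only a sketch, not code from morph.io: `remove_dead_containers` is a made-up name, and it assumes each container object responds to `id`, `info` and `remove` the way the docker-api gem's `Docker::Container` does, with `info["State"]` holding the plain state string that Docker's container list endpoint returns (older API versions may not include that field).

```ruby
# Hypothetical helper, not part of morph.io.
# `containers` is any list of objects that respond to:
#   id     -> container id string
#   info   -> Hash from Docker's container list endpoint (assumed to have "State")
#   remove -> deletes the container, accepting a force option
def remove_dead_containers(containers)
  # Keep only the containers Docker reports as "dead".
  dead = containers.select { |c| c.info["State"] == "dead" }

  # Force removal, since a dead container is only partially cleaned up.
  dead.each { |c| c.remove(force: true) }

  # Return the ids so the matching jobs can be rerun.
  dead.map(&:id)
end
```

With the docker-api gem, the list could come from something like `Docker::Container.all(all: true)`. The returned ids tell you which runs need to be requeued.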

We'll need a bit more of a clue as to why this is happening in the first place.

mlandauer commented 8 years ago

Yesterday I upgraded the Docker server again. As far as I can tell, the problem has gone away now. Let's leave it for a few more days.

mlandauer commented 8 years ago

It's still happening occasionally :-(

henare commented 7 years ago

This is a big contributing factor to #1098. When it happens to a job, the job takes up a slot, keeps retrying and never finishes. So once a few jobs end up like this, the slots start filling up.
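
One possible way to stop a stuck job from holding a slot forever is to cap the retries. The following is a minimal sketch under assumptions, not how morph.io's queue actually works: `run_with_retry_limit` and its parameters are hypothetical names, and the caller would still have to mark the run as failed and free the slot when it gets `:gave_up` back.

```ruby
# Hypothetical sketch, not morph.io's actual retry logic.
# Runs the block up to max_attempts times.
# Returns :finished if the block completes, or :gave_up once one of the
# errors listed in give_up_on has been raised max_attempts times.
# Any other error propagates to the caller unchanged.
def run_with_retry_limit(max_attempts: 3, give_up_on: [StandardError])
  attempts = 0
  begin
    attempts += 1
    yield attempts
    :finished
  rescue *give_up_on
    # Try again until the limit is reached, then stop retrying.
    retry if attempts < max_attempts
    :gave_up
  end
end
```

For the error in this issue, `give_up_on` would presumably be `[Docker::Error::NotFoundError]`, wrapped around the call that attaches to the container and finishes the run.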