nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0
2.71k stars 151 forks source link

Container running with sysbox-runc runtime not responding #529

Open OriGlassman opened 2 years ago

OriGlassman commented 2 years ago

Hi,

We're running several containers, where only one of them is run with sysbox-runc runtime. After some time, the container that is running with the sysbox-runc runtime seem to not work properly, and can't be closed (docker rm -f just hangs until timeout is reached). Meanwhile, the other containers running with the normal runc runtime works properly.

System information:

Supplying the logs for sysbox-mgr as it does state an error unlike sysbox-fs. Debug logs from sysbox-mgr.log:

time="2022-04-08 14:00:57" level=info msg="released resources for container 91674898f90d"
time="2022-04-08 14:01:04" level=debug msg="dockerVolMgr: created volume for container 8ec38dddb942"
time="2022-04-08 14:01:04" level=debug msg="k3sVolMgr: created volume for container 8ec38dddb942"
time="2022-04-08 14:01:05" level=debug msg="rke2VolMgr: created volume for container 8ec38dddb942"
time="2022-04-08 14:01:05" level=debug msg="autoRemoveCheck: Docker query start for 8ec38dddb942"
time="2022-04-08 14:01:08" level=debug msg="FIXME: Got an status-code for which error does not match any expected type!!!: -1" module=api status_code=-1
time="2022-04-08 14:01:08" level=debug msg="autoRemoveCheck: Docker query for 8ec38dddb942 failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time="2022-04-08 14:01:08" level=debug msg="marked shiftfs for /var/lib/docker/overlay2/138b4468e27813eee3e9190b5ed2ca3eece8035c10bac04ea4a1b0bb5b9c5ad8/merged at /var/lib/sysbox/shiftfs/021301a4-d28b-4d0c-a3de-06c12efb7c02"
time="2022-04-08 14:04:22" level=debug msg="unmarked shiftfs for /var/lib/docker/overlay2/138b4468e27813eee3e9190b5ed2ca3eece8035c10bac04ea4a1b0bb5b9c5ad8/merged at /var/lib/sysbox/shiftfs/021301a4-d28b-4d0c-a3de-06c12efb7c02"
time="2022-04-08 14:04:22" level=debug msg="kubeletVolMgr: sync'd-out volume for container 8ec38dddb942"
time="2022-04-08 14:04:22" level=debug msg="k0sVolMgr: sync'd-out volume for container 8ec38dddb942"
time="2022-04-08 14:04:22" level=debug msg="buildkitVolMgr: sync'd-out volume for container 8ec38dddb942"
time="2022-04-08 14:04:22" level=debug msg="containerdVolMgr: sync'd-out volume for container 8ec38dddb942"
time="2022-04-08 14:04:23" level=debug msg="dockerVolMgr: sync'd-out volume for container 8ec38dddb942"
time="2022-04-08 14:04:23" level=debug msg="k3sVolMgr: sync'd-out volume for container 8ec38dddb942"
time="2022-04-08 14:04:23" level=debug msg="rke2VolMgr: sync'd-out volume for container 8ec38dddb942"
time="2022-04-08 14:04:23" level=debug msg="added fs watch on /var/lib/docker/overlay2/138b4468e27813eee3e9190b5ed2ca3eece8035c10bac04ea4a1b0bb5b9c5ad8"
time="2022-04-08 14:04:23" level=info msg="unregistered container 8ec38dddb942"
time="2022-04-08 14:04:24" level=debug msg="rootfsMon: rm on /var/lib/docker/overlay2/138b4468e27813eee3e9190b5ed2ca3eece8035c10bac04ea4a1b0bb5b9c5ad8"
time="2022-04-08 14:04:24" level=debug msg="kubeletVolMgr: destroyed volume for container 8ec38dddb942"
time="2022-04-08 14:04:24" level=debug msg="k0sVolMgr: destroyed volume for container 8ec38dddb942"
time="2022-04-08 14:04:24" level=debug msg="buildkitVolMgr: destroyed volume for container 8ec38dddb942"
time="2022-04-08 14:04:24" level=debug msg="containerdVolMgr: destroyed volume for container 8ec38dddb942"

Core dump of sysbox-mgr: https://easyupload.io/fg546r

rodnymolina commented 2 years ago

@OriGlassman, thanks for filing this issue.

Question. Is this container that appears in the logs above (8ec38dddb942) the one where problem was observed? If that's the case, I wonder if there was something special being done with the dockerd daemon at the time these logs were reported. This log indicates that Sysbox was unable to interact with dockerd during the container initialization, which is something that we don't usually see in docker-based setup like yours, so wondering if this has anything to do with this abnormal behavior.

time="2022-04-08 14:01:08" level=debug msg="FIXME: Got an status-code for which error does not match any expected type!!!: -1" module=api status_code=-1 time="2022-04-08 14:01:08" level=debug msg="autoRemoveCheck: Docker query for 8ec38dddb942 failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" 

A few more questions:

Thanks.

OriGlassman commented 2 years ago

Question. Is this container that appears in the logs above (8ec38dddb942) the one where problem was observed? If that's the case, I wonder if there was something special being done with the dockerd daemon at the time these logs were reported. This log indicates that Sysbox was unable to interact with dockerd during the container initialization, which is something that we don't usually see in docker-based setup like yours, so wondering if this has anything to do with this abnormal behavior.

We're running an environment where we every few minutes remove the container and then start a new one. There are a couple more consecutive messages like these. I guess that after a few times our code that is responsible for creating the new container fails (it says due to timeout with removing the container) and then it stops trying. When trying manually with docker cli, it's indeed in a state where the remove just blocks and nothing happens.

  • Was this a one-off event? Can you reproduce the issue?

A reboot to the VM releases the blocking situation. It happened several times so we can reproduce it (although the time to reproduce may vary as there isn't seem to be any special event that occurs prior to the problematic behaviour).

  • Which Sysbox release are you running?

v0.5.0

ctalledo commented 2 years ago

I agree that this error is very likely the smoking gun:

level=debug msg="autoRemoveCheck: Docker query for 8ec38dddb942 failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

This means that sysbox-mgr could not query the docker daemon, which suggests the docker daemon is not working properly. I don't believe we've ever seen this error before. The reason sysbox-mgr queries the docker daemon is to check if the container will be removed automatically after it's stopped (e.g., if it was launched with --rm), as sysbox uses an optimization when that's the case.

Question: after the error occurs, can you launch new docker containers (e.g., with the standard runc)?

OriGlassman commented 2 years ago

Question: after the error occurs, can you launch new docker containers (e.g., with the standard runc)?

Yes

ctalledo commented 2 years ago

Hi @OriGlassman,

Looking at the log you posted:

time="2022-04-08 14:01:05" level=debug msg="autoRemoveCheck: Docker query start for 8ec38dddb942" 
time="2022-04-08 14:01:08" level=debug msg="FIXME: Got an status-code for which error does not match any expected type!!!: -1" module=api status_code=-1 
time="2022-04-08 14:01:08" level=debug msg="autoRemoveCheck: Docker query for 8ec38dddb942 failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

it's clear that Sysbox is having trouble connecting to Docker engine via its API, though I don't quite understand why.

The FIXME message provides a clue: this message comes from docker engine, here.

This tells me that Docker engine is responding with an error that is not one of the "known" errors. Not sure why though.

A couple of follow ups:

1) Looks like your host is a Ubuntu Focal VM on AWS, with kernel 5.13, correct?

2) What version of docker engine are you running on the host (i.e., docker version)?

OriGlassman commented 2 years ago
  1. Looks like your host is a Ubuntu Focal VM on AWS, with kernel 5.13, correct?

Correct

  1. What version of docker engine are you running on the host (i.e., docker version)?

`Client: Docker Engine - Community Version: 20.10.14 API version: 1.41 Go version: go1.16.15 Git commit: a224086 Built: Thu Mar 24 01:48:02 2022 OS/Arch: linux/amd64 Context: default Experimental: true

Server: Docker Engine - Community Engine: Version: 20.10.9 API version: 1.41 (minimum version 1.12) Go version: go1.16.8 Git commit: 79ea9d3 Built: Mon Oct 4 16:06:37 2021 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.5.11 GitCommit: 3df54a852345ae127d1fa3092b95168e4a88e2f8 runc: Version: 1.0.3 GitCommit: v1.0.3-0-gf46b6ba docker-init: Version: 0.19.0 GitCommit: de40ad0 `

ctalledo commented 2 years ago

Hi @OriGlassman,

I tried reproducing the problem using the exact same Docker CLI + Engine version you posted above, but was unable to do so. In fact I ran the entire Sysbox test suite (> 100 tests) using that Docker CLI + Engine versions and don't see any sign of the error you are getting.

This, combined with the fact that we've never seen a report of this error before, leads me to think this is an issue specific to your environment, though it's not clear to me what triggers it.

Looking again at the sysbox-mgr log you posted above, it seems the Docker daemon is not responding to the sysbox-mgr's query for 3 secs, at which point the sysbox-mgr times-out and reports the error.

time="2022-04-08 14:01:05" level=debug msg="autoRemoveCheck: Docker query start for 8ec38dddb942"                                                                                                                                                                                                                             
time="2022-04-08 14:01:08" level=debug msg="FIXME: Got an status-code for which error does not match any expected type!!!: -1" module=api status_code=-1                                                                                                                                                                      
time="2022-04-08 14:01:08" level=debug msg="autoRemoveCheck: Docker query for 8ec38dddb942 failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"                                                                                                                         

What caused the Docker engine to not respond? I've no idea.

The reason you don't see this with the regular runc is that it never queries Docker back for any info, but Sysbox does. Nonetheless, I don't believe there is a bug in Sysbox, as the evidence points to Docker not responding via it's API. We just have to figure out why.

Question: when the failure occurs, what other operations was Docker doing at the time? I wonder if it's busy with something and therefore not responding on its API.

OriGlassman commented 2 years ago

In one environment there were other containers (with runc runtime) which docker restart occasionally and in other environment the only container was with sysbox-runc runtime so basically it shouldn't be busy doing other operations

ctalledo commented 2 years ago

Hi @OriGlassman, thanks.

Is there anything in the Docker log (e.g., sudo journalctl -u docker) around the time of the failure (time="2022-04-08 14:01:08") that would give a hint on why Docker was not responding via it's API?