shoenig opened 2 years ago
I've been able to reproduce this locally on a VM running ubuntu 22.04 (Jammy), which is configured with cgroups v2, but I've been unable to reproduce with any cgroups v1 configuration. With the following patch:
$ git diff
diff --git a/drivers/docker/handle.go b/drivers/docker/handle.go
index 0783bd901..c0f787a44 100644
--- a/drivers/docker/handle.go
+++ b/drivers/docker/handle.go
@@ -11,6 +11,7 @@ import (
"time"
"github.com/armon/circbuf"
+ "github.com/davecgh/go-spew/spew"
docker "github.com/fsouza/go-dockerclient"
"github.com/hashicorp/consul-template/signals"
hclog "github.com/hashicorp/go-hclog"
@@ -239,6 +240,11 @@ func (h *taskHandle) run() {
container, ierr := h.waitClient.InspectContainerWithOptions(docker.InspectContainerOptions{
ID: h.containerID,
})
+
+ scs := spew.NewDefaultConfig()
+ scs.DisableMethods = true
+ h.logger.Warn("container.State", "state", scs.Sdump(container.State))
+
oom := false
if ierr != nil {
h.logger.Error("failed to inspect container", "error", ierr)
I can see the OOMKilled flag simply isn't being set:
2022-07-28T14:13:57.596Z [WARN] docker/handle.go:246: docker: container.State: container_id=aac5d6b4620001d04314fd18bea623f5f767615778b88292b98706e4ad998263
state=
| (docker.State) {
| Status: (string) (len=6) "exited",
| Running: (bool) false,
| Paused: (bool) false,
| Restarting: (bool) false,
| OOMKilled: (bool) false,
| RemovalInProgress: (bool) false,
| Dead: (bool) false,
| Pid: (int) 0,
| ExitCode: (int) 137,
| Error: (string) "",
| StartedAt: (time.Time) {
| wall: (uint64) 485069344,
| ext: (int64) 63794614435,
| loc: (*time.Location)(<nil>)
| },
| FinishedAt: (time.Time) {
| wall: (uint64) 515940237,
| ext: (int64) 63794614437,
| loc: (*time.Location)(<nil>)
| },
| Health: (docker.Health) {
| Status: (string) "",
| FailingStreak: (int) 0,
| Log: ([]docker.HealthCheck) <nil>
| }
| }
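For context on the ExitCode: 137 in the state dump above: by convention that value is 128 plus the signal number, so 137 only tells us the process was terminated by SIGKILL (signal 9), whether by the kernel OOM killer or anything else. That's exactly why the OOMKilled flag matters for disambiguation. A minimal sketch of the decoding (the helper name is mine, not from the driver):

```go
package main

import "fmt"

// signalFromExitCode decodes the "128 + N" convention that shells and
// container runtimes use for processes terminated by signal N.
// It returns 0 when the exit code does not encode a signal.
func signalFromExitCode(code int) int {
	if code > 128 && code <= 128+64 {
		return code - 128
	}
	return 0
}

func main() {
	// 137 is the exit code reported in the container state above.
	fmt.Println(signalFromExitCode(137)) // signal 9 is SIGKILL
}
```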
I traced this through go-dockerclient to moby/moby and ended up finding this very similar test failure on the moby project: https://github.com/moby/moby/issues/41929. From there I dug down to this PR in containerd https://github.com/containerd/containerd/pull/6323 where the field they're reading from was changed. This was released in containerd v1.6.3 and v1.5.12. The moby project has some unusual vendoring setup going on, but it looks like even the most recent tag for moby is pinned to containerd v1.5.0 + security fixes.
So we can't really fix this even by updating our moby dependency until upstream has done the same. I'm going to add a flag to the test to disable it in the case of cgroups v2 with a pointer to this issue, and then mark this issue as waiting for upstream. I'll also drop a note over in https://github.com/moby/moby/issues/41929 pointing them to the containerd PR.
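As a hedged illustration of the workaround space while waiting on upstream: on cgroups v2 the kernel itself exposes an oom_kill counter in each cgroup's memory.events file, so an OOM kill could in principle be detected without trusting the Docker flag at all. This is only a sketch of parsing that file format, not what the driver actually implements; the function name and sample content are mine:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// oomKills returns the value of the oom_kill counter from the contents
// of a cgroup v2 memory.events file, whose lines have the form
// "<key> <value>".
func oomKills(events string) int {
	for _, line := range strings.Split(events, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			if n, err := strconv.Atoi(fields[1]); err == nil {
				return n
			}
		}
	}
	return 0
}

func main() {
	// Sample content resembling /sys/fs/cgroup/<path>/memory.events
	// after the kernel OOM killer has fired once.
	sample := "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1"
	fmt.Println(oomKills(sample))
}
```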
https://github.com/hashicorp/nomad/pull/13928 will remove the test flake. I've updated the labels for this issue to make it clear this is a problem with the driver and not the test.
Wow nice detective work @tgross, thanks
We recently faced an issue in prod (on cgroups v2, Ubuntu 22.04) where the docker container restarted just a few seconds after a restart. We think a memory build-up by the app in such a short span of time is impossible, as normal memory usage is ~1GB while the limit is 24GB.
Just wanted to confirm, is the issue we faced related to the above issue being discussed?
Hi @mr-karan, I believe a docker container can exit with 137 whether or not the issue is an OOM kill, since 137 only indicates the process was terminated by SIGKILL. Usually at least one of docker logs or dmesg should be able to shed light on the underlying kill reason.
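To make the dmesg suggestion concrete: when the kernel OOM killer fires, it logs a line along the lines of "Out of memory: Killed process <pid> (<name>) ...", so a quick check is to scan the kernel log for those markers. A sketch under the assumption of typical message wording, which varies across kernel versions (the sample line is illustrative, not from this incident):

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeOOMKill reports whether a kernel log line appears to be an
// OOM-killer message. The exact wording differs between kernel
// versions, so this only checks for common markers.
func looksLikeOOMKill(line string) bool {
	return strings.Contains(line, "Out of memory") ||
		strings.Contains(line, "oom-kill") ||
		strings.Contains(line, "oom_reaper")
}

func main() {
	// Illustrative dmesg-style line; real output differs by kernel.
	line := "[1234.5678] Out of memory: Killed process 4321 (myapp)"
	fmt.Println(looksLikeOOMKill(line))
}
```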
That said, I'm currently in the process of upgrading our docker libraries, which are a couple of years old; the newer versions may contain bug fixes / improvements in this area.
This test has started failing a lot, maybe 22.04 related