hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

docker driver OOM killed detection flaky on cgroups v2 #13119

Open shoenig opened 2 years ago

shoenig commented 2 years ago

This test has started failing a lot, maybe related to Ubuntu 22.04:

=== RUN   TestDockerDriver_OOMKilled
    docker.go:36: Successfully connected to docker daemon running version 20.10.16+azure-2
    driver_test.go:2421: not killed by OOM killer: Docker container exited with non-zero exit code: 137
--- FAIL: TestDockerDriver_OOMKilled (2.41s)
tgross commented 2 years ago

I've been able to reproduce this locally on a VM running Ubuntu 22.04 (Jammy), which is configured with cgroups v2, but I've been unable to reproduce it with any cgroups v1 configuration. With the following patch:

$ git diff
diff --git a/drivers/docker/handle.go b/drivers/docker/handle.go
index 0783bd901..c0f787a44 100644
--- a/drivers/docker/handle.go
+++ b/drivers/docker/handle.go
@@ -11,6 +11,7 @@ import (
        "time"

        "github.com/armon/circbuf"
+       "github.com/davecgh/go-spew/spew"
        docker "github.com/fsouza/go-dockerclient"
        "github.com/hashicorp/consul-template/signals"
        hclog "github.com/hashicorp/go-hclog"
@@ -239,6 +240,11 @@ func (h *taskHandle) run() {
        container, ierr := h.waitClient.InspectContainerWithOptions(docker.InspectContainerOptions{
                ID: h.containerID,
        })
+
+       scs := spew.NewDefaultConfig()
+       scs.DisableMethods = true
+       h.logger.Warn("container.State", "state", scs.Sdump(container.State))
+
        oom := false
        if ierr != nil {
                h.logger.Error("failed to inspect container", "error", ierr)

I can see the OOMKilled flag simply isn't being set:

2022-07-28T14:13:57.596Z [WARN]  docker/handle.go:246: docker: container.State: container_id=aac5d6b4620001d04314fd18bea623f5f767615778b88292b98706e4ad998263
  state=
  | (docker.State) {
  |  Status: (string) (len=6) "exited",
  |  Running: (bool) false,
  |  Paused: (bool) false,
  |  Restarting: (bool) false,
  |  OOMKilled: (bool) false,
  |  RemovalInProgress: (bool) false,
  |  Dead: (bool) false,
  |  Pid: (int) 0,
  |  ExitCode: (int) 137,
  |  Error: (string) "",
  |  StartedAt: (time.Time) {
  |   wall: (uint64) 485069344,
  |   ext: (int64) 63794614435,
  |   loc: (*time.Location)(<nil>)
  |  },
  |  FinishedAt: (time.Time) {
  |   wall: (uint64) 515940237,
  |   ext: (int64) 63794614437,
  |   loc: (*time.Location)(<nil>)
  |  },
  |  Health: (docker.Health) {
  |   Status: (string) "",
  |   FailingStreak: (int) 0,
  |   Log: ([]docker.HealthCheck) <nil>
  |  }
  | }

I traced this through go-dockerclient to moby/moby and ended up finding this very similar test failure on the moby project: https://github.com/moby/moby/issues/41929. From there I dug down to this PR in containerd, https://github.com/containerd/containerd/pull/6323, where the field being read was changed. That fix was released in containerd v1.6.3 and v1.5.12. The moby project has an unusual vendoring setup, but it looks like even the most recent moby tag is pinned to containerd v1.5.0 plus security fixes.
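For background, cgroups v2 surfaces OOM kill information through the cgroup's `memory.events` file (the `oom` and `oom_kill` counters) rather than the cgroup v1 `memory.oom_control` interface, which is what forced the change on the containerd side. The sketch below only illustrates how the kernel exposes that counter; it is not how Nomad, moby, or containerd actually implement OOM detection, and the cgroup path is a made-up example.

```go
package oomcheck

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// wasOOMKilled reports whether the cgroup rooted at cgroupPath has recorded
// any OOM kills, by reading the oom_kill counter from its memory.events file.
// cgroupPath is illustrative, e.g. "/sys/fs/cgroup/system.slice/example.scope".
func wasOOMKilled(cgroupPath string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(cgroupPath, "memory.events"))
	if err != nil {
		return false, err
	}
	// memory.events contains lines like "oom 0" and "oom_kill 1".
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			n, err := strconv.Atoi(fields[1])
			if err != nil {
				return false, err
			}
			return n > 0, nil
		}
	}
	return false, nil
}
```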

So we can't really fix this even by updating our moby dependency until upstream has done the same. I'm going to add a flag to the test to disable it in the case of cgroups v2 with a pointer to this issue, and then mark this issue as waiting for upstream. I'll also drop a note over in https://github.com/moby/moby/issues/41929 pointing them to the containerd PR.
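For illustration, a guard along these lines could detect a unified (cgroups v2) hierarchy and skip the test; the helper name and detection method are a sketch, not necessarily what the Nomad test suite actually uses.

```go
import (
	"os"
	"testing"
)

// skipIfCgroupsV2 skips tests that rely on Docker's OOMKilled flag, which
// is not populated correctly on cgroups v2 hosts (see #13119).
// Hypothetical helper, not Nomad's actual test utility.
func skipIfCgroupsV2(t *testing.T) {
	t.Helper()
	// On a unified (cgroups v2) hierarchy this control file exists at the root.
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		t.Skip("OOM kill detection is unreliable on cgroups v2; see #13119")
	}
}
```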

tgross commented 2 years ago

https://github.com/hashicorp/nomad/pull/13928 will remove the test flake. I've updated the labels for this issue to make it clear this is a problem with the driver and not the test.

shoenig commented 2 years ago

Wow nice detective work @tgross, thanks

mr-karan commented 1 year ago

We recently faced an issue in prod (on cgroups v2, Ubuntu 22.04) where a docker container was restarted again just a few seconds after a restart. We think a memory build-up by the app in such a short span of time is impossible (normal memory usage is ~1GB, while the limit is 24GB).

Just wanted to confirm, is the issue we faced related to the above issue being discussed?

[screenshot attached]

shoenig commented 1 year ago

Hi @mr-karan, a docker container can exit with code 137 (128 + SIGKILL) whether or not the OOM killer was actually involved, so the exit code alone doesn't tell us if your case is related to this issue.

Usually at least one of docker logs or dmesg should be able to shed light on the underlying kill reason.

That said, I'm currently in the process of upgrading our docker libraries, which are a couple of years old; the newer versions may contain bug fixes or improvements in this area.
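For reference, here's a minimal sketch of reading the same flag through the official Docker SDK (github.com/docker/docker/client). This is only an illustration of the API surface, not the library or approach Nomad actually adopted, and on affected cgroups v2 hosts the flag may still come back false even after a real OOM kill.

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
)

// inspectOOM asks the Docker daemon whether it marked the container as
// OOM-killed. On the affected cgroups v2 setups this flag can be false
// even when the kernel OOM killer fired, which is the bug in this issue.
func inspectOOM(ctx context.Context, containerID string) (bool, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return false, err
	}
	defer cli.Close()

	info, err := cli.ContainerInspect(ctx, containerID)
	if err != nil {
		return false, err
	}
	return info.State.OOMKilled, nil
}

func main() {
	// "example-task" is a placeholder container name or ID.
	oom, err := inspectOOM(context.Background(), "example-task")
	if err != nil {
		panic(err)
	}
	fmt.Println("OOMKilled:", oom)
}
```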