moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Docker container with HEALTHCHECK becomes unresponsive #41853

Open hermandavid opened 3 years ago

hermandavid commented 3 years ago

Description

We have started using docker swarm in our deployments and have noticed that some containers become unresponsive when a healthcheck is used; see the HEALTHCHECK instruction below. With the healthcheck removed, the containers run fine without crashing.

HEALTHCHECK --interval=10s --timeout=5s --start-period=30s \
  CMD curl --silent --show-error --max-time 1 http://localhost:3000/ping || exit 1
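
Since we deploy via swarm, the same check can also be expressed as service-level flags; a minimal sketch for reproducing it that way (the service name and image below are placeholders, not our actual deployment):

# Sketch only: the same healthcheck as above, expressed as docker service create flags.
# "ping-demo" and "myorg/app:latest" are placeholders.
docker service create \
  --name ping-demo \
  --health-cmd "curl --silent --show-error --max-time 1 http://localhost:3000/ping || exit 1" \
  --health-interval 10s \
  --health-timeout 5s \
  --health-start-period 30s \
  myorg/app:latest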

Steps to reproduce the issue:

  1. Run new container
  2. Wait enough time (sometimes minutes, sometimes hours)
  3. Container becomes unresponsive

Describe the results you received:

level=warning msg="Health check for container d80fc76db2f8c78c036c42a51390e6166990af76e2b979758da5bb91a6bea507 error: context deadline exceeded"
level=error msg="stream copy error: reading from a closed fifo"
level=error msg="stream copy error: reading from a closed fifo"
level=warning msg="Health check for container d80fc76db2f8c78c036c42a51390e6166990af76e2b979758da5bb91a6bea507 error: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: process_linux.go:103: executing setns process caused: exit status 1: unknown"

Describe the results you expected:

A container that stays healthy and responsive.

Additional information you deem important (e.g. issue happens only occasionally):

This does not happen systematically: sometimes containers stay alive for minutes, sometimes for hours. We have multiple swarm clusters with the same configuration, and we experience this issue on only some of them.
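
To spot when a container flips, we keep an eye on the health state the engine itself reports; a rough sketch (the container name is a placeholder):

# Rough sketch: poll the health status and failing streak the engine reports.
# "my_service.1.abc" is a placeholder container name.
watch -n 10 'docker inspect --format "{{.State.Health.Status}} (failing streak: {{.State.Health.FailingStreak}})" my_service.1.abc'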

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.1
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        831ebea
 Built:             Tue Dec 15 04:34:48 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       f001486
  Built:            Tue Dec 15 04:32:45 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.0-docker)

Server:
 Containers: 12
  Running: 7
  Paused: 0
  Stopped: 5
 Images: 8
 Server Version: 20.10.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: vk2zgmcl17n0nhgdidrexv9tc
  Is Manager: true
  ClusterID: lg9505ca4hfwr0mhx40ycgppm
  Managers: 3
  Nodes: 3
  Default Address Pool: 10.254.0.0/16  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.0.1.5
  Manager Addresses:
   10.0.1.5:2377
   10.0.1.6:2377
   10.0.1.7:2377
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.19.0-12-cloud-amd64
 Operating System: Debian GNU/Linux 10 (buster)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.67GiB
 Name: mena01
 ID: RE6W:SJZI:NSSZ:UQWS:CP6K:OJC3:DHUX:RYPR:PQ5N:SXDZ:HHUL:4OT3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support

Additional environment details (AWS, VirtualBox, physical, etc.):

The VMs are running on Azure.

alwaysmpe commented 3 years ago

Bumping this; I think I've encountered the same issue.

Possibly related to #36661

I deployed a stack of containers connected to an overlay network using docker compose (based on the reference compose file in the airflow docs), across 6 physical (on-prem) machines with ansible. Several initial deployments were fine. One deployment caused a hang on 3 machines; I wasn't entirely sure why and left as it was EOD. The next morning the machines were unresponsive: docker had created several thousand processes to run the healthcheck, hitting the server process limit and making restarts tricky. Some machines were running docker 20.10.7, others 20.10.8.
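
In case it helps anyone else diagnose this, a rough sketch of what I'd count on an affected host (the patterns assume a curl-based check like the one in the original report; adjust to your own healthcheck command):

# Sketch: count processes spawned for the healthcheck command
# (pattern assumes the curl-based check from the original report).
pgrep -fc 'curl --silent --show-error --max-time 1'

# Sketch: count runc processes currently on the host,
# a rough proxy for stuck healthcheck exec attempts.
ps -eo pid,etime,cmd | grep '[r]unc' | wc -l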

rabergerik commented 1 year ago

@hermandavid @alwaysmpe I know it's been a long time, but did you ever get any further with this?

I'm currently experiencing this issue after several hours of execution. I'm running Azure IoT Edge modules in docker 20.10.14 on an EFLOW VM (CBL Mariner). The application is sending messages upstream to an IoT Hub in Azure. Health checks are running once a minute. From yesterday's logs, the checks start at 2023-03-29T07:43:52Z. At 15:28 I start getting these at every health check:

Mar 29 15:28:45 ELX210922-EFLOW dockerd[17107]: time="2023-03-29T15:28:45.192132862Z" level=error msg="stream copy error: reading from a closed fifo"
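
Something along these lines can be used to see when they start (a sketch; it assumes dockerd logs to the journal under the docker.service unit):

# Sketch: pull the stream-copy/fifo errors for the day from the journal
# (assumes dockerd logs under the docker.service unit).
journalctl -u docker.service --since "2023-03-29 00:00:00" | grep "stream copy error"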

The CPU of my VM then starts running hot, and the last message received by the IoT Hub comes in at 16:23. Some notable events after that:

Mar 29 16:23:28 ELX210922-EFLOW systemd-networkd[710]: eth0: Could not set DHCPv4 route: Connection timed out
Mar 29 16:23:29 ELX210922-EFLOW systemd-networkd[710]: eth0: Failed
Mar 29 16:21:39 ELX210922-EFLOW dockerd[17107]: time="2023-03-29T16:21:39.591688219Z" level=warning msg="Health check for container 36257c67116fb841df86a419b3f960254c6cd937c8417f2651546a91a0fcf3dc error: context deadline exceeded"
Mar 29 16:47:03 ELX210922-EFLOW systemd[1]: systemd-journald.service: Killing process 573 (systemd-journal) with signal SIGABRT.

... silence ...

Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: State 'stop-watchdog' timed out. Killing.
Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: Killing process 573 (systemd-journal) with signal SIGKILL.
Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
Mar 29 18:06:23 ELX210922-EFLOW kernel: systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Mar 29 18:06:23 ELX210922-EFLOW kernel: Out of memory: Killed process 26019 (dotnet) total-vm:2763908kB, anon-rss:142648kB, file-rss:0kB, shmem-rss:0kB, UID:13623 pgtables:708kB oom_score_adj:0
Mar 29 18:06:23 ELX210922-EFLOW kernel: oom_reaper: reaped process 26019 (dotnet), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

After this, things come back up again, run for some hours, and then the same procedure repeats. I've now removed the HEALTHCHECK statement from my Dockerfile and will continue to monitor the system.
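
For anyone hitting this who wants to rule the healthcheck out without rebuilding the image, the check can also be disabled at run time; a small sketch (image and service names are placeholders):

# Sketch: disable the image's built-in healthcheck without editing the Dockerfile.
# For a standalone container ("myorg/app:latest" is a placeholder image):
docker run --no-healthcheck myorg/app:latest

# For an existing swarm service ("my_service" is a placeholder name):
docker service update --no-healthcheck my_service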