hermandavid opened this issue 3 years ago
Bumping this, think I've encountered the same issue.
Possibly related to #36661
Deployed a stack of containers connected to an overlay network using Docker Compose (based on the reference file provided in the Airflow docs). Several initial deployments across 6 physical (on-prem) machines, performed with Ansible, went fine. One deployment caused a hang on 3 machines; I wasn't entirely sure why and went home as it was the end of the day. The next morning the machines were unresponsive: Docker had created several thousand processes to run the healthcheck, hitting the server process limit and making it tricky to restart them. Some machines were running Docker 20.10.7, others 20.10.8.
@hermandavid @alwaysmpe I know it's been a long time, but did you ever get any further with this?
I'm currently experiencing this issue after several hours of execution. I'm running Azure IoT Edge modules on Docker 20.10.14 on an EFLOW VM (CBL Mariner). The application sends messages upstream to an IoT Hub in Azure, and health checks run once a minute. In yesterday's logs, the checks start at 2023-03-29T07:43:52Z. At 15:28 I start getting the following on every health check:
Mar 29 15:28:45 ELX210922-EFLOW dockerd[17107]: time="2023-03-29T15:28:45.192132862Z" level=error msg="stream copy error: reading from a closed fifo"
The VM's CPU usage then spikes, and the last message received by the IoT Hub comes in at 16:23. Some notable events after that:
Mar 29 16:23:28 ELX210922-EFLOW systemd-networkd[710]: eth0: Could not set DHCPv4 route: Connection timed out
Mar 29 16:23:29 ELX210922-EFLOW systemd-networkd[710]: eth0: Failed
Mar 29 16:21:39 ELX210922-EFLOW dockerd[17107]: time="2023-03-29T16:21:39.591688219Z" level=warning msg="Health check for container 36257c67116fb841df86a419b3f960254c6cd937c8417f2651546a91a0fcf3dc error: context deadline exceeded"
Mar 29 16:47:03 ELX210922-EFLOW systemd[1]: systemd-journald.service: Killing process 573 (systemd-journal) with signal SIGABRT.
... silence ...
Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: State 'stop-watchdog' timed out. Killing.
Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: Killing process 573 (systemd-journal) with signal SIGKILL.
Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Mar 29 18:06:23 ELX210922-EFLOW systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
Mar 29 18:06:23 ELX210922-EFLOW kernel: systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Mar 29 18:06:23 ELX210922-EFLOW kernel: Out of memory: Killed process 26019 (dotnet) total-vm:2763908kB, anon-rss:142648kB, file-rss:0kB, shmem-rss:0kB, UID:13623 pgtables:708kB oom_score_adj:0
Mar 29 18:06:23 ELX210922-EFLOW kernel: oom_reaper: reaped process 26019 (dotnet), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
After this, things come back up again, run for some hours, and then the same sequence repeats. I've now removed the HEALTHCHECK instruction from my Dockerfile and will continue to monitor the system.
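For anyone who wants the same effect without deleting the instruction: a Dockerfile can also declare HEALTHCHECK NONE, which disables any healthcheck inherited from the base image. A minimal sketch (the base image here is only a placeholder, not the actual module image):

    # Placeholder base image -- substitute the module's actual base image.
    FROM debian:bookworm-slim
    # Disables any healthcheck inherited from the base image;
    # has the same effect as deleting the HEALTHCHECK instruction outright.
    HEALTHCHECK NONE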
Description
We have started using Docker Swarm in our deployments and have noticed that some containers become unresponsive when a healthcheck is used; see the HEALTHCHECK instruction below. When the healthcheck is removed, the containers run fine without crashing.
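The exact instruction is not reproduced in this report; for illustration only, it has roughly the shape below (the probed endpoint, interval, timeout, and retry values are placeholders, not the real configuration):

    # Illustrative shape only -- the real command and timings are not shown in this report.
    # Probes a local HTTP endpoint once per interval; a non-zero exit marks the container unhealthy.
    HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
      CMD curl -fsS http://localhost:8080/health || exit 1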
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Having a healthy container.
Additional information you deem important (e.g. issue happens only occasionally):
This does not happen systematically: sometimes containers stay alive for minutes, sometimes for hours. We have multiple Swarm clusters with the same configuration and experience this issue on only some of them.
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
The VMs are running on Azure.