hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

ECS - service is cycling tasks (containers) with 137 exit code - no obvious errors in CloudWatch #149

Closed MikeTheCanuck closed 6 years ago

MikeTheCanuck commented 6 years ago

Previously, 137 errors have been due to an ECS EC2 host running out of disk space (usually spotted with df -h, or by finding hundreds of files filling /var/lib/docker/tmp/).
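
For reference, the quick checks for that failure mode (a sketch; the /var/lib/docker/tmp path is as seen on the ECS-optimized AMI):

$ df -h                                          # any volume at or near 100% use?
$ sudo find /var/lib/docker/tmp -type f | wc -l  # count files accumulating here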

This time around, one of the services is restarting its tasks (containers) every 1-3 minutes and recording a 137 exit code, but when we look at the CloudWatch logs for those tasks, no actual errors are being recorded (only warnings and info).
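
(For context: exit code 137 is 128 + 9, i.e. the container was killed with SIGKILL rather than exiting on its own. A quick way to list the affected containers on a host, as a sketch:)

$ docker ps -a --filter exited=137 --format '{{.ID}} {{.Names}} {{.Status}}'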

Details of Investigation

SSHing into the EC2 hosts, df -h reports plenty of free space on the visible volumes.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  1.5G  6.3G  19% /
devtmpfs        3.9G  132K  3.9G   1% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm

No files are visible in /var/lib/docker/tmp.

But if we run sudo vgs, this doesn't look good:

$ sudo vgs
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  VG     #PV #LV #SN Attr   VSize  VFree  
  docker   1   1   0 wz--n- 22.00g 168.00m

Digging a bit deeper - how many containers have been deployed but stopped (and are no longer needed)?

$ docker info
Containers: 335
 Running: 14
 Paused: 0
 Stopped: 321
Images: 16
...

You can confirm that only a tiny fraction of the containers Docker has on disk are actually running by comparing these two counts:

First, the number of running containers: docker ps -q | wc -l

Then, the number of total containers: docker ps -aq | wc -l

(In one case there were 14 running vs. 319 total, due to rapidly failing and redeploying containers from a couple of our projects.)
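
Both counts in one line (a sketch; -q prints one container ID per line, so wc -l counts exactly):

$ echo "running: $(docker ps -q | wc -l) / total: $(docker ps -aq | wc -l)"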

Non-Solutions

  1. Extend the unmounted Docker storage volume on each EC2 host: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-storage-config.html (this didn't alleviate the 137 issues)
  2. Clean the excess Docker crap out of that "hidden" volume: docker rm $(docker ps -a -q) followed by docker rmi $(docker images -q) (these commands don't change the output of sudo vgs immediately; a one-command equivalent is sketched below)
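
For reference, on Docker 1.13+ the same cleanup is a single command (a sketch; it doesn't change the verdict above):

$ docker system prune -a   # removes stopped containers, unused networks, unused images, and build cache
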
MikeTheCanuck commented 6 years ago

Today's 137 nightmare:

bhgrant8 commented 6 years ago

@MikeTheCanuck are you able to run `docker system df`?
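
(For anyone following along, docker system df summarizes Docker's disk usage per type; a sketch of what to run:)

$ docker system df   # TOTAL / ACTIVE / SIZE / RECLAIMABLE for Images, Containers, and Local Volumes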

MikeTheCanuck commented 6 years ago

So Michael Lange read up on vgs and lvs and has reported that our logical volume is not in fact full:

so I did some reading on these vgs and lvs commands we are running
and I think that the `sudo vgs` output is a red herring
if I understand correctly, that output is showing all volume groups and how much of the total storage block they are taking

MikeTheCanuck [12:07 PM]
Well then what is causing ECS to report 137 exit code when trying to start the next container instance?

Michael Lange [12:07 PM]
so that output just means docker is the only partition and it is partitioned for roughly everything

MikeTheCanuck [12:08 PM]
OK, that makes a certain kind of sense - another storage system on top of the Linux filesystem, just like databases have a “filesystem” inside their monstrous pre-allocated files…

Michael Lange [12:08 PM]
yeah
but the output for `sudo lvs` shows legitimate numbers
meaning 2A is 33% full and 2B is 58% full
and all the docker commands are just for detailed information

MikeTheCanuck [12:09 PM]
OK, that’s reassuring in a way…

Michael Lange [12:09 PM]
and do insane amounts of on demand disk IO and just take forever
but…that just puts us back to square one
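
To check the thin-pool fill levels Michael is describing directly (a sketch, assuming the devicemapper storage driver that the ECS-optimized AMI uses):

$ sudo lvs -o lv_name,vg_name,lv_size,data_percent,metadata_percent
# Data% is the real fill level of the docker thin pool; the VFree column from
# `sudo vgs` is only the slack left unallocated in the volume group
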
MikeTheCanuck commented 6 years ago

The root cause of the 137s appears to be business as usual for ELB/ALB: https://forums.aws.amazon.com/message.jspa?messageID=643055#643055

i.e. when the ALB detects a bad container (one that doesn't pass its health check), the ALB itself initiates the SIGKILL for that container and lets ECS try again.
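
One way to confirm this from the ECS side is to read the stopped-task reasons (a sketch; the cluster name is hypothetical):

$ aws ecs list-tasks --cluster hacko-integration --desired-status STOPPED
$ aws ecs describe-tasks --cluster hacko-integration --tasks <task-arn> \
    --query 'tasks[].[stoppedReason,containers[].exitCode]'

For this failure mode, stoppedReason should point at failing ELB health checks alongside the 137 exit code.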

So the real solution? Get these jank containers fixed ASAP.
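
One mitigation worth noting while the containers get fixed (an assumption on my part, not something we've tried here): if the apps are merely slow to boot, ECS can be told to ignore ALB health checks during a startup window via the service's health check grace period. Cluster and service names below are hypothetical:

$ aws ecs update-service --cluster hacko-integration --service transportation-systems \
    --health-check-grace-period-seconds 120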