hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

ECS - service is cycling tasks (containers) with 137 exit code - no obvious errors in CloudWatch #149

Closed MikeTheCanuck closed 6 years ago

MikeTheCanuck commented 6 years ago

Previously, 137 errors have been due to an ECS EC2 host running out of disk space (usually spotted with df -h, or by finding hundreds of files filling /var/lib/docker/tmp/).
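
For reference, the quick checks for that failure mode (a sketch; the /var/lib/docker/tmp path is as seen on the ECS-optimized AMI):

$ df -h                                          # any volume at or near 100% use?
$ sudo find /var/lib/docker/tmp -type f | wc -l  # count files accumulating here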

This time around, one of the services is restarting its tasks (containers) every 1-3 minutes and recording a 137 exit code, but when we look at the CloudWatch logs for those tasks, no actual errors are being recorded (only warnings and info).
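
(For context: exit code 137 is 128 + 9, i.e. the container was killed with SIGKILL rather than exiting on its own. A quick way to list the affected containers on a host, as a sketch:)

$ docker ps -a --filter exited=137 --format '{{.ID}} {{.Names}} {{.Status}}'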

Details of Investigation

SSHing into the EC2 hosts, df -h reports plenty of free space on the visible volumes.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  1.5G  6.3G  19% /
devtmpfs        3.9G  132K  3.9G   1% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm

No files are visible in /var/lib/docker/tmp.

But if we run sudo vgs, this doesn't look good:

$ sudo vgs
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  VG     #PV #LV #SN Attr   VSize  VFree  
  docker   1   1   0 wz--n- 22.00g 168.00m

Digging a bit deeper - how many containers have been deployed but stopped (and are no longer needed)?

$ docker info
Containers: 335
 Running: 14
 Paused: 0
 Stopped: 321
Images: 16
...

You can confirm that only a tiny fraction of the containers Docker has on disk are actually running by comparing these two counts:

First, the number of running containers: docker ps -q | wc -l

Then, the number of total containers: docker ps -aq | wc -l

(In one case there were 14 running vs. 319 total, due to rapidly failing and redeploying containers from a couple of our projects.)
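
Both counts in one line (a sketch; -q prints one container ID per line, so wc -l counts exactly):

$ echo "running: $(docker ps -q | wc -l) / total: $(docker ps -aq | wc -l)"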

Non-Solutions

  1. Extend the unmounted Docker storage volume on each EC2 host: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-storage-config.html (this didn't alleviate the 137 issues)
  2. Clean the excess Docker crap out of that "hidden" volume: docker rm $(docker ps -a -q) followed by docker rmi $(docker images -q) (these commands don't change the output of sudo vgs immediately; a one-command equivalent is sketched below)
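
For reference, on Docker 1.13+ the same cleanup is a single command (a sketch; it doesn't change the verdict above):

$ docker system prune -a   # removes stopped containers, unused networks, unused images, and build cache
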
MikeTheCanuck commented 6 years ago

Today's 137 nightmare:

bhgrant8 commented 6 years ago

@MikeTheCanuck are you able to run `docker system df`?
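
(For anyone following along, docker system df summarizes Docker's disk usage per type; a sketch of what to run:)

$ docker system df   # TOTAL / ACTIVE / SIZE / RECLAIMABLE for Images, Containers, and Local Volumes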

MikeTheCanuck commented 6 years ago

So Michael Lange read up on vgs and lvs and has reported that our logical volume is not in fact full:

so I did some reading on these vgs and lvs commands we are running
and I think that the `sudo vgs` output is a red herring
if I understand correctly, that output is showing all volume groups and how much of the total storage block they are taking

MikeTheCanuck [12:07 PM]
Well then what is causing ECS to report 137 exit code when trying to start the next container instance?

Michael Lange [12:07 PM]
so that output just means docker is the only partition and it is partitioned for roughly everything

MikeTheCanuck [12:08 PM]
OK, that makes a certain kind of sense - another storage system on top of the Linux filesystem, just like databases have a “filesystem” inside their monstrous pre-allocated files…

Michael Lange [12:08 PM]
yeah
but the output for `sudo lvs` shows legitimate numbers
meaning 2A is 33% full and 2B is 58% full
and all the docker commands are just for detailed information

MikeTheCanuck [12:09 PM]
OK, that’s reassuring in a way…

Michael Lange [12:09 PM]
and do insane amounts of on demand disk IO and just take forever
but…that just puts us back to square one
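
To check the thin-pool fill levels Michael is describing directly (a sketch, assuming the devicemapper storage driver that the ECS-optimized AMI uses):

$ sudo lvs -o lv_name,vg_name,lv_size,data_percent,metadata_percent
# Data% is the real fill level of the docker thin pool; the VFree column from
# `sudo vgs` is only the slack left unallocated in the volume group
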
MikeTheCanuck commented 6 years ago

The root cause of the 137s appears to be business as usual for ELB/ALB: https://forums.aws.amazon.com/message.jspa?messageID=643055#643055

i.e. when the ALB detects a bad container (one that doesn't pass its health check), the ALB itself initiates the SIGKILL for that container and lets ECS try again.
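
One way to confirm this from the ECS side is to read the stopped-task reasons (a sketch; the cluster name is hypothetical):

$ aws ecs list-tasks --cluster hacko-integration --desired-status STOPPED
$ aws ecs describe-tasks --cluster hacko-integration --tasks <task-arn> \
    --query 'tasks[].[stoppedReason,containers[].exitCode]'

For this failure mode, stoppedReason should point at failing ELB health checks alongside the 137 exit code.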

So the real solution? Get these jank containers fixed ASAP.
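
One mitigation worth noting while the containers get fixed (an assumption on my part, not something we've tried here): if the apps are merely slow to boot, ECS can be told to ignore ALB health checks during a startup window via the service's health check grace period. Cluster and service names below are hypothetical:

$ aws ecs update-service --cluster hacko-integration --service transportation-systems \
    --health-check-grace-period-seconds 120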