influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Support for exposing container memory reservation in ECS #12271

Open rhowe opened 1 year ago

rhowe commented 1 year ago

Use Case

Currently the ecs input plugin exposes memory usage data and the container/task memory limit, both of which come from the docker stats endpoint. It would be extremely useful to also know the memory reservation (soft limit) and the optional CPU reservation associated with each container, as these are among the primary drivers of resource allocation within ECS and can help detect over- or underprovisioning. As far as I can tell this information is not available from the ECS task metadata, but it could be parsed from the task definition on startup, given appropriate IAM permissions. Since task definitions are immutable, this would be a one-off operation. The soft limit is also available from 'docker inspect' under the MemoryReservation value, although I don't think that's easily accessible from within the task.

Expected behavior

An ecs_container_mem_reservation metric with the container's memory reservation in bytes, and an ecs_container_cpu_reservation metric with the number of CPU units reserved for the container (value of 0 if no reservation is set?).

Actual behavior

This data is not currently exposed.

Additional info

I'm going to raise a feature request with AWS asking for these values to be exposed in the container stats endpoint. That would obviously be the best solution, although if they do add it, it may require container metadata v4 support in telegraf.

rhowe commented 1 year ago

AWS feature request: https://github.com/aws/containers-roadmap/issues/1893

powersj commented 1 year ago

Hi,

Thanks for opening the upstream feature request.

Outside of that feature request landing, you mentioned the possibility of collecting this data via reading the task definition itself. What endpoint and APIs would need to be hit for this?

rhowe commented 1 year ago

It would require calling the AWS ECS API's DescribeTaskDefinition method with the ARN of the task definition (which I think can be read from the ECS task metadata).
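A minimal sketch of the parsing side of that approach, in Go with only the standard library. It assumes the task metadata JSON carries `Family` and `Revision` fields, that DescribeTaskDefinition accepts a `family:revision` identifier in place of a full ARN, and that the response contains `containerDefinitions` entries with `cpu` and `memoryReservation` (in MiB). The helper names `taskDefinitionID` and `parseReservations` and the `reservation` type are hypothetical, not part of telegraf, and the actual API call (with credentials and IAM permissions) is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskMetadata holds the two task metadata fields assumed to identify
// the task definition revision.
type taskMetadata struct {
	Family   string `json:"Family"`
	Revision string `json:"Revision"`
}

// describeResponse mirrors the small subset of a DescribeTaskDefinition
// JSON response that this sketch needs.
type describeResponse struct {
	TaskDefinition struct {
		ContainerDefinitions []struct {
			Name              string `json:"name"`
			CPU               int64  `json:"cpu"`
			MemoryReservation int64  `json:"memoryReservation"` // MiB
		} `json:"containerDefinitions"`
	} `json:"taskDefinition"`
}

// reservation is a hypothetical carrier for the two proposed metrics.
type reservation struct {
	Container string
	MemBytes  int64 // candidate value for ecs_container_mem_reservation
	CPUUnits  int64 // candidate value for ecs_container_cpu_reservation
}

// taskDefinitionID builds a "family:revision" identifier from task metadata.
func taskDefinitionID(metadataJSON []byte) (string, error) {
	var md taskMetadata
	if err := json.Unmarshal(metadataJSON, &md); err != nil {
		return "", fmt.Errorf("parsing task metadata: %w", err)
	}
	if md.Family == "" || md.Revision == "" {
		return "", fmt.Errorf("task metadata lacks Family or Revision")
	}
	return md.Family + ":" + md.Revision, nil
}

// parseReservations extracts per-container reservations from a
// DescribeTaskDefinition response. Unset values decode as 0.
func parseReservations(responseJSON []byte) ([]reservation, error) {
	var resp describeResponse
	if err := json.Unmarshal(responseJSON, &resp); err != nil {
		return nil, fmt.Errorf("parsing task definition: %w", err)
	}
	defs := resp.TaskDefinition.ContainerDefinitions
	out := make([]reservation, 0, len(defs))
	for _, c := range defs {
		out = append(out, reservation{
			Container: c.Name,
			MemBytes:  c.MemoryReservation * 1024 * 1024, // MiB to bytes
			CPUUnits:  c.CPU,
		})
	}
	return out, nil
}
```

Because a task definition revision does not change, the result could be computed once at startup and reused on every gather interval.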

tkrafael commented 1 year ago

Is it possible to expose the CPU limit of a docker container? Currently, the CPU usage reported by inputs.docker differs from the ECS service metrics, which is confusing. Also, the inputs.docker plugin needs to be aware of cgroups. I did a small experiment: I ran a docker container on my machine and compared it with the docker inspect output of an ECS container.

In regular docker containers, cpuquota, cpuset, cpuperiod and the other cpu* fields are filled in from the docker run parameters. In ECS containers, CPU control is done through cgroups: [image]

That means the docker input needs to be aware of all those variables and find where the cgroup filesystem is mounted, so that it can read this information and send it as metrics to the telegraf outputs. I'm able to modify it and open a PR, but I need some advice on where the change belongs.
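As a rough illustration of what reading a CPU limit from the cgroup filesystem could involve, the sketch below parses the contents of a cgroup v2 `cpu.max` file, which holds either `<quota> <period>` in microseconds or `max <period>` when no limit is set. This assumes cgroup v2; on cgroup v1 the equivalent values live in `cpu.cfs_quota_us` and `cpu.cfs_period_us`. The function name `parseCPUMax` is hypothetical, and locating the mounted cgroup filesystem (commonly `/sys/fs/cgroup`, which is an assumption here) is left to the caller.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMax parses the contents of a cgroup v2 cpu.max file.
// It returns the limit in CPU cores and true when a quota is set,
// or 0 and false when the quota is "max" (unlimited).
func parseCPUMax(contents string) (float64, bool, error) {
	fields := strings.Fields(contents)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max contents %q", contents)
	}
	if fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, fmt.Errorf("parsing quota: %w", err)
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, false, fmt.Errorf("parsing period: %w", err)
	}
	if period <= 0 {
		return 0, false, fmt.Errorf("non-positive period %v", period)
	}
	// A quota of 50000 with a period of 100000 means half a core.
	return quota / period, true, nil
}
```

A plugin could pass in the result of `os.ReadFile` on the container's `cpu.max` file and report the returned value as a CPU limit field, skipping the field when no limit is set.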

tkrafael commented 1 year ago

BTW, I just found an ecs input that might work correctly. I'll take a look and see if it suffices.