prometheus-community / ecs_exporter

Prometheus exporter for Amazon Elastic Container Service (ECS)

Feature: Expose cpu_stats.throttling_data #41

Open jseiser opened 2 years ago

jseiser commented 2 years ago

The information is available from the task stats endpoint: curl -o stats.json "${ECS_CONTAINER_METADATA_URI_V4}/task/stats"

        "cpu_stats": {
            "cpu_usage": {
                "total_usage": 1666419070,
                "percpu_usage": [
                    676748195,
                    989670875
                ],
                "usage_in_kernelmode": 170000000,
                "usage_in_usermode": 1010000000
            },
            "system_cpu_usage": 7028500000000,
            "online_cpus": 2,
            "throttling_data": {
                "periods": 0,
                "throttled_periods": 0,
                "throttled_time": 0
            }
        },
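
For reference, here is a minimal sketch of how these throttling_data fields could be surfaced as Prometheus counters via a custom collector. It assumes /task/stats returns a JSON object keyed by container ID (as it does for the other Docker stats fields) and uses made-up metric names rather than ecs_exporter's real ones:

    // Illustrative sketch only: metric names and structure are assumptions,
    // not ecs_exporter's actual code.
    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "os"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // containerStats decodes only the throttling-related fields of a container's stats.
    type containerStats struct {
        CPUStats struct {
            ThrottlingData struct {
                Periods          uint64 `json:"periods"`
                ThrottledPeriods uint64 `json:"throttled_periods"`
                ThrottledTime    uint64 `json:"throttled_time"` // nanoseconds
            } `json:"throttling_data"`
        } `json:"cpu_stats"`
    }

    type throttlingCollector struct {
        periods, throttledPeriods, throttledSeconds *prometheus.Desc
    }

    func newThrottlingCollector() *throttlingCollector {
        labels := []string{"container_id"}
        return &throttlingCollector{
            periods:          prometheus.NewDesc("ecs_cpu_throttling_periods_total", "CPU enforcement periods seen by the container.", labels, nil),
            throttledPeriods: prometheus.NewDesc("ecs_cpu_throttled_periods_total", "CPU periods in which the container was throttled.", labels, nil),
            throttledSeconds: prometheus.NewDesc("ecs_cpu_throttled_seconds_total", "Total time the container spent throttled.", labels, nil),
        }
    }

    func (c *throttlingCollector) Describe(ch chan<- *prometheus.Desc) {
        ch <- c.periods
        ch <- c.throttledPeriods
        ch <- c.throttledSeconds
    }

    func (c *throttlingCollector) Collect(ch chan<- prometheus.Metric) {
        resp, err := http.Get(os.Getenv("ECS_CONTAINER_METADATA_URI_V4") + "/task/stats")
        if err != nil {
            log.Printf("fetching task stats: %v", err)
            return
        }
        defer resp.Body.Close()

        // The endpoint returns a JSON object keyed by container ID.
        stats := map[string]containerStats{}
        if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
            log.Printf("decoding task stats: %v", err)
            return
        }
        for id, s := range stats {
            t := s.CPUStats.ThrottlingData
            ch <- prometheus.MustNewConstMetric(c.periods, prometheus.CounterValue, float64(t.Periods), id)
            ch <- prometheus.MustNewConstMetric(c.throttledPeriods, prometheus.CounterValue, float64(t.ThrottledPeriods), id)
            ch <- prometheus.MustNewConstMetric(c.throttledSeconds, prometheus.CounterValue, float64(t.ThrottledTime)/1e9, id)
        }
    }

    func main() {
        prometheus.MustRegister(newThrottlingCollector())
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil)) // arbitrary port for this sketch
    }

A read-on-scrape collector with MustNewConstMetric fits here because the kernel already maintains these counters; the exporter would just relay the current values.
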
kvendingoldo commented 2 years ago

Hi @jseiser, I've started working on updating the metric set in my PR: https://github.com/prometheus-community/ecs_exporter/pull/46. You're welcome to contribute there.

isker commented 1 month ago

Has anyone seen a situation where these are non-zero? It should notionally be possible if any CPU limits are set on the task or the container, but even with such limits in place I do not see anything moving. My expectation is that periods would, well, periodically increase whenever those limits are in effect, even if no throttling is taking place.

I can try to experiment with creating a while (true) {} container to see if I can make the stats move when throttling is actually happening.

isker commented 1 month ago

I ran tasks with both ecs_exporter and an alpine sidecar running ["/bin/sh", "-c", "yes > /dev/null"] (i.e. chewing up a lot of CPU) on Fargate and EC2. They both had less than 1 vCPU allocated, Fargate at the task level and EC2 at the container level. The CPU-seconds metrics for both were definitely increasing more slowly than real time passed, indicating that throttling was occurring. The built-in CloudWatch graphs available in the AWS console also indicated that these services were using all available CPU.
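
To make the "slower than real time" observation concrete (my own rough check, not something built into the exporter): sampling cpu_stats.cpu_usage.total_usage from /task/stats twice and dividing the delta by the elapsed wall-clock time gives the average number of vCPUs the task actually got, which should sit at roughly the configured limit when a busy-loop sidecar is being throttled:

    // Rough check only: estimates the average vCPUs granted to the task
    // between two samples of the task stats endpoint.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "os"
        "time"
    )

    // stats decodes only cpu_stats.cpu_usage.total_usage (cumulative CPU time in nanoseconds).
    type stats struct {
        CPUStats struct {
            CPUUsage struct {
                TotalUsage uint64 `json:"total_usage"`
            } `json:"cpu_usage"`
        } `json:"cpu_stats"`
    }

    // taskCPUNanos sums total_usage across all containers in the task.
    func taskCPUNanos() (uint64, error) {
        resp, err := http.Get(os.Getenv("ECS_CONTAINER_METADATA_URI_V4") + "/task/stats")
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        perContainer := map[string]stats{}
        if err := json.NewDecoder(resp.Body).Decode(&perContainer); err != nil {
            return 0, err
        }
        var sum uint64
        for _, s := range perContainer {
            sum += s.CPUStats.CPUUsage.TotalUsage
        }
        return sum, nil
    }

    func main() {
        before, err := taskCPUNanos()
        if err != nil {
            panic(err)
        }
        start := time.Now()
        time.Sleep(30 * time.Second)

        after, err := taskCPUNanos()
        if err != nil {
            panic(err)
        }
        elapsed := time.Since(start)

        // With a busy-loop sidecar, an average well below the allocated vCPUs
        // means the task is being throttled even if throttling_data stays at 0.
        vcpus := float64(after-before) / float64(elapsed.Nanoseconds())
        fmt.Printf("average vCPUs used over %s: %.2f\n", elapsed, vcpus)
    }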

But the container throttling stats remained at 0. I'm not sure why, but regardless, I think this is a dead end without action from AWS.