prometheus-community / ecs_exporter

Prometheus exporter for Amazon Elastic Container Service (ECS)

How to treat ecs_cpu_seconds_total #34

Closed. jseiser closed this issue 2 years ago.

jseiser commented 2 years ago

I cannot find a way to graph ecs_cpu_seconds_total the way I can for any other exporter we have that returns CPU seconds. The standard approach, sketched below, is described in:

  1. https://www.robustperception.io/understanding-machine-cpu-usage/
  2. https://stackoverflow.com/questions/34923788/prometheus-convert-cpu-user-seconds-to-cpu-usage/34930574#34930574
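
A minimal sketch of that approach, using this exporter's metric name (assuming the counter really is in CPU-seconds):

    # Per-core CPU usage of each container, as a fraction of one core
    rate(ecs_cpu_seconds_total[1m])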

Everything I'm trying returns very large numbers that I do not understand.

When I check /metrics in my browser, I get:

ecs_cpu_seconds_total{container="heartbeat",cpu="0"} 2.3798033553e+08
ecs_cpu_seconds_total{container="heartbeat",cpu="1"} 2.37868708e+08
ecs_cpu_seconds_total{container="log_router",cpu="0"} 9.787324061e+07
ecs_cpu_seconds_total{container="log_router",cpu="1"} 9.13804651e+07
ecs_cpu_seconds_total{container="prom_exporter",cpu="0"} 3.971903977e+07
ecs_cpu_seconds_total{container="prom_exporter",cpu="1"} 4.042450038e+07

When I query in Prometheus, I get:

ecs_cpu_seconds_total{container="heartbeat", cpu="0", ecs_cluster="Cluster01", ecs_task_id="5ccd12509b6545519a62604d624f44d0", ecs_task_version="10", instance="10.1.111.137:9779", job="Heartbeat", metrics_path="1m/metrics"} | 247372377.12
ecs_cpu_seconds_total{container="heartbeat", cpu="1", ecs_cluster="Cluster01", ecs_task_id="5ccd12509b6545519a62604d624f44d0", ecs_task_version="10", instance="10.1.111.137:9779", job="Heartbeat", metrics_path="1m/metrics"} | 247288853.64
ecs_cpu_seconds_total{container="log_router", cpu="0", ecs_cluster="Cluster01", ecs_task_id="5ccd12509b6545519a62604d624f44d0", ecs_task_version="10", instance="10.1.111.137:9779", job="Heartbeat", metrics_path="1m/metrics"} | 101809511.73
ecs_cpu_seconds_total{container="log_router", cpu="1", ecs_cluster="Cluster01", ecs_task_id="5ccd12509b6545519a62604d624f44d0", ecs_task_version="10", instance="10.1.111.137:9779", job="Heartbeat", metrics_path="1m/metrics"} | 94857416.9
ecs_cpu_seconds_total{container="prom_exporter", cpu="0", ecs_cluster="Cluster01", ecs_task_id="5ccd12509b6545519a62604d624f44d0", ecs_task_version="10", instance="10.1.111.137:9779", job="Heartbeat", metrics_path="1m/metrics"} | 42798612.53
ecs_cpu_seconds_total{container="prom_exporter", cpu="1", ecs_cluster="Cluster01", ecs_task_id="5ccd12509b6545519a62604d624f44d0", ecs_task_version="10", instance="10.1.111.137:9779", job="Heartbeat", metrics_path="1m/metrics"} | 43040614.05

If I query rate(ecs_cpu_seconds_total[2m]) * 100

I get very large numbers: 304131.45438179886 and 376788.3762791756 for cores 0 and 1 of the heartbeat container, for instance.

If it matters, I'm using something close to https://github.com/dwp/docker-ecs-service-discovery, which is how Prometheus discovers the containers.

The end goal is to graph the CPU usage of each container.
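
Something like this sketch is what I would expect to end up with (label names taken from the output above; hypothetical until the units make sense):

    # Per-container CPU usage in percent, summed across cores
    sum by (container) (rate(ecs_cpu_seconds_total[5m])) * 100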

SuperQ commented 2 years ago

So in the initial implementation, we were told the counter unit from the ECS metadata endpoint was jiffies. It's not documented at all in the AWS docs.

So I did some digging. It turns out the cpu_stats returned by ECS are actually just pass-through CPU stats from Docker's API.

That API is also not documented very well.

But I finally found this in the upstream moby source.

// CPUUsage stores All CPU stats aggregated since container inception.
type CPUUsage struct {
    // Total CPU time consumed.
    // Units: nanoseconds (Linux)
    // Units: 100's of nanoseconds (Windows)
    TotalUsage uint64 `json:"total_usage"`

    // Total CPU time consumed per core (Linux). Not used on Windows.
    // Units: nanoseconds.
    PercpuUsage []uint64 `json:"percpu_usage,omitempty"`

    // Time spent by tasks of the cgroup in kernel mode (Linux).
    // Time spent by all container processes in kernel mode (Windows).
    // Units: nanoseconds (Linux).
    // Units: 100's of nanoseconds (Windows). Not populated for Hyper-V Containers.
    UsageInKernelmode uint64 `json:"usage_in_kernelmode"`

    // Time spent by tasks of the cgroup in user mode (Linux).
    // Time spent by all container processes in user mode (Windows).
    // Units: nanoseconds (Linux).
    // Units: 100's of nanoseconds (Windows). Not populated for Hyper-V Containers
    UsageInUsermode uint64 `json:"usage_in_usermode"`
}

So it turns out the data is in nanoseconds, not jiffies. This means the value is being converted incorrectly here.
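
A back-of-the-envelope check, assuming the exporter has been dividing the nanosecond counters by USER_HZ (100) as if they were jiffies: the exported values would then be a factor of 1e9 / 100 = 1e7 too large. That matches the report above: the per-core rate of ~3041 CPU-seconds per second becomes ~0.0003 once divided by 1e7, i.e. about 0.03% of a core, which is plausible for a heartbeat container. Until the conversion is fixed, a hypothetical stop-gap query:

    # Hypothetical workaround, only valid while the exporter still
    # applies the jiffies conversion (values ~1e7 times too large)
    rate(ecs_cpu_seconds_total[5m]) / 1e7 * 100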