google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Kubernetes Pod / ECS Task Level cgroup metrics with labels #2658

Open eightnoteight opened 4 years ago

eightnoteight commented 4 years ago

Kubernetes Pod-level and ECS Task-level metrics are unfortunately exposed only with their cgroup id, whereas the cgroups of individual containers get the benefit of the added container labels.

Currently we work around this in Prometheus by joining the pod-level metrics with the matching child container's metrics, which carry the pod name labels.

For example, we are currently using the query below to get the total CPU percentage in use at the pod level:

# problem: how to get used cpu% of a task
# solution:

label_replace(
  (
    (rate(container_cpu_user_seconds_total{cluster_arn=~"[[cluster]]"}[5m]) + rate(container_cpu_system_seconds_total{cluster_arn=~"[[cluster]]"}[5m]))
    /
    (container_spec_cpu_quota{cluster_arn=~"[[cluster]]"} / container_spec_cpu_period{cluster_arn=~"[[cluster]]"})
  ) * 100,
  "id",
  "$3",
  "id",
  "/(.*)/(.*)/(.*)"
) +

on(id) (
  sum by (id) (
    label_replace(
      container_start_time_seconds{
        image != "",
        name!="ecs-agent",
        cluster_arn=~"[[cluster]]",
        container_label_com_amazonaws_ecs_task_definition_family=~"[[task_family]]"
      },
      "id",
      "$3",
      "id",
      "/(.*)/(.*)/(.*)/(.*)"
    ) * 0
   )
)

We are matching the third directory level of the pod's cgroup path (available in container_cpu_user_seconds_total) against the third directory level of the container's cgroup path (available in container_start_time_seconds).

Because the task's (pod in EKS) cgroup path is /ecs/<cluster-name>/<task-id> and a container inside the task has a cgroup path like /ecs/<cluster-name>/<task-id>/<container-id>, we are able to derive pod-level CPU percentage metrics.

I'm looking for a way in cAdvisor to export some of a child cgroup's Docker labels onto the parent cgroup's metrics. One approach I had in mind was to declare the labels via a command-line option (e.g. -docker_cgroup_extract_labels=container_label_com_amazonaws_ecs_task_arn,container_label_io_kubernetes_pod_name); those labels would be extracted from the child cgroup's container labels and assigned to the immediate cgroup parents.

Our end goal is a simplified query for the above problem:

# problem: how to get used cpu% of a task
# solution:

(
  rate(container_cpu_user_seconds_total{cluster_arn=~"[[cluster]]", container_label_com_amazonaws_ecs_task_arn=~".*cart-service.*"}[5m]) 
    + rate(container_cpu_system_seconds_total{cluster_arn=~"[[cluster]]", container_label_com_amazonaws_ecs_task_arn=~".*cart-service.*"}[5m])
)
/
(
  container_spec_cpu_quota{cluster_arn=~"[[cluster]]", container_label_com_amazonaws_ecs_task_arn=~".*cart-service.*"}
    / container_spec_cpu_period{cluster_arn=~"[[cluster]]", container_label_com_amazonaws_ecs_task_arn=~".*cart-service.*"}
)
dashpole commented 4 years ago

I wonder if it would be easier to join with something like kube_pod_info from KSM using the pod UID...

Container metrics usually have pod labels attached to them. Can you just aggregate the container metrics, or does that not work for your use-case?

eightnoteight commented 4 years ago

I wonder if it would be easier to join with something like kube_pod_info from KSM using the pod UID...

Container metrics usually have pod labels attached to them. Can you just aggregate the container metrics, or does that not work for your use-case?

Hi @dashpole, thanks for the quick reply on this. Yes, that works for my use case, but it adds about the same level of complexity; in fact I'm using an equivalent solution in ECS, i.e. container_start_time_seconds instead of kube_pod_info. Hypothetically, since kube_pod_info exposes the pod UID, which is a substring of the cAdvisor cgroup path, I can remove one label_replace to shorten the query, but the end result is still fairly complex. One other thing I want to point out: even though this simplifies things for Kubernetes, simplifying it for ECS would require building a similar tool (like kube-state-metrics).

I haven't tested the query below, as we currently haven't deployed kube-state-metrics:

label_replace(
  (
    (rate(container_cpu_user_seconds_total{}[5m]) + rate(container_cpu_system_seconds_total{}[5m]))
    /
    (container_spec_cpu_quota{} / container_spec_cpu_period{})
  ) * 100,
  "uid",
  "$3",
  "id",
  "/(.*)/(.*)/pod(.*)"
) +

on(uid) (
  sum by (uid) (
      kube_pod_info{
        created_by_kind="Deployment",
        created_by_name="[[deployment]]"
      } * 0
   )
)

Basically, I'm extracting the pod UID from the cgroup path provided in the id label and joining with the kube_pod_info metric, as both series share the uid label after that label_replace.

One other thing I missed at the start was the query performance degradation caused by the join (theoretically, enriching the metrics ahead of time should help).

Because of that, we are currently working on enriching these metrics at ingestion time using Prometheus recording rules, but I'm still looking for a straightforward built-in solution via cAdvisor if possible.
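As a rough sketch, the kind of recording rule this would involve could look like the following (the record and label names here are just placeholders, and the [[cluster]] dashboard variable has to be dropped since recording rules can't use template variables):

groups:
  - name: ecs_task_cpu
    rules:
      # pre-compute task-level cpu% with the task id pulled out into its own label,
      # so dashboard queries only need a plain selector; in practice the join with
      # container_start_time_seconds would stay here to restrict this to task-level cgroups
      - record: ecs_task:cpu_used:percent
        expr: |
          label_replace(
            (
              (rate(container_cpu_user_seconds_total[5m]) + rate(container_cpu_system_seconds_total[5m]))
              /
              (container_spec_cpu_quota / container_spec_cpu_period)
            ) * 100,
            "task_id", "$3", "id", "/(.*)/(.*)/(.*)"
          )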

dashpole commented 4 years ago

I don't have a great answer for you, sadly. To cAdvisor, the pod cgroup is just another cgroup. In theory, cAdvisor could become pod-aware and query the kubelet or apiserver, for example, to get pod metadata. But that would tie it to Kubernetes, which we have tried not to do.

Can you sum container metrics? I.e. filter out container = "" (which is the pod), and sum over the pod dimension?
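For example, something along these lines (label names depend on how cAdvisor is scraped, so treat this as a sketch rather than a copy-paste query):

# sum per-container cpu usage up to the pod level, skipping the pod cgroup
# (container="") and the pause/sandbox container (image="")
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
)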

eightnoteight commented 4 years ago

@dashpole

Got it, yeah, it does seem like a bad idea to tie cAdvisor to Kubernetes or ECS.

One other thought I had was to add a feature similar to container_hints. Just as container_hints adds networking information that is managed outside of Docker, pod-level cgroups could be enriched with labels that are managed outside of the cgroup (i.e. by Kubernetes, ECS, etc.); this could be done, similarly to container_hints, with something like a cgroup_hints.json file.
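Purely as an illustration (nothing like this exists today, and the format below is just a guess loosely modeled on container_hints), a cgroup_hints.json could look something like:

{
  "cgroup_hints": [
    {
      "full_path": "/ecs/<cluster-name>/<task-id>",
      "labels": {
        "com.amazonaws.ecs.task-arn": "arn:aws:ecs:<region>:<account-id>:task/<task-id>"
      }
    }
  ]
}

cAdvisor would then attach those labels to the metrics of the matching cgroup path, while whoever manages the cgroup (the ECS agent, the kubelet, etc.) would be responsible for writing the hints file.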

I also found an article/patch about adding userland attributes to cgroups as part of cgroupfs itself, but it doesn't seem to have been accepted, due to concerns around reliability (limits on attribute values), simplicity, and putting userland features in kernel space. The mail thread in the article also mentions keeping such enrichments outside of cgroupfs:

https://lwn.net/Articles/484680/