google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

0.45.0 - cadvisor / malformed metrics #3162

Open reefland opened 2 years ago

reefland commented 2 years ago

I have successfully deployed cadvisor 0.45.0 (I tried v0.45.0-containerd-cri as well) as a DaemonSet on K3s / containerd. I've only applied the cadvisor-args.yaml overlay, as the others did not seem relevant.

History

The bundled K3s (v1.24.3+k3s1) containerd is disabled because it does not support the ZFS snapshotter. Instead I'm using the containerd from Ubuntu 22.04 (1.5.9-0ubuntu3); it works perfectly for K3s containers with the ZFS snapshotter, but it does not work properly with kubelet / cAdvisor / Prometheus, because the image= and container= labels are missing. A simple Prometheus query such as:

container_cpu_usage_seconds_total{image!=""}

returned an empty set.

What I See Now

It was suggested I try this cadvisor instead, and it is better... almost, but not quite right. Hopefully I'm just missing something. That same Prometheus query now returns 111 rows; here are three examples:

container_cpu_usage_seconds_total{container="cadvisor", container_label_io_kubernetes_container_name="alertmanager", container_label_io_kubernetes_pod_namespace="monitoring", cpu="total", endpoint="http", id="/kubepods/burstable/pod7c0573cd-bba4-4f94-960f-c54cce2bc50e/5ff787742594c67500f255b9926c305246807e92303b43a19c7b95ba1d13dd59", image="quay.io/prometheus/alertmanager:v0.24.0", instance="10.42.0.143:8080", job="monitoring/cadvisor-prometheus-podmonitor", name="5ff787742594c67500f255b9926c305246807e92303b43a19c7b95ba1d13dd59", namespace="cadvisor", pod="cadvisor-tqbj6"}

container_cpu_usage_seconds_total{container="cadvisor", container_label_io_kubernetes_container_name="application-controller", container_label_io_kubernetes_pod_namespace="argocd", cpu="total", endpoint="http", id="/kubepods/burstable/pod9a033e88-9e20-43ef-8632-4551484be608/cedd2605364b981d2b5ec2d5e1eb6ae23abc39d64acf984b85e4f73b8e0a2689", image="quay.io/argoproj/argocd:v2.4.11", instance="10.42.0.143:8080", job="monitoring/cadvisor-prometheus-podmonitor", name="cedd2605364b981d2b5ec2d5e1eb6ae23abc39d64acf984b85e4f73b8e0a2689", namespace="cadvisor", pod="cadvisor-tqbj6"}

container_cpu_usage_seconds_total{container="cadvisor", container_label_io_kubernetes_container_name="applicationset-controller", container_label_io_kubernetes_pod_namespace="argocd", cpu="total", endpoint="http", id="/kubepods/pod5fc900fe-c754-4fe6-a023-b132ab7b0693/6b7b4511e56a66368c210874739d34df90b229d4b69369556b2e9fcc0971abaa", image="quay.io/argoproj/argocd:v2.4.11", instance="10.42.0.143:8080", job="monitoring/cadvisor-prometheus-podmonitor", name="6b7b4511e56a66368c210874739d34df90b229d4b69369556b2e9fcc0971abaa", namespace="cadvisor", pod="cadvisor-tqbj6"}

What doesn't seem right:

A Prometheus query of container_cpu_usage_seconds_total{image!="",container!="cadvisor"} returns an empty set: every series carries container="cadvisor" (and the pod/namespace of the cadvisor pod itself), while the workload's name only shows up in the container_label_io_kubernetes_* labels.
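A quick way to see this is to group the series by the container label; something like the query below (just a sketch) should show that every row ends up under container="cadvisor":

    count by (container) (container_cpu_usage_seconds_total{image!=""})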

Suggestions?

BBQigniter commented 1 year ago

Had the same issue, and was getting desperate and pulling my hair out. This is the config I finally came up with, and it seems to work with cadvisor v0.46 on a Kubernetes v1.24.8 cluster set up via Rancher.

    # CADVISOR SCRAPE JOB for extra installed cadvisor because of k8s v1.24 with containerd problems where some labels just have empty values on RKE clusters
    - job_name: "kubernetes-cadvisor"
      kubernetes_sd_configs:
        - role: pod  # we get needed info from the pods
          namespaces:
            names: 
              - monitoring  # in namespace monitoring
          selectors:
            - role: pod
              label: "app=cadvisor"  # and only select the cadvisor pods with this label set as source
      metric_relabel_configs:  # we relabel some labels inside the scraped metrics
        # this should look at the scraped metric and replace/add the label inside
        - source_labels: [container_label_io_kubernetes_pod_namespace]
          target_label: "namespace"
        - source_labels: [container_label_io_kubernetes_pod_name]
          target_label: "pod"
        - source_labels: [container_label_io_kubernetes_container_name]
          target_label: "container"

Now the container_* metrics carry the labels needed by the Grafana dashboards we use here for Kubernetes clusters. For example:

container_memory_usage_bytes{container="cadvisor", container_label_io_kubernetes_container_name="cadvisor", container_label_io_kubernetes_pod_name="cadvisor-x6pfx", container_label_io_kubernetes_pod_namespace="monitoring", id="/kubepods/burstable/pod08586cc5-da59-499a-a60b-f7bf859ce7a5/77b8b44fce648487d4ed47dd9b143148e6cccb53ba2a73bfe9277d22f1a305d7", image="sha256:78367b75ee31241d19875ea7a1a6fa06aa42377bba54dbe8eac3f4722fd036b5", instance="10.42.2.139:8080", job="kubernetes-cadvisor", name="k8s_cadvisor_cadvisor-x6pfx_monitoring_08586cc5-da59-499a-a60b-f7bf859ce7a5_0", namespace="monitoring", pod="cadvisor-x6pfx"}

This blog post, https://valyala.medium.com/how-to-use-relabeling-in-prometheus-and-victoriametrics-8b90fc22c4b2, helped a lot in understanding how the different relabel_configs work.
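For what it's worth, the rules above rely on Prometheus' relabeling defaults (action: replace, regex: (.*), replacement: $1). Written out explicitly, the pod rule would look roughly like this (a sketch of just that one rule, under the same job as above):

      metric_relabel_configs:
        # explicit form of the pod rule; action, regex and replacement are
        # the Prometheus defaults, so this behaves the same as the short form
        - source_labels: [container_label_io_kubernetes_pod_name]
          action: replace
          regex: "(.*)"
          replacement: "$1"
          target_label: "pod"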

reefland commented 1 year ago

That helped a little... I now have a working container label, but the external cadvisor still returns no pod label:

container_cpu_usage_seconds_total{namespace="monitoring", container="grafana"}

container_cpu_usage_seconds_total{container="grafana", container_label_io_kubernetes_container_name="grafana", container_label_io_kubernetes_pod_namespace="monitoring", cpu="total", id="/kubepods/besteffort/pode00da7e6-0e0f-4cd9-aa75-b1e9bab32b38/8959546a3f87530a3059f775191d254d43db3a8ccf17bfa98495ab25a869326d", image="docker.io/grafana/grafana:9.2.4", instance="10.42.0.9:8080", job="kubernetes-cadvisor", name="8959546a3f87530a3059f775191d254d43db3a8ccf17bfa98495ab25a869326d", namespace="monitoring"}

Whereas the kubelet cadvisor does have the pod name:

container_cpu_usage_seconds_total{namespace="monitoring", pod=~"grafana.*"}:

container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pode00da7e6-0e0f-4cd9-aa75-b1e9bab32b38", instance="testlinux", job="kubelet", metrics_path="/metrics/cadvisor", namespace="monitoring", node="testlinux", pod="grafana-ff88df95-lbvr2", service="prometheus-kubelet"}
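One way to check whether the external cadvisor is exporting the raw pod-name label at all would be to group on it directly, something like:

    count by (container_label_io_kubernetes_pod_name) (container_cpu_usage_seconds_total{job="kubernetes-cadvisor"})

If that only comes back as a single group with an empty label, the label is missing from the exposition itself rather than being dropped by the relabeling.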

Did you apply any of the overlays, such as cadvisor-args.yaml?

BBQigniter commented 1 year ago

hmm, strange.

Not completely sure what's going on on our systems, as the cadvisor stuff was set up by a colleague who left the company a few weeks ago (and left a mess), so I now have to figure out how to fix the prometheus/prometheus-operator setup etc. :|

TL;DR: I had a look and it seems the cadvisors run with the following arguments :D

--housekeeping_interval=2s 
--max_housekeeping_interval=15s 
--event_storage_event_limit=default=0 
--event_storage_age_limit=default=0 
--enable_metrics=app,cpu,disk,diskIO,memory,network,process 
--docker_only 
--store_container_labels=false 
--whitelisted_container_labels=io.kubernetes.container.name, io.kubernetes.pod.name,io.kubernetes.pod.namespace, io.kubernetes.pod.name,io.kubernetes.pod.name

As you can see, io.kubernetes.pod.name is in there multiple times :shrug:, whereas it appears only once in the example.

reefland commented 1 year ago

Even stranger... I noticed that the only two labels that worked for me were the ones with NO SPACES after the comma in the --whitelisted_container_labels field shown above (from the cadvisor-args.yaml overlay file). I removed the spaces and it started to work!

(screenshot)

Weird.
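In case it helps anyone else, the corrected flag (labels deduplicated and with no spaces after the commas) would look something like:

    --store_container_labels=false
    --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace

Presumably the whitelist entry " io.kubernetes.pod.name" (with the leading space) never matches the actual label name, so that label is silently dropped.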

MisderGAO commented 1 year ago

> Had the same issue, and was getting desperate and pulling my hair out. This is the config I finally came up with, and it seems to work with cadvisor v0.46 on a Kubernetes v1.24.8 cluster set up via Rancher. [...]

I have the same problem with

container_cpu_usage_seconds_total

All the results returned by the PromQL above are missing the image field, which is quite strange. The same monitoring chart works fine on RKE v1.20.8 (without cadvisor).