google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

0.45.0 - cadvisor / malformed metrics #3162

Open reefland opened 2 years ago

reefland commented 2 years ago

I have successfully deployed cadvisor 0.45.0 (I tried v0.45.0-containerd-cri as well) as a DaemonSet on K3s / containerd. I've only applied the cadvisor-args.yaml overlay, as the others did not seem relevant.

History

The bundled K3s (v1.24.3+k3s1) containerd is disabled because it does not support the ZFS snapshotter. Instead I'm using the containerd from Ubuntu 22.04 (1.5.9-0ubuntu3); it works perfectly for K3s containers with the ZFS snapshotter, but it does not work properly with kubelet / cAdvisor / Prometheus, because the image= and container= labels are missing. A simple Prometheus query such as:

container_cpu_usage_seconds_total{image!=""}

returned an empty set.

What I See Now

It was suggested I try this cadvisor instead, and it is better... almost, but not quite right. Hopefully I'm just missing something. That same Prometheus query now returns 111 rows; here are three examples:

container_cpu_usage_seconds_total{container="cadvisor", container_label_io_kubernetes_container_name="alertmanager", container_label_io_kubernetes_pod_namespace="monitoring", cpu="total", endpoint="http", id="/kubepods/burstable/pod7c0573cd-bba4-4f94-960f-c54cce2bc50e/5ff787742594c67500f255b9926c305246807e92303b43a19c7b95ba1d13dd59", image="quay.io/prometheus/alertmanager:v0.24.0", instance="10.42.0.143:8080", job="monitoring/cadvisor-prometheus-podmonitor", name="5ff787742594c67500f255b9926c305246807e92303b43a19c7b95ba1d13dd59", namespace="cadvisor", pod="cadvisor-tqbj6"}

container_cpu_usage_seconds_total{container="cadvisor", container_label_io_kubernetes_container_name="application-controller", container_label_io_kubernetes_pod_namespace="argocd", cpu="total", endpoint="http", id="/kubepods/burstable/pod9a033e88-9e20-43ef-8632-4551484be608/cedd2605364b981d2b5ec2d5e1eb6ae23abc39d64acf984b85e4f73b8e0a2689", image="quay.io/argoproj/argocd:v2.4.11", instance="10.42.0.143:8080", job="monitoring/cadvisor-prometheus-podmonitor", name="cedd2605364b981d2b5ec2d5e1eb6ae23abc39d64acf984b85e4f73b8e0a2689", namespace="cadvisor", pod="cadvisor-tqbj6"}

container_cpu_usage_seconds_total{container="cadvisor", container_label_io_kubernetes_container_name="applicationset-controller", container_label_io_kubernetes_pod_namespace="argocd", cpu="total", endpoint="http", id="/kubepods/pod5fc900fe-c754-4fe6-a023-b132ab7b0693/6b7b4511e56a66368c210874739d34df90b229d4b69369556b2e9fcc0971abaa", image="quay.io/argoproj/argocd:v2.4.11", instance="10.42.0.143:8080", job="monitoring/cadvisor-prometheus-podmonitor", name="6b7b4511e56a66368c210874739d34df90b229d4b69369556b2e9fcc0971abaa", namespace="cadvisor", pod="cadvisor-tqbj6"}

What doesn't seem right:

A Prometheus query of container_cpu_usage_seconds_total{image!="",container!="cadvisor"} returns an empty set: every series carries container="cadvisor" (and the pod/namespace of the cadvisor pod itself), while the workload's name only shows up in the container_label_io_kubernetes_* labels.
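A quick way to see this is to group the series by the container label; something like the query below (just a sketch) should show that every row ends up under container="cadvisor":

    count by (container) (container_cpu_usage_seconds_total{image!=""})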

Suggestions?

BBQigniter commented 1 year ago

Had the same issue, and was getting desperate and pulling my hair out. This is the config I finally came up with, and it seems to work with cadvisor v0.46 on a Kubernetes v1.24.8 cluster set up via Rancher.

    # CADVISOR SCRAPE JOB for extra installed cadvisor because of k8s v1.24 with containerd problems where some labels just have empty values on RKE clusters
    - job_name: "kubernetes-cadvisor"
      kubernetes_sd_configs:
        - role: pod  # we get needed info from the pods
          namespaces:
            names: 
              - monitoring  # in namespace monitoring
          selectors:
            - role: pod
              label: "app=cadvisor"  # and only select the cadvisor pods with this label set as source
      metric_relabel_configs:  # we relabel some labels inside the scraped metrics
        # this should look at the scraped metric and replace/add the label inside
        - source_labels: [container_label_io_kubernetes_pod_namespace]
          target_label: "namespace"
        - source_labels: [container_label_io_kubernetes_pod_name]
          target_label: "pod"
        - source_labels: [container_label_io_kubernetes_container_name]
          target_label: "container"

Now the container_* metrics carry the labels needed by the Grafana dashboards we use here for Kubernetes clusters. For example:

container_memory_usage_bytes{container="cadvisor", container_label_io_kubernetes_container_name="cadvisor", container_label_io_kubernetes_pod_name="cadvisor-x6pfx", container_label_io_kubernetes_pod_namespace="monitoring", id="/kubepods/burstable/pod08586cc5-da59-499a-a60b-f7bf859ce7a5/77b8b44fce648487d4ed47dd9b143148e6cccb53ba2a73bfe9277d22f1a305d7", image="sha256:78367b75ee31241d19875ea7a1a6fa06aa42377bba54dbe8eac3f4722fd036b5", instance="10.42.2.139:8080", job="kubernetes-cadvisor", name="k8s_cadvisor_cadvisor-x6pfx_monitoring_08586cc5-da59-499a-a60b-f7bf859ce7a5_0", namespace="monitoring", pod="cadvisor-x6pfx"}

This blog post, https://valyala.medium.com/how-to-use-relabeling-in-prometheus-and-victoriametrics-8b90fc22c4b2, helped a lot in understanding how the different relabel_configs work.
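For what it's worth, the rules above rely on Prometheus' relabeling defaults (action: replace, regex: (.*), replacement: $1). Written out explicitly, the pod rule would look roughly like this (a sketch of just that one rule, under the same job as above):

      metric_relabel_configs:
        # explicit form of the pod rule; action, regex and replacement are
        # the Prometheus defaults, so this behaves the same as the short form
        - source_labels: [container_label_io_kubernetes_pod_name]
          action: replace
          regex: "(.*)"
          replacement: "$1"
          target_label: "pod"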

reefland commented 1 year ago

That helped a little... I now have a working container label, but the external cadvisor still returns no pod label:

container_cpu_usage_seconds_total{namespace="monitoring", container="grafana"}

container_cpu_usage_seconds_total{container="grafana", container_label_io_kubernetes_container_name="grafana", container_label_io_kubernetes_pod_namespace="monitoring", cpu="total", id="/kubepods/besteffort/pode00da7e6-0e0f-4cd9-aa75-b1e9bab32b38/8959546a3f87530a3059f775191d254d43db3a8ccf17bfa98495ab25a869326d", image="docker.io/grafana/grafana:9.2.4", instance="10.42.0.9:8080", job="kubernetes-cadvisor", name="8959546a3f87530a3059f775191d254d43db3a8ccf17bfa98495ab25a869326d", namespace="monitoring"}

Whereas the kubelet cadvisor does have the pod name:

container_cpu_usage_seconds_total{namespace="monitoring", pod=~"grafana.*"}:

container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pode00da7e6-0e0f-4cd9-aa75-b1e9bab32b38", instance="testlinux", job="kubelet", metrics_path="/metrics/cadvisor", namespace="monitoring", node="testlinux", pod="grafana-ff88df95-lbvr2", service="prometheus-kubelet"}
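One way to check whether the external cadvisor is exporting the raw pod-name label at all would be to group on it directly, something like:

    count by (container_label_io_kubernetes_pod_name) (container_cpu_usage_seconds_total{job="kubernetes-cadvisor"})

If that only comes back as a single group with an empty label, the label is missing from the exposition itself rather than being dropped by the relabeling.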

Did you apply any of the overlays, such as cadvisor-args.yaml?

BBQigniter commented 1 year ago

hmm, strange.

Not completely sure what's going on on our systems, as the cadvisor stuff was set up by a colleague who left the company a few weeks ago (and left a mess), so I now have to figure out how to fix the prometheus/prometheus-operator setup etc. :|

TL;DR: I had a look and it seems the cadvisors run with the following arguments :D

--housekeeping_interval=2s 
--max_housekeeping_interval=15s 
--event_storage_event_limit=default=0 
--event_storage_age_limit=default=0 
--enable_metrics=app,cpu,disk,diskIO,memory,network,process 
--docker_only 
--store_container_labels=false 
--whitelisted_container_labels=io.kubernetes.container.name, io.kubernetes.pod.name,io.kubernetes.pod.namespace, io.kubernetes.pod.name,io.kubernetes.pod.name

As you can see, io.kubernetes.pod.name is in there multiple times :shrug:, whereas it appears only once in the example.

reefland commented 1 year ago

Even stranger... I noticed that the only two labels that worked for me were the ones with NO SPACES after the comma in the --whitelisted_container_labels field shown above (from the cadvisor-args.yaml overlay file). I removed the spaces and it started to work!

(screenshot)

Weird.
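In case it helps anyone else, the corrected flag (labels deduplicated and with no spaces after the commas) would look something like:

    --store_container_labels=false
    --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace

Presumably the whitelist entry " io.kubernetes.pod.name" (with the leading space) never matches the actual label name, so that label is silently dropped.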

MisderGAO commented 1 year ago

> Had the same issue, and was getting desperate and pulling my hair out. This is the config I finally came up with, and it seems to work with cadvisor v0.46 on a Kubernetes v1.24.8 cluster set up via Rancher. [...]

I have the same problem with

container_cpu_usage_seconds_total

All the results returned by the PromQL above are missing the image field, which is quite strange. The same monitoring chart works fine on RKE v1.20.8 (without cadvisor).