google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Metrics Persist After Pod Deletion #3547

Open bizy01 opened 1 week ago

bizy01 commented 1 week ago

Description

We are encountering an issue where container metrics (specifically container_start_time_seconds) persist after the associated Pod has been deleted from our Kubernetes cluster. We collect container metrics through the kubelet.

Scenario:

1. A Pod is scheduled and runs normally.
2. Metrics such as container_start_time_seconds are collected by Prometheus via kubelet.
3. The Pod is subsequently deleted, but metrics for the container (like container_start_time_seconds) continue to be collected and are visible in Prometheus.

This behavior is unexpected and leads to confusion and potential data inaccuracies in our monitoring system.

Steps to Reproduce

1. Deploy a Pod in the Kubernetes cluster.
2. Verify that metrics such as container_start_time_seconds are being collected for the container.
3. Delete the Pod.
4. Observe that container_start_time_seconds metrics for the deleted container are still present in Prometheus.

Expected Behavior

After the Pod is deleted, all associated container metrics should no longer be collected or visible in Prometheus.

Observed Behavior

Metrics for the deleted Pod's container, such as container_start_time_seconds, continue to be collected and are visible in Prometheus even after the Pod has been deleted.
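To check the observation at the source rather than through Prometheus, the kubelet's cAdvisor metrics output can be compared against the Pods that currently exist. The following is a minimal sketch, not part of cAdvisor or kubelet: the function name lingering_series is hypothetical, and it assumes the series carry namespace, pod and container labels, which may differ by version and relabeling rules.

```python
import re

# One sample line of the Prometheus text exposition format, for example:
#   container_start_time_seconds{container="app",namespace="default",pod="web-1"} 1.7e+09 1700000000000
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<rest>.+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def lingering_series(metrics_text, live_pods, metric="container_start_time_seconds"):
    """Return (namespace, pod, container) tuples exposed for Pods not in live_pods.

    metrics_text: raw text scraped from the kubelet cAdvisor metrics endpoint.
    live_pods:    set of (namespace, pod) pairs that currently exist.
    The label names used here are an assumption about the scrape output.
    """
    found = []
    for raw_line in metrics_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        pod = labels.get("pod", "")
        if not pod:
            continue  # series without a pod label (e.g. node-level cgroups)
        key = (labels.get("namespace", ""), pod)
        if key not in live_pods:
            found.append((key[0], key[1], labels.get("container", "")))
    return found
```

The text can be obtained per node, for example with `kubectl get --raw /api/v1/nodes/<node>/proxy/metrics/cadvisor`, and the set of live Pods from `kubectl get pods -A`. If the function returns entries for a Pod that was deleted, the series is still being exposed by the kubelet itself and not only retained by Prometheus.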

Configuration Details

Kubernetes Version: v1.19.9

Additional Context

We suspect this might be related to how kubelet or cAdvisor handles metric collection and cleanup after a Pod deletion. There might be caching or timing issues causing these metrics to persist.
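One way to narrow down the timing question is to look at sample timestamps in Prometheus and see whether new samples keep arriving after the deletion, or whether the series is only visible because of stored history. The sketch below is an illustration under assumptions: the helper name series_with_samples_after and the grace_seconds parameter are hypothetical, and the input is assumed to be the parsed JSON of a Prometheus HTTP API response with a matrix result (for instance a range selector such as `container_start_time_seconds{pod="..."}[1h]`). The default grace of 300 seconds mirrors Prometheus's default 5-minute lookback and may need adjusting.

```python
def series_with_samples_after(response, deleted_at, grace_seconds=300):
    """Return (labels, last_timestamp) for series still sampled after deletion.

    response:      parsed JSON body of a Prometheus HTTP API query returning a matrix.
    deleted_at:    Unix timestamp (seconds) at which the Pod was deleted.
    grace_seconds: tolerance after deletion before a sample counts as lingering
                   (hypothetical parameter, chosen for this sketch).
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    data = response["data"]
    if data.get("resultType") != "matrix":
        raise ValueError("expected a matrix result")

    lingering = []
    for series in data["result"]:
        values = series.get("values", [])
        if not values:
            continue
        # Each entry is [timestamp, "value"]; entries are ordered by time.
        last_ts = float(values[-1][0])
        if last_ts > deleted_at + grace_seconds:
            lingering.append((series["metric"], last_ts))
    return lingering
```

If the last sample timestamp stays close to the deletion time, the series is probably just history within the retention window. If it keeps advancing well past the deletion, the metric is still being scraped from the kubelet.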

We have looked into the following potential causes without success:

- Ensuring that Prometheus is scraping kubelet endpoints correctly.
- Checking for any caching mechanisms in kubelet or cAdvisor that might retain these metrics.
- Reviewing the retention settings in Prometheus.

Request for Assistance

We would appreciate any insights or suggestions on how to resolve this issue. Specifically, we are looking for:

- Confirmation of whether this is expected behavior or a known issue.
- Recommendations for configuration changes or patches that can address this problem.
- Any known workarounds to ensure that metrics for deleted Pods are properly cleaned up.

Thank you for your assistance.