google / cadvisor

Analyzes resource usage and performance characteristics of running containers.
Other
16.61k stars 2.29k forks source link

Add documentation for the HousekeepingInterval parameter and enforce validation for it. #3517

Open dsxing opened 2 months ago

dsxing commented 2 months ago

In my testing Kubernetes environment, I observed that if the HousekeepingInterval of cadvisor is set very high by using --housekeeping-interval, for example, greater than 2 minutes, then obtaining container data via kubelet returns null. image

Upon analyzing the cadvisor code, I found that this is due to the fact that the RecentStats of containers only contain one stat data in cadvisor's timed_store. https://github.com/google/cadvisor/blob/master/manager/manager.go#L529; Consequently, it becomes impossible to calculate the CpuStats of containers, resulting in a nil value for stat.CpuInst(https://github.com/google/cadvisor/blob/master/info/v2/conversion.go#L192, https://github.com/google/cadvisor/blob/master/info/v2/conversion.go#L234);

When kubelet aggregates metrics data for all containers on a node, it filters container data. _https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/stats/cadvisor_stats_provider.go#L97_ ; Due to stat.CpuInst=nil,kubelet considers the container is Terminated, _https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/stats/cadvisor_stats_provider.go#L399_ ,resulting in a null value for 'containers' in the final.

The reason why RecentStats of containers in cadvisor's timed_store only contain one data point is because the HousekeepingInterval set by --housekeeping-interval exceeds the expiration period of data in timed_store, In kubelet, this expiration period is defined as 2 minutes by default so the memoryCache's maxAge is 2minutes. _https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cadvisor/cadvisor_linux.go#L55_ ;Therefore, each time housekeeping polls to write data to timed_store, previous data has already expired and been deleted. _https://github.com/google/cadvisor/blob/master/utils/timed_store.go#L78_

Based on this analysis, I believe we should add documentation for the HousekeepingInterval parameter and enforce validation for it, which is setted by --housekeeping-interval.

k8s-ci-robot commented 2 months ago

Hi @dsxing. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dsxing commented 2 months ago

@iwankgb Is it possible to handle the code logic in this way, or should we just add a note for clarification?