google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Addition of an instantaneous CPU Usage Metric #837

Open alexmavr opened 9 years ago

alexmavr commented 9 years ago

Even though the existing cumulative CPU Usage metric is great, as it allows users to get usage up to nanosecond granularity, in a large set of use cases an instantaneous CPU Usage metric for a standardized duration (e.g. 1 sec) would suffice and be more helpful.
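To make the request concrete, here is a minimal sketch of the derivation every consumer currently has to do on their own: take two samples of the cumulative counter and divide the usage delta by the elapsed time. The `CpuSample` type and `instantaneous_cpu_cores` function are hypothetical names for illustration, not part of cAdvisor's API, and the sketch assumes both the timestamp and the cumulative usage are in nanoseconds.

```python
from typing import NamedTuple


class CpuSample(NamedTuple):
    """One reading of a cumulative CPU counter (hypothetical type)."""
    timestamp_ns: int  # time the sample was taken, in nanoseconds
    usage_ns: int      # total CPU time consumed so far, in nanoseconds


def instantaneous_cpu_cores(prev: CpuSample, curr: CpuSample) -> float:
    """Average number of cores used between two cumulative samples."""
    elapsed_ns = curr.timestamp_ns - prev.timestamp_ns
    if elapsed_ns <= 0:
        raise ValueError("samples must be in increasing time order")
    # CPU-nanoseconds consumed per wall-clock nanosecond = cores in use.
    return (curr.usage_ns - prev.usage_ns) / elapsed_ns


# Two samples taken one second apart; 0.5 s of CPU time was consumed.
prev = CpuSample(timestamp_ns=0, usage_ns=2_000_000_000)
curr = CpuSample(timestamp_ns=1_000_000_000, usage_ns=2_500_000_000)
rate = instantaneous_cpu_cores(prev, curr)
```

With a standardized 1 sec window, the value would be this same ratio computed by the agent once rather than by every consumer.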

This would address two issues:

mikedanese commented 9 years ago

It's also important that we separate the responsibility of exporting raw metrics and transforming/aggregating/deriving statistics. Since cAdvisor is a node agent, cpu/memory is at a premium and many use cases will prefer to move raw metrics off the node before applying transformations. Maybe this could be implemented in a storage backend?

mikedanese commented 9 years ago

It's also possible that this info is already available from accounting in which case we should just expose it...

alexmavr commented 9 years ago

I guess this choice boils down to the design choice of functionality separation versus usability and consistency.

Cadvisor is already performing more advanced node-level derivations than this, such as 90th percentiles and moving averages, with more such stats coming up in the following weeks.
In an ideal world, CAdvisor would only export raw metrics, and another layer on top of it would perform derivation and aggregation, similar to what is now partially done in CAdvisor, Heapster, Kubelets and timeseries database clients.

However, in terms of consistency and usability, cpu_cumulative_usage is the only cumulative stat that CAdvisor exposes. If data precision and functionality isolation are really a primary concern, then it would only make sense to export all other stats as their cumulative versions as well, and to implement a configurable layer on top of CAdvisor that can extract instantaneous and derived stats from these cumulative metrics.

However, since that layer does not exist, CAdvisor has already been bloated with metric derivations for hardcoded time periods, as these stats are what's useful to a large set of cadvisor users.

A simple search of CAdvisor and Heapster issues for "cpu usage" or "cpu_cumulative_usage" reveals that the cpu usage stat is problematic for many users, as it forces them to implement their own handling logic for this one metric. Not to mention that there are cases, such as crash-looping containers, where the end user would also have to check a container's uptime to be able to derive the instantaneous CPU usage, whereas instantaneous memory, network and filesystem metrics are readily available.
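As an illustration of the crash-looping case, here is a rough sketch of the extra handling a consumer needs when the cumulative counter can reset between two samples. The function name, its parameters, and the fallback of dividing by uptime are assumptions made for this example; it also assumes the consumer can obtain the container's start time in the same nanosecond clock as the sample timestamps.

```python
from typing import Optional


def cpu_cores_with_restart(
    prev_ts_ns: int,
    prev_usage_ns: int,
    curr_ts_ns: int,
    curr_usage_ns: int,
    start_ts_ns: int,
) -> Optional[float]:
    """Derive cores in use from a cumulative counter that may have reset.

    Hypothetical helper: returns None when no meaningful rate can be derived.
    """
    if curr_ts_ns <= prev_ts_ns:
        return None  # samples out of order or duplicated

    # A restart between the two samples resets the cumulative counter,
    # so the plain delta would be negative or misleading.
    restarted = start_ts_ns > prev_ts_ns or curr_usage_ns < prev_usage_ns
    if restarted:
        uptime_ns = curr_ts_ns - start_ts_ns
        if uptime_ns <= 0:
            return None
        # Only the time since the restart counts toward the current counter.
        return curr_usage_ns / uptime_ns

    return (curr_usage_ns - prev_usage_ns) / (curr_ts_ns - prev_ts_ns)


# Steady container: 0.25 s of CPU time over a 1 s window.
steady = cpu_cores_with_restart(
    10_000_000_000, 4_000_000_000, 11_000_000_000, 4_250_000_000, 0
)

# Container restarted 0.5 s before the second sample and used 0.1 s of CPU since.
restarted = cpu_cores_with_restart(
    10_000_000_000, 4_000_000_000, 11_000_000_000, 100_000_000, 10_500_000_000
)
```

None of this bookkeeping is needed for the memory, network or filesystem stats, which is the inconsistency described above.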

Finally, the existing approach of exporting only cpu usage as a cumulative stat propagates a design problem of cadvisor (separation of raw data exporting and metric derivation) to the codebase of each consumer. For that reason, CAdvisor should be able to serve both types of users in a consistent pattern:

cc @rjnagal @vmarmol @vishh