kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Directly emit container ready time metric #2119

Open bbdouglas opened 1 year ago

bbdouglas commented 1 year ago

What would you like to be added:

It would be great to have a metric for the container ready time in seconds to be emitted directly. There is currently a boolean gauge kube_pod_container_status_ready, which emits whether the container is ready or not, but that requires some computation to get at the time when the container flipped to the ready state. I'm interested in learning the amount of time it took between when the container started and when it was ready, and that would be simpler and more efficient to measure if kube-state-metrics emitted the ready time directly.

There was a similar metric added at the pod level (#1465), but this would be at the container level. In the pods that I am tracking, there are many containers with wildly varying ready times, so it is helpful for debugging and optimization purposes to know how long each container takes to get ready.

Why is this needed:

Similar to the pod-level ready time metric (#1465), I'd like to measure the ready time of each individual container within my pod. This is helpful for tracking startup times at a finer level of granularity than the whole pod, especially when a pod has many containers.

It is possible to calculate this from the existing boolean gauge kube_pod_container_status_ready by looking at a series of data points and choosing the first point in time when the flag flips from false to true, but in practice that can be very resource intensive for Prometheus if there are a large number of pods or containers.
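To make the workaround concrete, here is a minimal Python sketch of the calculation described above. It assumes the samples for one container's kube_pod_container_status_ready series have already been fetched as (timestamp, value) pairs; the function name and the sample data are hypothetical and not part of kube-state-metrics or Prometheus.

```python
from typing import Iterable, Optional, Tuple

# One scraped sample: (unix timestamp in seconds, gauge value 0 or 1).
Sample = Tuple[float, int]


def first_ready_timestamp(samples: Iterable[Sample]) -> Optional[float]:
    """Return the earliest timestamp at which the ready gauge was 1.

    Every sample in the lookback window has to be inspected, which is
    why this gets expensive across many pods and containers.
    """
    ready_timestamps = [ts for ts, value in samples if value == 1]
    return min(ready_timestamps) if ready_timestamps else None


# Hypothetical samples scraped every 15 seconds.
samples = [(100.0, 0), (115.0, 0), (130.0, 1), (145.0, 1)]
print(first_ready_timestamp(samples))  # prints 130.0
```

Note that the result is only as precise as the scrape interval: the container may have become ready at any point between the last 0 sample and the first 1 sample.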

Describe the solution you'd like

I would ideally like to see a new metric analogous to kube_pod_status_ready_time emitted at the container granularity.

Additional context

I'm not that familiar with the internals of the Kubernetes API, but unfortunately it does not look like ContainerStatus has the same breadth of information as PodCondition, which includes a LastTransitionTime. So this might not be a simple addition.
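As a rough illustration of that gap, the sketch below reads a pod status in its JSON form. The field names (conditions, lastTransitionTime, containerStatuses, state.running.startedAt) follow the core/v1 API serialization; the helper names and the example values are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def parse_time(value: str) -> datetime:
    # Kubernetes serializes these timestamps as RFC 3339 in UTC.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def pod_ready_time(status: dict) -> Optional[datetime]:
    """Pod level: the Ready condition carries a lastTransitionTime."""
    for condition in status.get("conditions", []):
        if condition["type"] == "Ready" and condition["status"] == "True":
            return parse_time(condition["lastTransitionTime"])
    return None


def container_start_times(status: dict) -> Dict[str, datetime]:
    """Container level: only the start time of a running container is
    recorded; there is no timestamp for when `ready` became true."""
    started = {}
    for container in status.get("containerStatuses", []):
        running = container.get("state", {}).get("running")
        if running:
            started[container["name"]] = parse_time(running["startedAt"])
    return started


# Hypothetical pod status, trimmed to the relevant fields.
EXAMPLE_STATUS = {
    "conditions": [
        {"type": "Ready", "status": "True",
         "lastTransitionTime": "2023-07-01T12:00:30Z"},
    ],
    "containerStatuses": [
        {"name": "app", "ready": True,
         "state": {"running": {"startedAt": "2023-07-01T12:00:05Z"}}},
    ],
}

print(pod_ready_time(EXAMPLE_STATUS))
print(container_start_times(EXAMPLE_STATUS))
```

The pod-level function has a transition timestamp to return, while the container-level one can only report when each container started, which is why a per-container ready time cannot simply be copied out of the status.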

dashpole commented 1 year ago

/triage accepted
/assign @dgrisonnet

dgrisonnet commented 1 year ago

The container level metric should already be available: https://github.com/kubernetes/kube-state-metrics/blob/02417fbc99f3adec84834fc59d5f89cf676ce006/internal/store/pod.go#L1342

bbdouglas commented 1 year ago

Hi @dgrisonnet, thanks for looking into this.

Unfortunately, I believe the metric you pointed to is actually at the pod level, representing the time that all containers are ready (ContainersReady). From the comments in the api:

// ContainersReady indicates whether all containers in the pod are ready.

dgrisonnet commented 1 year ago

Correct, the name got me.

We should probably base kube_pod_status_container_ready_time on ContainerStatus rather than on the pod status.

abhiraut commented 10 months ago

> It is possible to use the existing boolean kube_pod_container_status_ready boolean to calculate this by looking at a series of data points and choosing the first point in time when that flag flips from false to true

@bbdouglas I am curious how you currently calculate this with promQL?

bbdouglas commented 10 months ago

@abhiraut Here is the query I came up with. Since it's looking back, you have to manually set the maximum age that you expect a pod to be up. Here I have assumed no pod lives for more than 1 day.

min_over_time(timestamp(kube_pod_container_status_ready{container="mycontainer", pod_phase="Running"} == 1)[1d:])

abhiraut commented 10 months ago

Thanks! @dgrisonnet, do you think we can emit the ready time directly? I think it would be helpful and consistent with how readiness is emitted at the pod level.