open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[k8scluster] add k8s.container.status_waiting_reason metric #32457

Open ElfoLiNk opened 6 months ago

ElfoLiNk commented 6 months ago

Component(s)

receiver/k8scluster

Is your feature request related to a problem? Please describe.

I would like to get container state metrics about the waiting reason. One use case is knowing whether a container is in CrashLoopBackOff.

Example from a pod in this state:

kubectl get pod X -o yaml

...
apiVersion: v1
kind: Pod
...
status:
  conditions:
  containerStatuses:
  - containerID: containerd://e7d1583c9d91178c1f649d5d5a4d38f10decbd4a2d921976909e9d6ab5f3ac23
    image: docker.io/otel/opentelemetry-collector-contrib:0.97.0
    imageID: docker.io/otel/opentelemetry-collector-contrib@sha256:42a27d048c35720cf590243223543671e9d9f1ad8537d5a35c4b748fc8ebe873
    lastState:
      terminated:
        containerID: containerd://e7d1583c9d91178c1f649d5d5a4d38f10decbd4a2d921976909e9d6ab5f3ac23
        exitCode: 2
        finishedAt: "2024-04-16T17:30:04Z"
        reason: Error
        startedAt: "2024-04-16T17:29:35Z"
    name: opentelemetry-collector
    ready: false
    restartCount: 11
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=opentelemetry-collector
          pod=opentelemetry-obs-col-2_obs(58012348-343b-4895-a39e-27e49f014ae8)
        reason: CrashLoopBackOff

Kube State Metrics models this as the following Prometheus metric:

kube_pod_container_status_waiting_reason{container=<container-name>, pod=<pod-name>, namespace=<pod-namespace>, reason=<container-waiting-reason>, uid=<pod-uid>}

Ref: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md

So it would be great to have a similar metric.
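For illustration, the sample KSM would emit for the CrashLoopBackOff pod above would look roughly like this (label values taken from the pod status shown earlier; this rendering is illustrative, not captured output):

kube_pod_container_status_waiting_reason{container="opentelemetry-collector", namespace="obs", pod="opentelemetry-obs-col-2", reason="CrashLoopBackOff", uid="58012348-343b-4895-a39e-27e49f014ae8"} 1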

Describe the solution you'd like

  k8s.container.status_waiting_reason:
    enabled: false
    description: Describes the reason the container is currently in the waiting state.
    unit: ""
    attributes:
      - reason
    gauge:
      value_type: int

https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/pod.go#L554-L578
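For reference, a minimal Go sketch of how a per-reason 0/1 value could be derived from a container status, loosely following the linked kube-state-metrics code. The reason list and function names here are illustrative assumptions, not actual receiver code:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Waiting reasons tracked by kube-state-metrics (illustrative subset).
var waitingReasons = []string{
	"ContainerCreating",
	"CrashLoopBackOff",
	"CreateContainerConfigError",
	"CreateContainerError",
	"ErrImagePull",
	"ImagePullBackOff",
	"InvalidImageName",
}

// waitingReasonValues returns one 0/1 value per known reason; the map key
// would become the "reason" attribute of the proposed metric.
func waitingReasonValues(cs corev1.ContainerStatus) map[string]int64 {
	values := make(map[string]int64, len(waitingReasons))
	for _, reason := range waitingReasons {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == reason {
			values[reason] = 1
		} else {
			values[reason] = 0
		}
	}
	return values
}

func main() {
	cs := corev1.ContainerStatus{
		Name: "opentelemetry-collector",
		State: corev1.ContainerState{
			Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"},
		},
	}
	fmt.Println(waitingReasonValues(cs)["CrashLoopBackOff"]) // prints 1
}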

Describe alternatives you've considered

No response

Additional context

No response

github-actions[bot] commented 6 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

povilasv commented 5 months ago

FYI, I've opened a PR on semconv for the last terminated reason (https://github.com/open-telemetry/semantic-conventions/issues/922), and it looks like some refactoring is needed on my PR. So this time let's first agree on whether we want this and then make a PR to semconv.

github-actions[bot] commented 3 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Bhogayata-Keval commented 2 months ago

I see that the k8s.container.status.current_waiting_reason property has been added in Semantic Conventions. Do we need to wait for any more checks before drafting a PR?

I am happy to contribute, if required.

povilasv commented 2 months ago

FYI this was reverted in https://github.com/open-telemetry/semantic-conventions/pull/1115

See the discussion in the original PR: https://github.com/open-telemetry/semantic-conventions/pull/997

github-actions[bot] commented 2 weeks ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

povilasv commented 1 week ago

People keep asking me about this issue, so I think we should solve it somehow in OTel.

I'm thinking of proposing a simple 0/1 state metric to track whether a container is waiting for something. This is what Kube State Metrics does with its kube_pod_container_status_waiting metric.

My proposal is this:

  k8s.container.status.waiting:
    enabled: false
    description: Whether the container is in the waiting state (0 for no, 1 for yes)
    gauge:
      value_type: int

@TylerHelmuth / @dmitryax thoughts?

I think we already have similar metrics in the Cluster Receiver, so it should fit our current model. Example:

  k8s.container.ready:
    enabled: true
    description: Whether a container has passed its readiness probe (0 for no, 1 for yes)
    unit: ""
    gauge:
      value_type: int
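
A minimal sketch in the same spirit of how the 0/1 datapoint for the proposed metric could be computed (hypothetical helper, not actual receiver code; assumes the k8s.io/api/core/v1 types):

package main

import corev1 "k8s.io/api/core/v1"

// containerWaitingValue computes the 0/1 datapoint for the proposed
// k8s.container.status.waiting metric, mirroring how kube-state-metrics
// derives kube_pod_container_status_waiting.
func containerWaitingValue(cs corev1.ContainerStatus) int64 {
	if cs.State.Waiting != nil {
		return 1
	}
	return 0
}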

TylerHelmuth commented 1 week ago

I actually ran into this the other week as well and would like a solution. I thought the semantic convention SIG was blocking us on entities?

povilasv commented 1 week ago

Initially I wanted to add a resource attribute, k8s.container.status.current_waiting_reason, holding the actual reason why the container is in the waiting state, e.g. k8s.container.status.current_waiting_reason=CrashLoopBackOff.

This didn't work due to Resource Attribute immutability.

This new PR actually does a different thing: I'm adding an enum metric that reports whether the container is in the waiting state or not. So it's a metric that tracks container state but doesn't tell you the reason.

Given the current OTel model, the actual reason will probably go to Entities as a non-identifying attribute :thinking: Still, a waiting-state metric IMO makes sense and is useful.