kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

deleted pods still reporting metrics #1569

Closed: jpdstan closed this issue 2 years ago

jpdstan commented 3 years ago

What happened:

It seems that sometimes metrics don't get deleted alongside the pod; the stale series persist until we churn all the kube-state-metrics pods, which fixes it.

What's even stranger is that not all of the deleted pod's metrics linger. For example, for one deleted pod, we noticed it was still reporting kube_pod_container_status_waiting_reason, but not kube_pod_container_resource_requests.

What you expected to happen:

When a pod gets deleted, all metrics associated with that pod should also be deleted.

How to reproduce it (as minimally and precisely as possible):

It's unclear how this happens. Whenever we try to reproduce it by manually deleting a pod and querying for all of its metrics ({pod="my_pod"}), everything works fine, i.e. the metrics all disappear.
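
For reference, the manual check looks roughly like this (pod name, namespace, and the service address are placeholders, not from our cluster):

# delete the pod and wait until the apiserver no longer knows about it
kubectl delete pod my-pod -n my-namespace
kubectl wait --for=delete pod/my-pod -n my-namespace --timeout=60s

# scrape kube-state-metrics directly and look for leftover series
kubectl port-forward -n kube-system svc/kube-state-metrics 8080:8080 &
curl -s localhost:8080/metrics | grep 'pod="my-pod"'   # expect no output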

Anything else we need to know?:

Environment:

fpetkovski commented 3 years ago

This could be related to https://github.com/kubernetes/kube-state-metrics/issues/694

fredr commented 3 years ago

Have you checked via kubectl that the pods in this state are actually deleted, and not in some non running state, such as Completed or Evicted?
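
For example, a quick check (pod name and namespace are placeholders, not from this issue):

# returns NotFound if the pod is truly gone
kubectl get pod my-pod -n my-namespace

# lists pods in any non-Running phase; Completed pods show phase Succeeded,
# and Evicted pods show phase Failed
kubectl get pods -A --field-selector=status.phase!=Running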

jpdstan commented 3 years ago

@fredr Yes, they are definitely deleted.

irl-segfault commented 2 years ago

Same thing happening to me on EKS

jpdstan commented 2 years ago

Seeing another instance of this. These two metrics existed at the same time for the pod named taskmanager-0; the IP addresses differ because one is the old IP and the other is the current one.

kube_pod_labels{
 host="1.1.147.202",
 instance="1.1.147.202:9102",
 job="kubernetes-pods-k8s-production",
 kubernetes_namespace="kube-system",
 kubernetes_pod_name="kube-state-metrics-4",
 pod="taskmanager-0",
 ...
}

kube_pod_labels{
 host="1.1.188.37",
 instance="1.1.188.37:9102",
 job="kubernetes-pods-k8s-production",
 kubernetes_namespace="kube-system",
 kubernetes_pod_name="kube-state-metrics-8",
 pod="taskmanager-0",
 ...
}
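
For what it's worth, a query that surfaces pods reported by more than one kube-state-metrics replica at the same time (a sketch; the exact label names depend on your relabeling setup):

count by (namespace, pod) (kube_pod_labels) > 1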

boniek83 commented 2 years ago

Happens to me with kube_pod_container_resource_requests and "Terminated" pods (not yet removed by the terminated-pod garbage collector). KSM version: kube-state-metrics/kube-state-metrics:v2.4.1. I would expect kube_pod_container_resource_requests not to return terminated pods, or at least to label them correctly so I can filter them out.
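
One query-time workaround is to join against kube_pod_status_phase so that only currently running pods survive (a sketch; standard KSM label names assumed):

kube_pod_container_resource_requests
  * on (namespace, pod) group_left ()
    (max by (namespace, pod) (kube_pod_status_phase{phase="Running"}) == 1)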

fpetkovski commented 2 years ago

This case is expected, since KSM exposes everything from the apiserver. If you are not interested in terminated pods, you can drop those series using relabeling.
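
For example, a minimal Prometheus scrape-config sketch (job name and target are placeholders); metric_relabel_configs runs after each scrape, so it can drop individual series before they are ingested:

scrape_configs:
  - job_name: kube-state-metrics                               # placeholder
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]   # placeholder
    metric_relabel_configs:
      # drop the resource-requests series; narrow the regex, or match on
      # another source label such as pod, to target specific series instead
      - source_labels: [__name__]
        regex: kube_pod_container_resource_requests
        action: drop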

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/kube-state-metrics/issues/1569#issuecomment-1229403814):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues and PRs according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue or PR with `/reopen`
>- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.