kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

hpa get cpu utilization from deleted pod #86297

Closed: zjj2wry closed this issue 4 years ago

zjj2wry commented 4 years ago

What happened:

wlx-prd  33m  Warning  FailedComputeMetricsReplicas  HorizontalPodAutoscaler  horizontal-pod-autoscaler  failed to get cpu utilization: did not receive metrics for any ready pods  63d  30  epnu-dataquality.15cd6629a893ce13
$ kubectl get po epnu-dataquality.15cd6629a893ce13 -n wlx-prd
Error from server (NotFound): pods "epnu-dataquality.15cd6629a893ce13" not found
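
A note on reading the event line above: its Kind column says HorizontalPodAutoscaler, and the last column is the event's own name. Events written through client-go's recorder are usually named `<involved object name>.<hex creation time in Unix nanoseconds>`. If this event was emitted that way (an assumption), `epnu-dataquality` is the object the event is about and the suffix is a timestamp, so looking the full string up as a pod would be expected to return NotFound. The helper name `splitEventName` is hypothetical. A minimal sketch in Go:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// splitEventName splits an event name of the assumed form
// "<involved object name>.<hex unix nanoseconds>" into its two parts.
// It is an illustration only, not part of any Kubernetes API.
func splitEventName(eventName string) (string, time.Time, error) {
	i := strings.LastIndex(eventName, ".")
	if i < 0 {
		return "", time.Time{}, fmt.Errorf("no '.' in %q", eventName)
	}
	// The suffix is parsed as a base-16 signed 64-bit integer.
	nanos, err := strconv.ParseInt(eventName[i+1:], 16, 64)
	if err != nil {
		return "", time.Time{}, fmt.Errorf("suffix of %q is not hex: %w", eventName, err)
	}
	return eventName[:i], time.Unix(0, nanos).UTC(), nil
}

func main() {
	name, created, err := splitEventName("epnu-dataquality.15cd6629a893ce13")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("involved object:", name)
	fmt.Println("event created:  ", created.Format(time.RFC3339))
}
```

The decoded object name can then be checked against the HPA and its target workload in the namespace, rather than against pods.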

zjj2wry commented 4 years ago

/sig autoscaling

wy100101 commented 4 years ago

I'm running into this and it basically makes the HPA unusable for me because CPU spikes up during graceful termination due to cleanup activities for our app.

I've looked at the code, as far as I can follow how Kubernetes delete operations work, and it looks like a deleted pod should get skipped: DeletionTimestamp should be set before the TERM signal is sent, so the pod should be skipped by replica_calculator#GroupPods().

Is there a race condition or a bug in the code?

wy100101 commented 4 years ago

Is the issue that the HPA is using a caching lister to get the pods? Maybe the calculation is based on a cached view of the pod and the HPA doesn't realize the pod has been deleted.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

wy100101 commented 4 years ago

I think this is still an issue.

On Tue, Jun 9, 2020, 22:14 Jiajin Zheng notifications@github.com wrote:

Closed #86297 https://github.com/kubernetes/kubernetes/issues/86297.
