kubernetes / kubernetes

Pod stays in "OutOfpods" state on a scaled out node created by cluster autoscaler. #119960

Closed abhijit-dev82 closed 1 month ago

abhijit-dev82 commented 1 year ago

What happened?

On a Kubernetes 1.26 cluster with the cluster autoscaler enabled (min size 1, max size 5), I scaled an application deployment out to 350 replicas. The node pool started with a single worker node; after the deployment was scaled out, Pods went to Pending and triggered the cluster autoscaler to scale out to 4 worker nodes.

```console
root [ ~ ]# k get md -n autoscaler-cc265-11ns1
NAME                                          CLUSTER                     REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE       AGE   VERSION
autoscaler-cc265-11ns1-c1-np-1-worker-7wmpd   autoscaler-cc265-11ns1-c1   4          1       4         3             ScalingUp   38h   v1.26.5
```

```console
root [ ~ ]# kcl get nodes
NAME                                                       STATUS     ROLES           AGE   VERSION
autoscaler-cc265-11ns1-c1-np-1-worker-7wmpd-696954d5h2p9   NotReady   <none>          13s   v1.26.5
autoscaler-cc265-11ns1-c1-np-1-worker-7wmpd-696954d9nz94   NotReady   <none>          1s    v1.26.5
autoscaler-cc265-11ns1-c1-np-1-worker-7wmpd-696954db68m8   NotReady   <none>          11s   v1.26.5
autoscaler-cc265-11ns1-c1-np-1-worker-7wmpd-696954djqbn9   Ready      <none>          77m   v1.26.5
autoscaler-cc265-11ns1-c1-vzxhp-6kx88                      Ready      control-plane   38h   v1.26.5+
```

After the nodes become Ready, Pods get scheduled on the new nodes, but one Pod was observed going into the "OutOfPods" state, as shown in the describe output below.

```console
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/name=argocd-dex-server
Events:
  Type     Reason            Age                From                Message
  ----     ------            ----               ----                -------
  Normal   TriggeredScaleUp  25m                cluster-autoscaler  pod triggered scale-up: [{MachineDeployment/autoscaler-cc265-11ns1/autoscaler-cc265-11ns1-c1-np-1-worker-7wmpd 1->4 (max: 5)}]
  Warning  FailedScheduling  24m (x2 over 25m)  default-scheduler   0/2 nodes are available: 1 Too many pods, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  22m                default-scheduler   0/5 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 Too many pods, 2 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..
  Normal   Scheduled         22m                default-scheduler   Successfully assigned default/argocd-dex-server-58cb8749b4-zmzbj to autoscaler-cc265-11ns1-c1-np-1-worker-7wmpd-696954db68m8
  Warning  OutOfpods         22m                kubelet             Node didn't have enough resource: pods, requested: 1, used: 110, capacity: 110
root@4211450cf3409357f8aea6c23011ec78 [ ~ ]#
```
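To confirm which Pods were rejected this way, a quick check (a sketch, not from the original report; Pods rejected by the kubelet end up in phase `Failed` with the rejection reason recorded in the status):

```console
# List failed Pods along with the node they were bound to and the kubelet's rejection reason.
$ kubectl get pods -A --field-selector=status.phase=Failed \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,REASON:.status.reason
```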

What did you expect to happen?

The Pod that cannot be accommodated on a worker node should not have been scheduled onto it and should have stayed in the "Pending" state. That would have triggered a cluster autoscaler scale-out, and the Pod would then have been scheduled on the new node.

How can we reproduce it (as minimally and precisely as possible)?

Create a cluster with the Cluster Autoscaler enabled, with min = 1 and max = 5. Scale an application deployment out to more than 300 replicas, for example as sketched below.
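A minimal sketch of the reproduction, assuming a throwaway deployment; the deployment name `scale-test` and the pause image are placeholders, not from the report:

```console
# Create a small best-effort deployment (name and image are illustrative).
$ kubectl create deployment scale-test --image=registry.k8s.io/pause:3.9

# Scale well beyond what a single worker can hold (the default kubelet
# limit is 110 pods per node), leaving Pods Pending and triggering the
# cluster autoscaler to add worker nodes.
$ kubectl scale deployment scale-test --replicas=350
```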

Anything else we need to know?

Slack discussion: https://kubernetes.slack.com/archives/C09TP78DV/p1691159010212509

The scheduler does not take into account that static Pods may still be coming up on a newly scaled-out node before it schedules the Pending Pods onto it. As a result, a scheduled Pod can be pushed into the "OutOfPods" state by the kubelet after landing on the new node.
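A rough way to observe the mismatch on an affected node (a sketch; `<node-name>` is a placeholder): compare the node's advertised pod capacity with the number of Pods the kubelet is already tracking, which includes mirror Pods for static Pods that the scheduler never placed.

```console
# Pod capacity advertised by the node (110 by default).
$ kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}'

# All Pods bound to the node, including mirror Pods for static Pods.
$ kubectl get pods -A --field-selector spec.nodeName=<node-name> --no-headers | wc -l
```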

Kubernetes version

```console
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.4", GitCommit:"431e801e781737a2d1347c449f3c8d284395a5d7", GitTreeState:"clean", BuildDate:"2023-06-22T02:12:35Z", GoVersion:"go1.19.8 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.5.1", GitCommit:"23a5ab39c3188caaf651128b0dfed523eecd8023", GitTreeState:"clean", BuildDate:"2023-07-03T15:25:46Z", GoVersion:"go1.19.9 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
```

Cloud provider

VSphere

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="VMware Photon OS"
VERSION="3.0"
ID=photon
VERSION_ID=3.0
PRETTY_NAME="VMware Photon OS/Linux"
ANSI_COLOR="1;34"
HOME_URL="https://vmware.github.io/photon/"
BUG_REPORT_URL="https://github.com/vmware/photon/issues"

$ uname -a
Linux 4211450cf3409357f8aea6c23011ec78 4.19.283-3.ph3-esx #1-photon SMP Fri Jun 16 02:25:00 UTC 2023 x86_64 GNU/Linux
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

abhijit-dev82 commented 1 year ago

/sig scheduling
/sig autoscaling

alculquicondor commented 1 year ago

/remove-sig scheduling
/remove-sig autoscaling
/sig node
/sig networking

I think the fix here should be to make kube-proxy part of the Node readiness checks.

k8s-ci-robot commented 1 year ago

@alculquicondor: The label(s) sig/networking cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes/kubernetes/issues/119960#issuecomment-1680500663):

> /remove-sig scheduling
> /remove-sig autoscaling
> /sig node
> /sig networking
>
> I think the fix here should be to make kube-proxy part of the Node readiness checks.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

aojea commented 1 year ago

> I think the fix here should be to make kube-proxy part of the Node readiness checks.

But is this cluster running kube-proxy as a static pod? The issue says this is a vSphere cluster with a VMware OS; I thought those were using Antrea ...

And what happens if there are more static pods? You would have to add all the static pods to the node readiness check; that would solve the scheduling problem, but it would impact node startup readiness.
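For reference, one way to see which Pods on a node are static Pods (a sketch assuming `jq` is available; the kubelet creates mirror Pods for them carrying the `kubernetes.io/config.mirror` annotation, and `<node-name>` is a placeholder):

```console
# List the mirror Pods (i.e. static Pods) running on a given node.
$ kubectl get pods -A --field-selector spec.nodeName=<node-name> -o json \
    | jq -r '.items[] | select(.metadata.annotations["kubernetes.io/config.mirror"] != null)
             | .metadata.namespace + "/" + .metadata.name'
```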

alculquicondor commented 1 year ago

I see, thanks for the clarification.

But more generally, any static pod would cause problems for scheduling.

> you have to add all the static pods as part of the node readiness check, that will solve the scheduling problem but will impact the node startup readiness

But can a node really be considered ready if the static pods are not ready?

ndixita commented 1 year ago

/triage accepted

HirazawaUi commented 1 year ago

/cc

k8s-triage-robot commented 1 month ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Close this issue with `/close`

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

alculquicondor commented 1 month ago

/close as duplicate of #115325

k8s-ci-robot commented 1 month ago

@alculquicondor: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/119960#issuecomment-2368852832):

> /close
> as duplicate of #115325

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.