kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

AWSMachinePool does not drain nodes during scale-in #2023

Open dthorsen opened 4 years ago

dthorsen commented 4 years ago

/kind bug

What steps did you take and what happened:

This caused the AWSMachineController to set the DesiredInstances in the ASG to 3 without draining nodes at all. The PDB was not honored, and the EC2 instances were terminated by the ASG immediately.

What did you expect to happen: The nodes should have drained gracefully before the EC2 instances are terminated.

Anything else you would like to add: In the current AWSMachinePool implementation, instance selection for scale-in is performed by the AutoScalingGroup. In the non-cluster-autoscaler case, this could be fixed by modifying the AWSMachinePool controller to select the nodes for scale-in, drain them, and then call the AWS TerminateInstanceInAutoScalingGroup action with the request value ShouldDecrementDesiredCapacity: true.
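The flow proposed above (select, drain, then terminate with a capacity decrement) could be sketched roughly as follows. This is an illustration only, not CAPA code: `FakeASG`, `pick_victims`, `scale_in`, and the `drain` callable are hypothetical names, and the selection policy is arbitrary. The one AWS detail assumed is the `TerminateInstanceInAutoScalingGroup` action, modeled here with the keyword arguments boto3 uses for `terminate_instance_in_auto_scaling_group`.

```python
from typing import Callable, List


class FakeASG:
    """Hypothetical stand-in for an Auto Scaling client (no AWS calls are made)."""

    def __init__(self, instances: List[str]):
        self.instances = list(instances)
        self.desired = len(instances)

    def terminate_instance_in_auto_scaling_group(
        self, InstanceId: str, ShouldDecrementDesiredCapacity: bool
    ) -> None:
        self.instances.remove(InstanceId)
        if ShouldDecrementDesiredCapacity:
            self.desired -= 1


def pick_victims(instances: List[str], target: int) -> List[str]:
    """Hypothetical selection policy: take the excess from the end of the list."""
    excess = max(len(instances) - target, 0)
    return instances[len(instances) - excess:]


def scale_in(asg, target: int, drain: Callable[[str], bool]) -> List[str]:
    """Drain each selected instance, then terminate it and decrement capacity.

    An instance whose drain fails (for example, blocked by a PDB) is left running.
    """
    removed = []
    for instance_id in pick_victims(asg.instances, target):
        if not drain(instance_id):
            continue  # leave the node in place; a real controller would requeue
        asg.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id, ShouldDecrementDesiredCapacity=True
        )
        removed.append(instance_id)
    return removed


# Scale from 5 to 3; the drain of "i-e" is simulated as blocked.
asg = FakeASG(["i-a", "i-b", "i-c", "i-d", "i-e"])
removed = scale_in(asg, target=3, drain=lambda instance_id: instance_id != "i-e")
print(removed, asg.desired)
```

Because the desired capacity only drops when a drained instance is actually terminated, a blocked drain leaves the group one instance above the target rather than letting the ASG pick and kill a node on its own.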

We may also want to consider a lifecycle hook on the autoscaling group that prevents EC2 instance termination until the drain completes. This would help prevent instances from being forcibly terminated without draining when the DesiredInstances value is changed via the EC2 console, CLI, or APIs.
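A rough sketch of how a termination lifecycle hook handler might look is below. It assumes the event carries the fields used in Auto Scaling lifecycle notifications (`LifecycleTransition`, `EC2InstanceId`, `AutoScalingGroupName`, `LifecycleHookName`) and that the hook is released with `CompleteLifecycleAction`, modeled on boto3's `complete_lifecycle_action` keyword arguments. `RecordingASG`, `handle_lifecycle_event`, the `drain` callable, and the hook and group names are hypothetical.

```python
from typing import Callable, Dict, List

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


class RecordingASG:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.completed: List[Dict[str, str]] = []

    def complete_lifecycle_action(self, **kwargs: str) -> None:
        self.completed.append(kwargs)


def handle_lifecycle_event(
    asg, event: Dict[str, str], drain: Callable[[str], bool]
) -> bool:
    """Drain the node behind a terminating instance, then release the hook.

    Returns True if the hook was completed. Returns False if the event was
    ignored or the drain has not finished, in which case the instance stays
    in its wait state until the hook's timeout expires.
    """
    if event.get("LifecycleTransition") != TERMINATING:
        return False
    instance_id = event["EC2InstanceId"]
    if not drain(instance_id):
        return False
    asg.complete_lifecycle_action(
        LifecycleHookName=event["LifecycleHookName"],
        AutoScalingGroupName=event["AutoScalingGroupName"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return True


asg = RecordingASG()
event = {
    "LifecycleTransition": TERMINATING,
    "EC2InstanceId": "i-0123",
    "AutoScalingGroupName": "pool-a",
    "LifecycleHookName": "drain-before-terminate",
}
released = handle_lifecycle_event(asg, event, drain=lambda instance_id: True)
print(released, asg.completed[0]["LifecycleActionResult"])
```

Unlike the controller-driven flow, this path also covers capacity changes made outside Cluster API, since the hook fires regardless of who lowered the desired capacity.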

Environment:

- Cluster-api-provider-aws version: Commit: 3338cd4
- Kubernetes version: (use kubectl version): v.1.17.9
- OS (e.g. from /etc/os-release): Amazon Linux 2

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

randomvariable commented 3 years ago

From chatting with @sedefsavas: AWS Node Termination Handler ( https://github.com/aws/aws-node-termination-handler ) can help, but doesn't fully eliminate the problem, since it only gives a 2-minute warning.

Sync with CAPZ on MachinePool v.Next

@kschumy , any ideas on what we should do here?

sedefsavas commented 3 years ago

We can follow an approach similar to OpenShift's POC, which polls the termination endpoint: https://github.com/openshift/cluster-api-provider-aws/blob/b4a3478db44ddb554883cf77a9e5f49ffd54fdf4/pkg/termination/handler.go

More on this is discussed in the cluster-api proposal: https://github.com/kubernetes-sigs/cluster-api/pull/3528
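For illustration, polling a termination endpoint could look roughly like the sketch below. It is not taken from the linked handler. It assumes the endpoint is the EC2 instance metadata path for spot termination notices, which returns a non-200 status until termination is scheduled. `wait_for_termination` and its parameters are hypothetical, and the HTTP call is injected so the example runs without network access.

```python
import time
from typing import Callable, Optional

# Assumed endpoint: EC2 instance metadata path for spot termination notices.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def wait_for_termination(
    fetch_status: Callable[[str], int],
    on_terminating: Callable[[], None],
    interval_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the endpoint returns HTTP 200, then run the callback once.

    fetch_status takes a URL and returns an HTTP status code. Returns False
    if max_polls is exhausted before a termination notice appears.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if fetch_status(TERMINATION_URL) == 200:
            on_terminating()
            return True
        polls += 1
        sleep(interval_seconds)
    return False


# Simulated metadata service: two 404 responses, then a termination notice.
responses = iter([404, 404, 200])
events = []
notified = wait_for_termination(
    fetch_status=lambda url: next(responses),
    on_terminating=lambda: events.append("drain-started"),
    max_polls=10,
    sleep=lambda seconds: None,
)
print(notified, events)
```

In a real handler the callback would trigger a node drain or mark the machine for deletion, and the drain has to fit inside whatever warning window the notice provides.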

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2023#issuecomment-824525754):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

richardcase commented 1 year ago

/reopen
/remove-lifecycle rotten

richardcase commented 1 year ago

From office hours 2023-04-03:

/triage accepted
/priority important-soon

dlipovetsky commented 1 year ago

Also from office hours discussion:

Users define Pod Disruption Budgets to limit how many of their Pods can be voluntarily disrupted at the same time.

A scale-in of a MachinePool, if it uses the "providers refresh", will always proceed, even if it violates a budget.

For comparison, a scale-in of a MachineDeployment will never proceed if it violates a budget.
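The contrast above comes down to whether node removal goes through the eviction path, which consults the budget. A deliberately simplified model of that check, assuming a budget expressed only as an integer `minAvailable` (the function names are hypothetical):

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """How many pods may be voluntarily disrupted right now under the budget."""
    return max(healthy_pods - min_available, 0)


def can_evict(healthy_pods: int, min_available: int) -> bool:
    """An eviction is admitted only if at least one disruption is allowed."""
    return disruptions_allowed(healthy_pods, min_available) > 0


# 3 healthy pods with minAvailable=3: a drain is blocked at the eviction step.
print(can_evict(healthy_pods=3, min_available=3))
# 4 healthy pods with minAvailable=3: one pod can be evicted.
print(can_evict(healthy_pods=4, min_available=3))
```

Terminating the instance directly never asks this question, which is why a scale-in performed by the ASG can take pods below the budget.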

k8s-triage-robot commented 1 year ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten