kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.

http://cluster-api-aws.sigs.k8s.io/

Apache License 2.0

Make the delete reconcile loop more robust to errors #1335

Open erwinvaneyk opened 5 years ago

erwinvaneyk commented 5 years ago

/kind bug

What steps did you take and what happened:

I created a CAPA cluster using an account that was missing a required permission (ELB).
The controller provisioned parts of the cluster, until it tried to deploy the ELB.
I tried to delete the cluster.
The controller then became stuck deleting the cluster, because it still lacked that permission.
Other cluster components (e.g. VPC) remain deployed and cannot be deleted without manual intervention.

What did you expect to happen: Although this specific issue comes down to a misconfiguration on my part, the same problem would seem to occur for any non-transient error during cluster deployment.

So, I would expect two things to happen:

The controller should try to delete all components, regardless of whether some fail to be deleted.
The controller should not fail trying to delete components that it did not create in the first place.

Environment:

Cluster-api-provider-aws version: v0.4.3
Kubernetes version: (use kubectl version): v1.16.2

If this is an actual issue that is within the scope of capa, I would be happy to contribute a patch myself. 🙂

detiber commented 5 years ago

I think it is probably okay to continue with deletion, skipping over resources that we do not have permissions to delete, assuming that we also attempt to describe the resource first.

It's probably a safe bet that if we lack permissions to describe or delete the resource, then we most likely lacked the permissions to create the resource and the chance of orphaning a resource would be slim to none.

This might get a bit tricky around some of the resources that we manage through transitive dependencies of other resources, so it might require some special handling on a case by case basis.

ncdc commented 4 years ago

@randomvariable please add some info on the dependency ordering of AWS components

joonas commented 4 years ago

@randomvariable bump

ncdc commented 4 years ago

Trying to de-scope v0.5. Moved to Next.

randomvariable commented 4 years ago

Definitely next. Quite a bit of refactoring to be done to make this happen.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

detiber commented 4 years ago

/lifecycle frozen

richardcase commented 2 years ago

/remove-lifecycle frozen

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

richardcase commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

/lifecycle stale

k8s-triage-robot commented 1 year ago

/lifecycle stale

k8s-triage-robot commented 1 year ago

/lifecycle rotten