pulumi / pulumi

Pulumi - Infrastructure as Code in any programming language 🚀
https://www.pulumi.com
Apache License 2.0
21.5k stars 1.1k forks

Deleting `pendingDeletes` at beginning of deployment leads to stuck states #2948

Closed lukehoban closed 1 year ago

lukehoban commented 5 years ago

Today, we process any pendingDeletes at the beginning of a deployment. This is not "correct".

Two examples:

First, a program with a VPC and an Instance. A change causes the VPC to be replaced, and the Instance fails to create. This leads to a newly created VPC and a pending-delete VPC. On the next update, we try to flush the pending deletes, meaning we try to delete the old VPC. This fails, because the Instance is still running in the old VPC. It is only "correct" to delete the old VPC at the end of the deployment, after all other repercussions of the replacement have been processed.

Second, a Kubernetes Provider and a Kubernetes Resource. A change causes the Kubernetes Provider to be replaced, but the Kubernetes Resource fails to create. This leads to a newly created Provider in the checkpoint, and a pending delete Provider in the checkpoint. On the next update, we successfully delete the pending delete Provider from the checkpoint. However, now all of the references in the checkpoint have provider references to a provider which does not exist. When we try to process the recreate of the Kubernetes Resource, it fails with a message like:

resource urn:pulumi:ds-dog-k8s-dev::sg-deploy-k8s-helper::kubernetes:core/v1:Secret::langserver-auth refers to unknown provider urn:pulumi:ds-dog-k8s-dev::sg-deploy-k8s-helper::pulumi:providers:kubernetes::dogfood-full-k8s::3a90eb1d-d8d5-4272-ae29-300c34caaab9

To be correct, I believe we will need to postpone pending deletes to the end of the deployment.
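For reference, resources awaiting deletion are recorded in the checkpoint with a `"delete": true` flag, so you can list them from an exported state. A minimal sketch, assuming the standard `pulumi stack export` JSON layout (`pending_deletes` is a hypothetical helper, not part of any Pulumi SDK):

```python
import json

def pending_deletes(state: dict) -> list[str]:
    """Return the URNs of resources the engine still intends to delete.

    In an exported checkpoint, such resources carry "delete": true.
    """
    return [
        r["urn"]
        for r in state["deployment"]["resources"]
        if r.get("delete")
    ]

# Usage (after `pulumi stack export -s STACK > stack.json`):
# with open("stack.json") as f:
#     print(pending_deletes(json.load(f)))
```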

lukehoban commented 5 years ago

Another example that is likely related - from https://pulumi-community.slack.com/archives/C84L4E3N1/p1563557008032100:

Just trying to get the initial cluster set up, and made some silly mistakes (set subnets to public, not private). But trying to make changes to the cluster config is crazy. It tries to replace the cluster, but then gets stuck since it can't delete the resources for the now-deleted cluster.

Deleting everything now fails with dial tcp: lookup xxx.gr7.us-east-1.eks.amazonaws.com: no such host

pgavlin commented 5 years ago

We've decided that this is too risky a change to take at this point in Q3. We will fix this ASAP post-1.0.

mdcuk34 commented 3 years ago

Another example that is likely related - from https://pulumi-community.slack.com/archives/C84L4E3N1/p1563557008032100:

Just trying to get the initial cluster set up, and made some silly mistakes (set subnets to public, not private). But trying to make changes to the cluster config is crazy. It tries to replace the cluster, but then gets stuck since it can't delete the resources for the now-deleted cluster.

Deleting everything now fails with dial tcp: lookup xxx.gr7.us-east-1.eks.amazonaws.com: no such host

What's the recommended solution to get out of this weird state? I'm having similar issues to those described in the Slack message, but I can't see the responses due to the 10,000-message limit. Error log below:


     Type                      Name                                    Status                  Info
     pulumi:pulumi:Stack       xxx-xxx-xxx-service-dev  **failed**              1 error
 -   ├─ aws:ec2:SecurityGroup  xxx-xxx-dev                        **deleting failed**     1 error
 -   └─ aws:lb:TargetGroup     xxx-xxx-dev                          **deleting failed**     1 error

Diagnostics:
  pulumi:pulumi:Stack (xxx-xxx-xxx-service-dev):
    error: update failed

  aws:lb:TargetGroup (xxx-xxx):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::aws:lb:ApplicationLoadBalancer$awsx:lb:ApplicationTargetGroup$aws:lb/targetGroup:TargetGroup::xxx-targetdev: 1 error occurred:
        * Error deleting Target Group: ResourceInUse: Target group 'arn:aws:elasticloadbalancing:eu-west-1:675965213304:targetgroup/xxx-targetdev-74e5679/2fa26820b86b102b' is currently in use by a listener or a rule
        status code: 400, request id: 72e44b5c-f97a-4f80-9f04-0bece5688359

  aws:ec2:SecurityGroup (xxx-cluster-dev):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::awsx:x:ecs:Cluster$awsx:x:ec2:SecurityGroup$aws:ec2/securityGroup:SecurityGroup::xxx-cluster-dev: 1 error occurred:
        * Error deleting security group: DependencyViolation: resource sg-07d619669ce3f4793 has a dependent object
        status code: 400, request id: 47918f5f-1a1a-44be-9772-32a6e73167aa
blampe commented 2 years ago

This would be a great quality of life improvement! I've run into both of the problems Luke mentioned in the description.

lukehoban commented 2 years ago

Another member of the internal team hit this today.

Their first update did the create side of a replacement of a LaunchConfiguration.

++  aws:ec2:LaunchConfiguration ecsClusterInstanceLaunchConfiguration create-replacement

Then the update failed, for a legitimate reason.

The next update they did failed almost immediately with:

ecsClusterInstanceLaunchConfiguration (aws:ec2:LaunchConfiguration)
completing deletion from previous update

error: deleting urn:pulumi:kimberley::pulumi-service::aws:ec2/launchConfiguration:LaunchConfiguration::ecsClusterInstanceLaunchConfiguration: 1 error occurred:
    * error deleting Autoscaling Launch Configuration (ecsClusterInstanceLaunchConfiguration-13f7e0f): ResourceInUse: Cannot delete launch configuration ecsClusterInstanceLaunchConfiguration-13f7e0f because it is attached to AutoScalingGroup autoScalingGroupStack-4a63cb8-Instances-L4JB1QE2ZJ6J
    status code: 400, request id: 88a8a416-5fd0-48a9-9d5d-52358c77e2df
jonasgroendahl commented 2 years ago

Another example that is likely related - from https://pulumi-community.slack.com/archives/C84L4E3N1/p1563557008032100:

Just trying to get the initial cluster set up, and made some silly mistakes (set subnets to public, not private). But trying to make changes to the cluster config is crazy. It tries to replace the cluster, but then gets stuck since it can't delete the resources for the now-deleted cluster.

Deleting everything now fails with dial tcp: lookup xxx.gr7.us-east-1.eks.amazonaws.com: no such host

What's the recommended solution to get out of this weird state? I'm having similar issues to those described in the Slack message, but I can't see the responses due to the 10,000-message limit. Error log below:


     Type                      Name                                    Status                  Info
     pulumi:pulumi:Stack       xxx-xxx-xxx-service-dev  **failed**              1 error
 -   ├─ aws:ec2:SecurityGroup  xxx-xxx-dev                        **deleting failed**     1 error
 -   └─ aws:lb:TargetGroup     xxx-xxx-dev                          **deleting failed**     1 error

Diagnostics:
  pulumi:pulumi:Stack (xxx-xxx-xxx-service-dev):
    error: update failed

  aws:lb:TargetGroup (xxx-xxx):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::aws:lb:ApplicationLoadBalancer$awsx:lb:ApplicationTargetGroup$aws:lb/targetGroup:TargetGroup::xxx-targetdev: 1 error occurred:
      * Error deleting Target Group: ResourceInUse: Target group 'arn:aws:elasticloadbalancing:eu-west-1:675965213304:targetgroup/xxx-targetdev-74e5679/2fa26820b86b102b' is currently in use by a listener or a rule
      status code: 400, request id: 72e44b5c-f97a-4f80-9f04-0bece5688359

  aws:ec2:SecurityGroup (xxx-cluster-dev):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::awsx:x:ecs:Cluster$awsx:x:ec2:SecurityGroup$aws:ec2/securityGroup:SecurityGroup::xxx-cluster-dev: 1 error occurred:
      * Error deleting security group: DependencyViolation: resource sg-07d619669ce3f4793 has a dependent object
      status code: 400, request id: 47918f5f-1a1a-44be-9772-32a6e73167aa

Got exactly this error too.

I merely changed some VPC settings, Pulumi decided it was time to delete the target group, and now I can't get rid of the "completing deletion from previous update..." state.

parryian commented 2 years ago

Any suggested workarounds for this issue?

ralvarez-globant commented 2 years ago

Same here... I tried manually removing the resource from the stack, to no avail. Now my stack has two identical resources...

    Do you want to perform this update? yes
    Updating (CLIENT/ENV)

    View Live: https://app.pulumi.com/CLIENT/STACK/ENV/updates/NN

         Type                 Name           Status      Info
         pulumi:pulumi:Stack  RESOURCE_NAME  **failed**  1 error

    Diagnostics:
      gcp:projects:IAMMember (BINDING_NAME):
        error: unable to find required configuration setting: GCP Project
        Set the GCP Project by using: pulumi config set gcp:project <project>

    Resources:

    Duration: 2s

ralvarez-globant commented 2 years ago

Ok, found a workaround. It's not pretty, but it does the job:

  1. Export your current state (and back it up)
  2. Back up your stack (just in case)
  3. Modify your stack.json (vim or whatever editor you choose). Just make sure to remove the source of the conflict (remove your conflicting resources and manually clean up the corresponding infrastructure)
  4. Make sure you are configuring the proper stack
  5. Import the modified stack
  6. Refresh and continue from where you left off

    pulumi stack export -s STACK > stack.json
    cp stack.json stack.json.origin
    vi stack.json
    pulumi stack select STACK
    pulumi stack import < stack.json
    pulumi up
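Step 3 can also be scripted instead of hand-edited, since pending-delete resources carry a `"delete": true` flag in the exported state. A rough sketch, assuming the standard exported-state layout and assuming you have already cleaned up the corresponding cloud resources by hand (`strip_pending_deletes` is a made-up name, not a Pulumi API):

```python
import json

def strip_pending_deletes(state: dict) -> dict:
    """Drop every resource flagged "delete": true from an exported state.

    Only safe if the underlying cloud resources are already gone, since
    Pulumi will no longer try to delete them afterwards.
    """
    resources = state["deployment"]["resources"]
    state["deployment"]["resources"] = [
        r for r in resources if not r.get("delete")
    ]
    return state

# Usage, between `pulumi stack export` and `pulumi stack import`:
# with open("stack.json") as f:
#     state = json.load(f)
# with open("stack.json", "w") as f:
#     json.dump(strip_pending_deletes(state), f, indent=2)
```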
solomonshorser commented 1 year ago

@ralvarez-globant You say:

3. Modify your stack.json (vim or whatever editor you choose). Just make sure to remove the source of the conflict (remove your conflicting resources and manually clean up your infrastructure)

Did you just remove the problematic resource itself? I would imagine that you need to remove any other resources that reference it as a dependency, as well.
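That concern can be made concrete: anything that names the removed resource as its parent, provider, or dependency has to go too, otherwise the checkpoint ends up with dangling references like the "refers to unknown provider" error earlier in this thread. A hypothetical sketch of the transitive removal, assuming the standard exported-state field names (`remove_with_dependents` is an invented helper):

```python
def remove_with_dependents(state: dict, bad_urn: str) -> dict:
    """Drop bad_urn and, transitively, every resource that references it
    via its parent, provider, or dependencies fields."""
    resources = state["deployment"]["resources"]
    doomed = {bad_urn}
    changed = True
    while changed:  # iterate to a fixed point to catch transitive dependents
        changed = False
        for r in resources:
            if r["urn"] in doomed:
                continue
            refs = set(r.get("dependencies", []))
            if r.get("parent"):
                refs.add(r["parent"])
            # Provider references look like "<urn>::<uuid>"; strip the id suffix.
            if r.get("provider"):
                refs.add(r["provider"].rsplit("::", 1)[0])
            if refs & doomed:
                doomed.add(r["urn"])
                changed = True
    state["deployment"]["resources"] = [
        r for r in resources if r["urn"] not in doomed
    ]
    return state
```

As with the manual edit, the matching cloud resources still have to be cleaned up by hand; this only repairs the checkpoint.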