Closed DominicRoyStang closed 2 months ago
Hey @DominicRoyStang, so sorry you're running into this. I'll have a look into this! Thanks a lot for the extensive repro!
After digging into this, I found that this is by design.
This happens when Lambda Functions are attached to VPCs. The AWS Lambda service creates and manages an ENI in order for your function to access resources within the VPC. This ENI is not visible to IaC tools like Pulumi and its lifecycle gets managed by AWS Lambda.
AWS Lambda does not delete the ENIs right away when a function is deleted; instead, it aims for a P99 deletion time of ~35 minutes. Because of this, the provider has a minimum timeout of 45 minutes configured for cleaning up the Lambda ENIs associated with a security group: https://github.com/hashicorp/terraform-provider-aws/blob/be7be81584b81d7b40c0a317882d6666da4444a3/internal/service/ec2/vpc_network_interface.go#L1485-L1489.
If you cannot wait that long for AWS Lambda to clean up after itself, you could switch the security group the Lambda function is using before deleting it, but that will only work if you manage the VPC in a separate stack (otherwise you'll be blocked on the deletion of that security group instead). You could even fully automate this by using the pulumi command provider to execute the security group switch on destroy using the AWS CLI.
We should definitely update the docs though. While they mention that security groups associated with Lambda functions can take up to 45 mins to delete, they don't mention that this cannot be changed by specifying a different timeout. I'll get the docs for this updated!
@flostadler thanks for looking into this so quickly and for the detailed explanation.
> AWS Lambda does not delete the ENIs right away when a function is deleted; instead, it aims for a P99 deletion time of ~35 minutes.
I can see why this would override the default timeout, but I'm still not fully understanding why this prevents me from forcing the program to stop execution if deleting a security group is taking more than 2 minutes. Would the delete operation not continue on AWS if the `pulumi destroy` execution has halted due to a custom timeout?
> If you cannot wait that long for AWS Lambda to clean up after itself, you could switch the security group the Lambda function is using before deleting it, but that will only work if you manage the VPC in a separate stack (otherwise you'll be blocked on the deletion of that security group instead). You could even fully automate this by using the pulumi command provider to execute the security group switch on destroy using the AWS CLI.
Thankfully, the VPC is indeed managed in a separate stack in my case. How could I do this security group swap on destroy with Pulumi (preferably without using the AWS CLI)?
> We should definitely update the docs though. While they mention that security groups associated with Lambda functions can take up to 45 mins to delete, they don't mention that this cannot be changed by specifying a different timeout. I'll get the docs for this updated!
Agreed. It actually states the opposite (see screenshot). Note that the example also doesn't even include a custom timeout 🤔
@DominicRoyStang
> I can see why this would override the default timeout, but I'm still not fully understanding why this prevents me from forcing the program to stop execution if deleting a security group is taking more than 2 minutes. Would the delete operation not continue on AWS if the pulumi destroy execution has halted due to a custom timeout?
The deletion of Security Groups is highly customized in the provider. Multiple AWS API calls are necessary to get this cleaned up (source code ref). What's happening under the hood within the Security Group Deletion is:
If you now stop the deletion after 2 minutes, it would still be waiting for Lambda to free up the ENI (step 2a). No delete API call has been sent to AWS at this point, because it would fail.
Without this longer timeout for waiting on the Lambda service to release the ENI, delete operations would always fail for Security Groups used with Lambda Functions. That's why the user-defined timeout isn't taken into account for this cleanup step.
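To make the ordering concrete, here is a minimal, purely illustrative model of the behavior described above. It is not the provider's code: `MIN_LAMBDA_ENI_WAIT_MS`, `effectiveEniWaitMs`, and `simulateSecurityGroupDelete` are hypothetical names, and the ENI release time is passed in as a plain number instead of being polled from AWS.

```typescript
// Illustrative model only; not the provider's implementation.
// Assumption: the wait for Lambda ENIs is raised to at least 45 minutes,
// as described earlier in this thread.
const MIN_LAMBDA_ENI_WAIT_MS = 45 * 60 * 1000;

// Hypothetical helper: the wait actually used for the ENI cleanup step.
function effectiveEniWaitMs(userTimeoutMs: number): number {
  return Math.max(userTimeoutMs, MIN_LAMBDA_ENI_WAIT_MS);
}

type DeleteOutcome = "deleted" | "timed-out-before-delete-call";

// Hypothetical simulation: the delete call is only sent once Lambda has
// released the ENI. If the wait ends first, nothing was sent to AWS.
function simulateSecurityGroupDelete(
  waitMs: number,
  eniReleasedAfterMs: number,
): DeleteOutcome {
  return eniReleasedAfterMs <= waitMs
    ? "deleted"
    : "timed-out-before-delete-call";
}

const twoMinutesMs = 2 * 60 * 1000;
const eniReleasedAfterMs = 20 * 60 * 1000; // example: ENI freed after 20 minutes

// With a hard 2-minute cap, the wait ends before any delete call is made.
console.log(simulateSecurityGroupDelete(twoMinutesMs, eniReleasedAfterMs));
// With the 45-minute floor applied, the ENI is released in time.
console.log(
  simulateSecurityGroupDelete(effectiveEniWaitMs(twoMinutesMs), eniReleasedAfterMs),
);
```

In this model, stopping after 2 minutes leaves nothing in flight on the AWS side, which is why halting the destroy early would not let the deletion continue in the background.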
> Thankfully, the VPC is indeed managed in a separate stack in my case. How could I do this security group swap on destroy with Pulumi (preferably without using the AWS CLI)?
A fully automated approach will need to use the AWS CLI with the pulumi command provider to execute something like this. Caveat: this might need some tweaks; I haven't had a chance to test it yet:
const lambdaSgCleanup = new local.Command("lambda-sg-cleanup", {
    delete: pulumi.interpolate`aws lambda update-function-configuration --function-name ${LAMBDA_FUNCTION_NAME} --vpc-config SecurityGroupIds=${DEFAULT_SG_ID}`,
});
Docs changes
I was actually referring to this part of the docs that explains that Security Groups associated with Lambdas can take up to 45 mins to destroy.
But you're right this other example needs fixing as well, thanks for bringing this up!
@DominicRoyStang I updated the docs to point this out. Please don't hesitate to reach out if you have any other questions.
Thanks for the detailed answer and for updating docs @flostadler!
I tried implementing the security group cleanup using the command provider, but I can't seem to get it working. The security groups still take 20+ minutes to destroy.
My implementation
new local.Command('lambda-security-group-cleanup', {
    // Adding sleep 30 to sleep for 30 seconds didn't help (I thought maybe there was some propagation issue)
    delete: pulumi.interpolate`aws lambda update-function-configuration --function-name ${lambdaFunction.name} --vpc-config SecurityGroupIds=${DEFAULT_SECURITY_GROUP_ID} && sleep 30`,
}, {
    // Adding dependencies ensures that the replacement of lambda security group with the default security group
    // happens before the deletion of the lambda security groups,
    // but it still takes 20+ minutes to delete the lambda security groups, unfortunately
    dependsOn: [lambdaFunction, securityGroup]
})
I found out today that terraform added `replace_security_groups_on_destroy` to the AWS lambda function resource: https://github.com/hashicorp/terraform-provider-aws/issues/10329#issuecomment-1425914496
Looks like this was added to the `@pulumi/aws` provider as `replacementSecurityGroupIds` recently as well 🎉
This seems to be working, though there are still occasional cases where, even with that setting enabled, my security groups take 20 minutes to destroy. This is probably as good as it can get for now, which is still pretty good.
@DominicRoyStang Ah that's a good find! Didn't find it initially because it got deprecated once and then re-implemented. That's definitely preferable over the hack with the command provider!
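For readers landing here, a minimal sketch of how that setting might be wired up. The property names `replaceSecurityGroupsOnDestroy` and `replacementSecurityGroupIds` are the `@pulumi/aws` counterparts of the Terraform arguments mentioned above; the helper `securityGroupSwapOnDestroy`, its interface, and the example IDs are hypothetical. Since the feature was deprecated once and then re-implemented, check the provider docs for its current status.

```typescript
// Sketch only. The two property names mirror arguments on aws.lambda.Function;
// the helper and interface below are hypothetical conveniences.
interface SecurityGroupSwapOnDestroy<T> {
  replaceSecurityGroupsOnDestroy: boolean;
  replacementSecurityGroupIds: T[];
}

// Builds the argument fragment that asks the provider to move the function's
// ENIs onto the given security groups before the function is destroyed.
// T is generic so that either plain strings or Pulumi outputs can be passed.
function securityGroupSwapOnDestroy<T>(
  replacementIds: T[],
): SecurityGroupSwapOnDestroy<T> {
  return {
    replaceSecurityGroupsOnDestroy: true,
    replacementSecurityGroupIds: [...replacementIds],
  };
}

// Intended use (assumes a Pulumi program with @pulumi/aws installed and
// hypothetical variables for the other arguments):
//
//   new aws.lambda.Function("my-function", {
//     ...otherFunctionArgs,
//     vpcConfig: { subnetIds, securityGroupIds: [lambdaSecurityGroup.id] },
//     ...securityGroupSwapOnDestroy([defaultSecurityGroupId]),
//   });

console.log(securityGroupSwapOnDestroy(["sg-0123456789abcdef0"]));
```

As noted in the thread, this seems to help in most cases but not all: occasional destroys still took around 20 minutes with the setting enabled.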
This issue has been addressed in PR #4392 and shipped in release v6.51.0.
Describe what happened
When setting `customTimeouts: { delete: '2m' }` on an `aws.ec2.SecurityGroup`, the timeout value is not respected at destroy time.
Opening this issue after a discussion on https://github.com/pulumi/pulumi-terraform-bridge/issues/1652#issuecomment-2284188134
Sample program
https://github.com/DominicRoyStang/pulumi-delete-timeout-bug-report
Instructions on the README
Log output
N/A
Affected Resource(s)
aws.ec2.SecurityGroup
Output of `pulumi about`
```
CLI
Version      3.129.0
Go Version   go1.22.6
Go Compiler  gc

Plugins
KIND      NAME    VERSION
resource  aws     6.48.0
resource  docker  4.5.4
language  nodejs  unknown

Host
OS       darwin
Version  14.6.1
Arch     arm64

This project is written in nodejs: executable='/Users/dom/.asdf/shims/node' version='v20.12.2'

Current Stack: organization/pulumi-bug-report/dev-pulumi-bug-report

Found no resources associated with dev-pulumi-bug-report

Found no pending operations associated with dev-pulumi-bug-report

Backend
Name           C0H6J5
URL            s3://REDACTED
User           dom
Organizations
Token type     personal

Pulumi locates its logs in /var/folders/0f/spy7dccj3dq73_tx646_68g40000gp/T/ by default
warning: Failed to get information about the Pulumi program's dependencies: failed to run "/Users/dom/.asdf/shims/yarn list --json": exit status 1
```
Additional context
Further details on the repro repo's README.
As noted there, it might take a few tries to get a destroy operation that takes more than 2 minutes. For me, it happens roughly every second try.
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).