hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: Unable to remove RDS Global Cluster and associated RDS Clusters at one go #39909

Open ktrenchev opened 2 weeks ago

ktrenchev commented 2 weeks ago

Terraform Core Version

0.13.7

AWS Provider Version

4.53.0

Affected Resource(s)

aws_rds_global_cluster aws_rds_cluster

Expected Behavior

I want to be able to delete both the RDS Global Cluster and the associated RDS Clusters with a single terraform destroy invocation.

Actual Behavior

When terraform destroy is called, it behaves as follows:
1) It detaches the replica RDS Cluster from the RDS Global Cluster, which triggers a promotion.
2) Terraform waits for the replica RDS Cluster to be deleted but times out: the replica must first be promoted and only then deleted, and the promotion takes longer than the timeout.
3) The replica RDS Cluster is eventually deleted from AWS, but the terraform destroy operation fails to delete the other RDS Cluster and the RDS Global Cluster.
4) A second run of terraform destroy deletes the leftover RDS Global Cluster and RDS Cluster.

Relevant Error/Panic Output Snippet

waiting for RDS Cluster (XXXXXXX) delete: unexpected state 'promoting', wanted target ''. last error: %!s(<nil>)

Terraform Configuration Files

N/A. The setup is too complicated to extract the exact configuration.

Steps to Reproduce

1) Create a new RDS Global Cluster.
2) Attach an RDS Cluster (primary).
3) Attach an RDS Cluster (replica).
4) Run terraform destroy.
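Since no configuration was shared, the steps above might correspond to something like the following sketch. Every identifier, region, engine, and instance class here is a placeholder assumption rather than the reporter's actual setup; only the provider version is taken from the report. It also assumes a default DB subnet group exists in both regions.

```hcl
# Hypothetical minimal reproduction. All names, regions, the engine, and the
# instance class are placeholders and were not provided by the reporter.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "4.53.0" # version from the report
    }
  }
}

variable "master_password" {
  type = string
}

provider "aws" {
  alias  = "primary"
  region = "us-west-2" # placeholder region
}

provider "aws" {
  alias  = "replica"
  region = "us-east-2" # placeholder region
}

# Step 1: the global cluster.
resource "aws_rds_global_cluster" "example" {
  provider                  = aws.primary
  global_cluster_identifier = "example-global"
  engine                    = "aurora-postgresql"
}

# Step 2: the primary cluster, attached via global_cluster_identifier.
resource "aws_rds_cluster" "primary" {
  provider                  = aws.primary
  cluster_identifier        = "example-primary"
  engine                    = aws_rds_global_cluster.example.engine
  engine_version            = aws_rds_global_cluster.example.engine_version
  global_cluster_identifier = aws_rds_global_cluster.example.id
  master_username           = "exampleuser"
  master_password           = var.master_password
  skip_final_snapshot       = true
}

resource "aws_rds_cluster_instance" "primary" {
  provider           = aws.primary
  identifier         = "example-primary-instance"
  cluster_identifier = aws_rds_cluster.primary.id
  engine             = aws_rds_global_cluster.example.engine
  engine_version     = aws_rds_global_cluster.example.engine_version
  instance_class     = "db.r5.large"
}

# Step 3: the replica cluster in a second region.
resource "aws_rds_cluster" "replica" {
  provider                  = aws.replica
  cluster_identifier        = "example-replica"
  engine                    = aws_rds_global_cluster.example.engine
  engine_version            = aws_rds_global_cluster.example.engine_version
  global_cluster_identifier = aws_rds_global_cluster.example.id
  skip_final_snapshot       = true

  # Create the replica only once the primary has a running instance.
  depends_on = [aws_rds_cluster_instance.primary]
}
```

Step 4 would then be a plain terraform destroy against this configuration.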

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

justinretzolk commented 2 weeks ago

Hey @ktrenchev 👋 Thank you for taking the time to raise this! While we understand Terraform configurations can get pretty complicated, it's often quite difficult to reproduce scenarios like this without any logging or configuration samples. Are you able to provide debug logs (redacted as necessary) if you're unable to provide a configuration as you'd initially indicated?

One thing that came up when taking a quick look at this while triaging was the force_destroy argument of the aws_rds_global_cluster resource, which I believe is meant to help with this scenario. Are you able to confirm whether that argument has been configured?
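For reference, a minimal sketch of where that argument goes. The identifier and engine are placeholders, and whether the setting changes the behavior reported here is not established by this thread.

```hcl
# Hypothetical example; identifier and engine are placeholders.
resource "aws_rds_global_cluster" "example" {
  global_cluster_identifier = "example-global"
  engine                    = "aurora-postgresql"

  # Asks the provider to remove member DB clusters from the global cluster
  # when the global cluster itself is destroyed.
  force_destroy = true
}
```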

ktrenchev commented 2 weeks ago

Greetings @justinretzolk,

Unfortunately I'm unable to provide debug logs. I did play around with the force_destroy argument of the RDS Global Cluster resource, but it had no effect. I dug around cluster.go myself and my best guess is one of the following:
1) Destroying the RDS Global Cluster and its associated RDS Clusters in one go is intentionally unsupported (the AWS docs state something along the lines of "there is no 'one button push' deletion process as RDSs are usually mission critical").
2) The timeout in waitDBClusterDelete() (called in resourceClusterDelete()) is insufficient, because earlier in resourceClusterDelete() RemoveFromGlobalClusterWithContext() is called on the replica and a promotion is triggered.
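If the second guess is right, one thing to try is a longer delete timeout on the replica through the resource's timeouts block. This is an untested sketch: the resource name, identifiers, and the 180m value are placeholders. Note also that the reported error names an unexpected 'promoting' state rather than an elapsed timeout, so a longer timeout may not change the outcome.

```hcl
# Hypothetical example; names and the timeout value are placeholders.
resource "aws_rds_cluster" "replica" {
  cluster_identifier        = "example-replica"
  engine                    = "aurora-postgresql"
  global_cluster_identifier = "example-global"
  skip_final_snapshot       = true

  timeouts {
    # Assumption: allow more time for promotion followed by deletion.
    delete = "180m"
  }
}
```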

I'd be happy with a confirmation that deleting an RDS Global Cluster and its associated RDS Clusters in one go is supported (meaning there is something wrong with my setup, which, unfortunately, is not unlikely).

justinretzolk commented 2 weeks ago

Thanks for the additional information here @ktrenchev 👍 Completely understand re: logging and configuration samples. I'll let someone from the team or community speak to some of the specifics here.

Edit: I had a thought that using a later provider version may help, given that we've migrated most of the provider to use AWS SDK for Go V2. In doing so, I noticed the following in the release notes for 5.24.0:

It may be worth upgrading to at least provider version 5.24.0 and testing again to see if that bug fix resolves your particular issue.
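A version constraint along these lines would pick up that release or anything newer. This is a sketch, not a tested upgrade path. Moving from 4.x to 5.x is a major-version jump and may require other configuration changes.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.24.0" # minimum version suggested above
    }
  }
}
```

After changing the constraint, terraform init -upgrade fetches the newer provider.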

Fadih commented 1 week ago

@justinretzolk do you know what changed? I'm still using the same AWS provider version (5.0.0) as before, but since October 13 it has started failing. I can't upgrade the provider to a newer version because that would require a lot of changes to my Terraform infrastructure.

Fadih commented 1 week ago

Steps to reproduce:
1) Create an AWS global database.
2) Add a cluster in the west region with one instance.
3) Add a replica in the east region with one instance.
4) Try to restack the complete cluster using a snapshot.

You can see that it starts by deleting the instance in the east region. Then, when the east cluster is being promoted out of the global database, it doesn't wait for the promotion to finish and starts the deletion right away, so it fails with: Error: waiting for RDS Cluster (xxxx-dr-global-region-us-east-2-cluster) delete: unexpected state 'promoting', wanted target ''. last error: %!s(<nil>)

Fadih commented 1 week ago

@justinretzolk I already have force_destroy set on the aws_rds_global_cluster resource and it still happens.