ktrenchev commented 2 weeks ago

Terraform Core Version

0.13.7

AWS Provider Version

4.53.0

Affected Resource(s)

aws_rds_global_cluster aws_rds_cluster

Expected Behavior

I want to be able to delete both the RDS Global Cluster and the associated RDS Clusters with a single terraform destroy invocation.

Actual Behavior

When terraform destroy is called:

1) It detaches the replica RDS Cluster from the RDS Global Cluster, which triggers a promotion.
2) Terraform waits for the replica RDS Cluster to be deleted but times out: the replica must first be promoted and only then deleted, and the promotion takes longer than the timeout.
3) The replica RDS Cluster is eventually deleted from AWS, but the terraform destroy operation fails to delete the other RDS Cluster and the RDS Global Cluster.
4) A second run of terraform destroy deletes the leftover RDS Global Cluster and RDS Cluster.

Relevant Error/Panic Output Snippet

waiting for RDS Cluster (XXXXXXX) delete: unexpected state 'promoting', wanted target ''. last error: %!s(<nil>)
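
For readers unfamiliar with this message shape, here is a minimal Go sketch of how such a string can come about. It is not the provider's code: the helper name unexpectedStateMessage and its parameters are hypothetical, and the format string is simply copied from the error above. What it illustrates is that an empty target list renders as '' (the waiter expects the cluster to be gone) and that formatting a nil error with %s makes Go print %!s(<nil>).

```go
package main

import (
	"fmt"
	"strings"
)

// unexpectedStateMessage is a hypothetical helper that mimics the shape of the
// error quoted in this issue. It is an illustration, not the provider's code.
func unexpectedStateMessage(id, state string, targets []string, lastErr error) string {
	return fmt.Sprintf(
		"waiting for RDS Cluster (%s) delete: unexpected state '%s', wanted target '%s'. last error: %s",
		id,
		state,
		strings.Join(targets, ", "), // no target states: renders as an empty string
		lastErr,                     // a nil error formatted with %s renders as %!s(<nil>)
	)
}

func main() {
	fmt.Println(unexpectedStateMessage("XXXXXXX", "promoting", nil, nil))
}
```

Read this way, the message says that the delete waiter saw the status 'promoting', which it did not list as an acceptable intermediate state, and that no underlying API error was recorded.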

Terraform Configuration Files

N/A, setup is way too complicated to extract the exact configuration.

Steps to Reproduce

1) Create a new RDS Global Cluster. 2) Attach an RDS Cluster (primary). 3) Attach an RDS Cluster (replica). 4) Run terraform destroy.

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

github-actions[bot] commented 2 weeks ago

Community Note

Voting for Prioritization

Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
Please see our prioritization guide for information on how we prioritize.
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

If you are interested in working on this issue, please leave a comment.
If this would be your first contribution, please review the contribution guide.

justinretzolk commented 2 weeks ago

Hey @ktrenchev 👋 Thank you for taking the time to raise this! While we understand Terraform configurations can get pretty complicated, it's often quite difficult to reproduce scenarios like this without any logging or configuration samples. Are you able to provide debug logs (redacted as necessary) if you're unable to provide a configuration as you'd initially indicated?

One thing that came up when taking a quick look at this while triaging was the force_destroy argument of the aws_rds_global_cluster resource, which I believe is meant to help with this scenario. Are you able to confirm whether that argument has been configured?

ktrenchev commented 2 weeks ago

Greetings @justinretzolk!

Unfortunately I'm unable to provide debug logs. I did play around with the force_destroy argument of the RDS Global Cluster resource, but it had no effect. I dug around cluster.go myself, and my best guess is one of the following:

1) Destroying the RDS Global Cluster and its associated RDS Clusters in one go is intentionally unsupported (the AWS docs state something along the lines of "there is no 'one button push' deletion process as RDSs are usually mission critical").
2) The timeout in waitDBClusterDelete() (called in resourceClusterDelete()) is insufficient, because earlier in resourceClusterDelete() RemoveFromGlobalClusterWithContext() is called on the replica, which triggers a promotion.

I'll be happy with a confirmation that deleting an RDS Global Cluster and its associated RDS Clusters in one go is supported (meaning there is something wrong with my setup, which, unfortunately, is not unlikely).

justinretzolk commented 2 weeks ago

Thanks for the additional information here @ktrenchev 👍 Completely understand re: logging and configuration samples. I'll let someone from the team or community speak to some of the specifics here.

Edit: I had a thought that using a later provider version may help, given that we've migrated most of the provider to use AWS SDK for Go V2. In doing so, I noticed the following in the release notes for 5.24.0:

resource/aws_rds_cluster: Avoid an error on delete related to unexpected state 'scaling-compute' (https://github.com/hashicorp/terraform-provider-aws/issues/34187)

It may be worth upgrading to at least provider version 5.24.0 and testing again to see if that bug fix resolves your particular issue.

Fadih commented 1 week ago

@justinretzolk do you know what changed? I'm still using the same AWS provider version, 5.0.0, as before, but since October 13 it has started failing. I can't upgrade to a newer provider version because that would require a lot of changes to my Terraform infrastructure.

Fadih commented 1 week ago

Steps to reproduce: 1) Create an AWS global database. 2) Add a cluster in the west region with one instance. 3) Add a replica in the east region with one instance. 4) Try to restack the complete cluster using a snapshot.

You can see that it starts by deleting the instance in the east region. Then, when it tries to promote the east cluster out of the global database, it does not wait for the promotion to finish and starts the deletion right away, so it fails with: Error: waiting for RDS Cluster (xxxx-dr-global-region-us-east-2-cluster) delete: unexpected state 'promoting', wanted target ''. last error: %!s(<nil>)

Fadih commented 1 week ago

@justinretzolk I already have force_destroy set on the aws_rds_global_cluster resource and it still happens.

hashicorp / terraform-provider-aws

[Bug]: Unable to remove RDS Global Cluster and associated RDS Clusters at one go #39909
