hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Terraform infinite loop when destroy #25203

Closed quentin9696 closed 4 years ago

quentin9696 commented 4 years ago

Terraform Version

Terraform v0.12.26
+ provider.aws v2.57.0
+ provider.null v2.1.2
+ provider.random v2.2.1
+ provider.template v2.1.2

Debug Output

https://gist.github.com/quentin9696/19cfa05818f8a912d4371370deb67a9e

Actual Behavior

I tried to destroy resources using terraform destroy, but after I approved the plan, Terraform did nothing; it seemed to be stuck in an infinite loop. I waited 3 hours before killing the process.

I am sharing part of the “infinite loop” log, but the full file is too big to copy and paste here.

danieldreier commented 4 years ago

Hi! Thanks for reporting this. I think this is probably a valid issue, and I'd like to reproduce it locally. To do that, I have to be able to run this on my workstation without inventing any details, so that I can be confident we're seeing the same behavior. As-is, I don't have the Terraform code you used to produce this behavior, and so I'm stuck.

Can you please restate your reproduction case such that I can copy-paste it and run it locally? Ideally, this would use the null resource provider rather than a real provider in order to minimize external dependencies, but if you cannot reproduce it without AWS resources I can absolutely reproduce it there as well. If it's too big to copy-paste, you could make a PR against https://github.com/danieldreier/terraform-issue-reproductions as a way to share a reproduction case with me. Without that, I really don't know how to move forward on this.
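One way to build a shareable reproduction of this kind without any customer code is to generate it. The sketch below is only an assumption about what such a case could look like: the `./child` module, the `child_N` names, and the helper functions `child_module` and `root_module` are all hypothetical, and nothing here has been confirmed to trigger the slow destroy. It emits repeated `module` blocks because `count` and `for_each` on modules are not available before 0.13.

```python
def child_module():
    """Source for a hypothetical ./child/main.tf that uses only the null provider."""
    return '''variable "input" {
  type = string
}

resource "null_resource" "this" {
  triggers = {
    input = var.input
  }
}

output "id" {
  value = null_resource.this.id
}
'''


def root_module(n_calls):
    """Source for a hypothetical root main.tf that calls ./child n_calls times.

    Each call receives the ids of all earlier calls, so the module calls
    depend on one another, and each call is also exposed as a root output.
    """
    blocks = []
    for i in range(n_calls):
        # Always start with a literal so the list is never empty.
        refs = ", ".join(['"root"'] + [f"module.child_{j}.id" for j in range(i)])
        blocks.append(
            f'module "child_{i}" {{\n'
            f'  source = "./child"\n'
            f'  input  = join(",", [{refs}])\n'
            f'}}\n'
        )
        blocks.append(
            f'output "child_{i}" {{\n'
            f'  value = module.child_{i}.id\n'
            f'}}\n'
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Print a root configuration with 16 module calls, as in the first case reported.
    print(root_module(16))
```

Writing the two strings to `main.tf` and `child/main.tf`, then running terraform init, terraform apply, and terraform destroy with different values of `n_calls`, would show whether the delay after plan approval grows with the number of module calls.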

quentin9696 commented 4 years ago

Hi! I'm still working to isolate the issue, but it's very hard to do. I can't give you the code as is, because it's part of a customer's private code, but I'll try to obfuscate it and send you a copy.

I can see that this issue appears when I call a sub-module. In one case I call it 16 times; in a second case I call it 65 times. In the first case, I reduced the sub-module calls to only 1, and it took about 4 minutes after approving the destroy plan for Terraform to start destroying my resources. This time increases with the number of sub-module calls.

I can also tell you that those sub-modules have a lot of dependencies between modules.

It could be the dependency graph calculation. What do you think?

jbardin commented 4 years ago

Hi @quentin9696,

If smaller test cases do complete, albeit more slowly, it is likely to be related to handling the dependency graph. Highly connected graphs are very slow to process, and root module outputs, being what they are, are often transitively connected to a large portion of the graph.

There will be numerous graph performance improvements included in the 0.13 release, however #25213 looks like it will specifically fix this problem, by reducing the scope of what needs to be traversed.
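To illustrate the point about highly connected graphs, here is a toy sketch in Python. It is not Terraform's graph code: the layout (every module depending on all earlier modules, plus one root output per module) and the names `build_graph` and `reachable` are assumptions made purely for illustration. It shows how the number of edges grows much faster than the number of module calls, and how a single root output ends up transitively connected to every module.

```python
from collections import deque


def build_graph(n_modules):
    """Hypothetical layout: module i depends on every earlier module,
    and there is one root output per module."""
    graph = {}
    for i in range(n_modules):
        graph[f"module.m{i}"] = [f"module.m{j}" for j in range(i)]
        graph[f"output.o{i}"] = [f"module.m{i}"]
    return graph


def reachable(graph, start):
    """Return every node reachable from `start` by following dependency edges
    (breadth-first search), excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    seen.discard(start)
    return seen


# Module counts taken from the cases reported in this issue.
for n in (1, 16, 65):
    g = build_graph(n)
    edges = sum(len(deps) for deps in g.values())
    connected = len(reachable(g, f"output.o{n - 1}"))
    print(f"{n} modules: {edges} edges, last output reaches {connected} nodes")
```

In this toy layout the edge count grows roughly with the square of the number of modules, which is one plausible reason why reducing the scope of what needs to be traversed would help.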

quentin9696 commented 4 years ago

Hi @jbardin

As far as I can see, the problem should be solved in Terraform 0.13? I checked your test results in #23811 and saw a gain of between 50 and 90%.

Nothing for 0.12?

danieldreier commented 4 years ago

@quentin9696 we are not planning on backporting any enhancements to 0.12.x, and are only expecting to do more 0.12.x releases if serious, show-stopping bugs or security problems are discovered prior to the 0.13.0 GA release.

I don't quite understand your last response - are you saying that you tested your codebase with the 0.13.0 beta 1 and found that it performs significantly better? It's unclear to me whether 0.13 is enough faster for you to consider this fixed, or if it's still too slow on 0.13.0 to be usable for you.

quentin9696 commented 4 years ago

@danieldreier I just checked the performance test in PR #23811. I haven't tried running Terraform 0.13-beta1 yet. I'll do it tomorrow.

gui-don commented 4 years ago

@danieldreier Hi! Thanks for your work on it.

For me, it is a show-stopping bug for two reasons.

First, it is very hard to find out what in the code causes the issue, because the tests need to create a lot of resources; if we cannot destroy these resources, it's a manual deletion nightmare.

Second, our client would not be able to destroy their resources. They have a lot of them, and the design that suits them best is one deployment per environment (dev, prod, etc.).

danieldreier commented 4 years ago

@gui-don thanks for chiming in! I get that destroying resources is a pretty crucial operation. I want to be clear that the dependency graph performance improvements are involved and I don't think they're really feasible to backport. I'm also expecting 0.13 to be a relatively easy upgrade for most people - way, way easier than the 0.11 -> 0.12 upgrade, because we don't have the same kind of syntax changes we did when we adopted HCL2 in 0.12.0. Please try out 0.13 for your clients. If you're still seeing terribly, unusably slow destroy operations in 0.13.0, that's definitely a big deal and I would like to hear about it.

gui-don commented 4 years ago

@danieldreier I understand. We are in the process of testing with Terraform 0.13; our code just needed tiny adjustments to pass validation. I'll keep you up to date on how that goes. We are also trying to figure out what exactly is causing the issue: some of our modules' destroys run fast and others don't, without obvious differences between them.

In case 0.13 solves our issue:

I see there is no due date for Terraform 0.13. Do you have a rough estimate for it? Can we help in any way to accelerate the beta and RC process?

danieldreier commented 4 years ago

@gui-don 0.13.0 GA should ship in about a month, unless we find major bugs in the next beta and RC.

The best way you can help with 0.13 is to test! Beta 2 will be released tomorrow - please try it out, and if you find problems, provide really detailed, exact reproduction cases.

I really appreciate you trying out the beta!

quentin9696 commented 4 years ago

Hi @danieldreier,

I converted all my modules to Terraform 0.13 and can apply with a target, because there is an issue with Terraform 0.13: #25307

quentin9696 commented 4 years ago

Hi @danieldreier

Terraform 0.13 beta3 solved this issue.

You can close this issue.

Thank you!

jbardin commented 4 years ago

Thanks @quentin9696!

ghost commented 3 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.