hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Graph building is slow with very large numbers of dependencies #27523

Open · imyousuf opened this issue 3 years ago

imyousuf commented 3 years ago

Terraform Version

0.14.4

Terraform Configuration Files

https://github.com/imyousuf/terraform-aws-webhook-broker/blob/improvements/main.tf

Debug Output

TF Cloud: run-kExnECzF5U5aRXML-plan-log.txt. Desktop: plan-trace-localmachine.zip. In the logs, around 2021/01/16 11:27 all log output stops for 7 minutes before it starts again; to find that point in the log, search for the first occurrence of: 2021/01/16 11:34:39 [TRACE] attachDataDependenciesTransformer

Crash Output

Expected Behavior

The plan finishes successfully and within a reasonable amount of time.

Actual Behavior

In TF Cloud the plan times out; locally, a single plan takes an extremely long time to complete.

Steps to Reproduce

  1. terraform init
  2. terraform plan

Additional Context

The slowness seems to be associated with my refactoring of the configuration into reusable modules and then adding whole modules to depends_on.
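For reference, the pattern in question looks roughly like the following sketch; the module names are placeholders rather than the ones in the linked main.tf.

module "eks" {
  source = "./modules/eks"
}

module "k8s_tools" {
  source = "./modules/k8s-tools"

  # A module-level depends_on applies to every resource inside this module,
  # not to any single resource within it.
  depends_on = [module.eks]
}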

References

jbardin commented 3 years ago

Hi @imyousuf,

The cause of the apparent delay in processing the plan can be hard to determine, but you are on the right track with attachDataDependenciesTransformer. It seems you have created a rather large configuration, with tens of thousands of dependencies to process, which can take a considerable amount of time to collect and store.

The depends_on meta-parameter should be needed for resources only occasionally, and for modules it is almost never required. Even when depends_on is usable with a module, there is always a better way to achieve the desired result by configuring the data flow to the specific resources that require the dependency, rather than to all module resources at once.
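As a minimal sketch of that advice (the module and output names below are assumptions, not taken from the actual configuration), the targeted alternative is to pass the specific upstream value into the input that needs it, so only the resources that consume it pick up the dependency:

module "eks" {
  source = "./modules/eks"
}

module "k8s_tools" {
  source = "./modules/k8s-tools"

  # Referencing a specific output creates an implicit dependency only for the
  # resources inside this module that actually use the value, instead of
  # tying every resource in the module to every resource in module.eks.
  cluster_endpoint = module.eks.cluster_endpoint
}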

The result shown here is the current expected behavior in this case, but since we don't have an issue for performance with large numbers of dependencies, I'm going to re-tag this as an enhancement request. Due to how terraform needs to track dependencies between resources, I would not expect any huge gains here in the near future, but there may be some optimizations that could be made.

Thanks!

imyousuf commented 3 years ago

@jbardin Thank you for the detailed comment. I did remove the explicit dependency and that seems to fix the issue with plan.

I do agree that there is an "issue" in the core in how dependencies are treated; if you do not mind, can you please point me to the code for this? I would just like to see where the process is spending its cycles. I believe it is in the sort-ordering of the dependency graph, so I would love to explore how it is implemented.

Here is why I added the dependency -

module.eks.module.eks.aws_security_group_rule.workers_ingress_self[0]: Destruction complete after 0s
module.eks.module.eks.aws_security_group_rule.workers_ingress_cluster_https[0]: Destruction complete after 1s
module.eks.module.eks.aws_security_group_rule.workers_ingress_cluster[0]: Destruction complete after 1s
module.eks.module.eks.aws_security_group_rule.workers_egress_internet[0]: Destruction complete after 2s

Error: Delete "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp [::1]:80: connect: connection refused
Error: could not open kubeconfig "kubeconfig_test-eks-w7b6": stat kubeconfig_test-eks-w7b6: no such file or directory

The conundrum I was in: without the dependencies, the helm resources were created so far apart that the kube token would expire and the create would error out; a re-run would pick up from where it errored. During destroy, however, I could not delete the helm resources consistently. Now that I have moved to a file-based k8s config, the config file gets deleted before all of the helm resources are deleted :) so those error out, and to complete the destroy I need to run it repeatedly and finally delete the tfstate file manually.

In a nutshell: without the dependency, plan works fine, but destroy is a PIA.

jbardin commented 3 years ago

Hi @imyousuf, I'm not sure the implementation is going to shed any light on why the destroy isn't working smoothly, but that log message is out of date and the relevant function is now called attachDataResourceDependenciesTransformer.

With the size of the configuration it would take me some time to fully understand what you're trying to do here, but the described behavior is consistent with having independent data sources and managed resources that represent the same object. If that guess is correct, it is better to pass the values from the initial resource rather than try and coordinate a separate data source to read it.
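To make that guess concrete, here is a minimal sketch with illustrative names (they are not taken from the linked configuration) of the difference between re-reading an object through a data source and passing the managed resource's values forward:

# The managed resource that creates the object.
resource "aws_security_group" "workers" {
  name_prefix = "workers-"
}

# The pattern the guess refers to: a separate data source reads back the same
# object, so Terraform has to coordinate when that read is allowed to happen.
data "aws_security_group" "workers" {
  id = aws_security_group.workers.id
}

# Preferred: pass the managed resource's attributes directly to whatever needs
# them; the implicit dependency comes along with the reference.
module "app" {
  source                   = "./modules/app"
  worker_security_group_id = aws_security_group.workers.id
}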

imyousuf commented 3 years ago

@jbardin Thank you again for the thoughtful, detailed comment. I was actually looking for the code behind that log line, which is where the plan phase spends a long time. I will check that pointer, thank you!

On the point of destroy, here is what I observe happening:

  1. We spin up an EKS cluster.
  2. As part of the above step, the EKS cluster writes a kubeconfig file locally; we use that file to configure the kubernetes and helm providers.
  3. Then we have two modules: one for k8s tools and one for our project.
  4. The only implicit dependency is on the EKS cluster ID.

When destroy is triggered, because there is no explicit dependency between the kubeconfig file and the providers or modules, the file gets deleted first, while the modules that depend on it have not yet been destroyed. EKS login then fails for lack of a config file. What I tried was a script of sequential partial destroys before the full destroy, and that worked.
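For what it's worth, one commonly used pattern that avoids the local kubeconfig file entirely is to configure the kubernetes and helm providers from the cluster's own attributes, so there is no file that can be deleted out from under the providers during destroy. This is only a sketch, assuming the cluster is available as an aws_eks_cluster resource (or as equivalent module outputs); it is not taken from the configuration above.

# Assumes the cluster is managed elsewhere in this configuration as
# aws_eks_cluster.this (or exposed as module outputs with the same values).
data "aws_eks_cluster_auth" "this" {
  name = aws_eks_cluster.this.name
}

provider "kubernetes" {
  host                   = aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}

provider "helm" {
  kubernetes {
    host                   = aws_eks_cluster.this.endpoint
    cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.this.token
  }
}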