hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Introduce a pre-plan command #33260

Closed tarciosaraiva closed 1 year ago

tarciosaraiva commented 1 year ago

Terraform Version

1.4.6

Use Cases

Minimize planning time by identifying which resources have changed according to the latest state so they can be targeted individually.

Attempted Solutions

I tried to identify the changed resources using a mixture of git diff commands and shell scripts to manipulate the text, but wasn't very successful.

Proposal

Introduce a new option named preplan that would do exactly what plan does today, but without the refresh. Instead of outputting the changes, it would provide a list of the resources that would be modified.

With this data in hand, we could target individual resources instead of planning the whole configuration.

The reason for this is that if the provider being used targets an API where latency is very high, the planning phase takes a very long time. In a Terraform module where many resources are managed in a collection, the output of preplan could work as an input list to a targeted plan, reducing the planning time and modifying only the resources that were affected.

The inspiration for this is the Buildkite plugin for monorepos, where we instruct it to look at different parts of the repo and build only what is needed.

This would be great for situations where we manage a collection of resources and only want to affect a single item within that collection.

References

No response

jbardin commented 1 year ago

Hi @tarciosaraiva,

Thanks for filing the issue. Can you explain further how this would be different from the current plan with -refresh=false? If I understand correctly, you are looking for the resources which require updates based on the prior state and configuration (without reading the remote API), which is precisely what -refresh=false already does.
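For readers who want the list of affected resources that the proposal describes, here is a minimal sketch built on existing commands rather than a new one. It assumes the plan was saved with `terraform plan -refresh=false -out=tfplan` and exported with `terraform show -json tfplan`, and it reads the `resource_changes` entries of that JSON output. The function names and the sample topic keys are hypothetical and used only for illustration.

```python
# Sketch: derive a list of changed resource addresses from a JSON plan.
# Assumed workflow (not part of this script):
#   terraform plan -refresh=false -out=tfplan
#   terraform show -json tfplan > plan.json
# The dict below stands in for json.load(open("plan.json")).


def changed_addresses(plan: dict) -> list:
    """Return addresses of resources whose planned actions change something."""
    addresses = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "no-op" and "read" do not modify managed infrastructure.
        if any(action not in ("no-op", "read") for action in actions):
            addresses.append(rc["address"])
    return addresses


def target_args(addresses: list) -> list:
    """Turn resource addresses into -target arguments for a later plan."""
    return ["-target=" + address for address in addresses]


# Hypothetical sample data shaped like the JSON plan output.
sample_plan = {
    "resource_changes": [
        {"address": 'kafka_topic.topics["a"]', "change": {"actions": ["no-op"]}},
        {"address": 'kafka_topic.topics["b"]', "change": {"actions": ["update"]}},
        {"address": 'kafka_topic.topics["c"]', "change": {"actions": ["delete", "create"]}},
    ]
}

for arg in target_args(changed_addresses(sample_plan)):
    print(arg)
```

Whether a follow-up targeted plan is actually faster depends on where the time goes, and the Terraform documentation describes `-target` as intended for exceptional situations rather than routine use, so treat this as an experiment and not a recommended workflow.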

Thanks!

tarciosaraiva commented 1 year ago

Hi @jbardin, thanks for your prompt reply.

I tried the -refresh=false approach and noticed a slight improvement in build time, but not enough to give us the edge I'm looking for. Let me explain a bit more about our problem, as I'm starting to believe it might be related to poor architecture in our Terraform module.

We are using a popular open-source provider that integrates with Kafka, and we use it to manage topics. The current state for one of our environments has more than 4K resources managed in a single plan (that's the architecture bit I'm referring to).

When I tried terraform plan -refresh=false, it looked like Terraform did not go out to Kafka to fetch the information, as I did not see the usual "refreshing state" message in the logs. When I set TF_LOG=DEBUG I saw a lot of this:

2023-05-26T11:34:43.590+1000 [DEBUG] ReferenceTransformer: "kafka_topic.topics[\"my-topic-name\"]" references: []
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "local.envs_topics"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "local.topics"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "var.environment_partition_map"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "var.environment_partition_map"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "var.default_replication_factor"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "var.default_config"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "local.dwh_topics_extra_config"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "var.config_hard_override"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"
2023-05-26T11:34:43.590+1000 [INFO]  ReferenceTransformer: reference not found: "each.value"

I can see that this comes from Terraform itself and not from the provider, and I don't really understand why it printed that. However, I could not find any evidence in the logs that a call was being made to the Kafka broker, so I'm assuming that Terraform's plugin infrastructure prevents that from happening and that providers don't have to do anything special. It would be good to have clarity on that, hence my submission.

Now, assuming that is correct and -refresh=false works as we both expect, I can only conclude that the state is too big and that it will take Terraform a long time to process that many resources, irrespective of whether there's latency involved.

Looking forward to your feedback. Thanks!

jbardin commented 1 year ago

Hi @tarciosaraiva,

Those log messages are for debug purposes, and not very useful without context. Using -refresh=false does always skip the ReadResource rpc call, which is often the slowest part of planning.

The plugin is still active, however, in that we must get the resource schema to decode the config and state, and call PlanResourceChange to determine what change, if any, there is for each resource. While this is almost always an offline operation, some providers may require API calls to make certain planning decisions as well. Depending on where the time is spent, you might be able to speed up the offline operations by increasing the -parallelism flag value to allow more concurrent calls to the provider, so that resource changes are processed more efficiently.
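To check whether a higher -parallelism value helps in a particular setup, the plan can be timed at a few settings. The following is a rough sketch, assuming terraform is on the PATH and the working directory is already initialized; `plan_command` and `time_plan` are hypothetical helper names, and only the `-parallelism` and `-refresh=false` flags come from this thread.

```python
import subprocess
import time


def plan_command(parallelism: int = 10, refresh: bool = False) -> list:
    """Build the argument list for a terraform plan run.

    10 is Terraform's documented default for -parallelism.
    """
    cmd = ["terraform", "plan", "-parallelism=" + str(parallelism)]
    if not refresh:
        # Skip reading every resource from the remote API.
        cmd.append("-refresh=false")
    return cmd


def time_plan(parallelism: int, workdir: str = ".") -> float:
    """Run one plan in workdir and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(
        plan_command(parallelism),
        cwd=workdir,
        check=True,
        stdout=subprocess.DEVNULL,
    )
    return time.monotonic() - start


# Show the command that would be run; call time_plan(...) to execute it.
print(" ".join(plan_command(20)))
```

Calling `time_plan` for a few values such as 10, 20, and 40 gives a direct comparison. If the timings barely move, the bottleneck is probably not provider concurrency.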

tarciosaraiva commented 1 year ago

Hi @jbardin, thank you for that information. I tried increasing the parallelism, but it did not improve the planning time. Based on my current understanding, a refactor of our module seems more appropriate, and a few colleagues suggested we use Terragrunt, so we might give that a go.

Thank you again for being so prompt.
