Open jbardin opened 4 years ago
I made a test where I created 1000 buckets in two ways: with 1000 resource statements, and with a loop over a set with 1000 entries. Run time was essentially the same.
Then I extended the test to have a thousand output statements, one for each bucket. When referencing the static version, the runtime was a few seconds longer, but when referencing the ones created with for_each, the runtime was 30 times as long.
For fun I tried using count to loop, and when referencing those, the runtime was 40 times higher!
Test code here: https://github.com/mhvelplund/tf_speed_test
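A minimal sketch of the three variants described above, for readers who do not want to open the repository. The resource names, bucket names, and output names here are illustrative assumptions and are not copied from the linked test code:

```hcl
# Static variant: one block per bucket (the test repeats this 1000 times).
resource "aws_s3_bucket" "static_0" {
  bucket = "speed-test-static-0"
}

# for_each variant: a single block expanded over a set of 1000 names.
resource "aws_s3_bucket" "looped" {
  for_each = toset([for i in range(1000) : "speed-test-looped-${i}"])
  bucket   = each.key
}

# count variant: a single block expanded by index.
resource "aws_s3_bucket" "counted" {
  count  = 1000
  bucket = "speed-test-counted-${count.index}"
}

# One output per bucket (again repeated 1000 times per variant in the test).
output "static_0" {
  value = aws_s3_bucket.static_0.id
}

output "looped_0" {
  value = aws_s3_bucket.looped["speed-test-looped-0"].id
}

output "counted_0" {
  value = aws_s3_bucket.counted[0].id
}
```

The slowdown reported above appears only once the per-instance outputs are added, not when the resources are merely declared.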
I am experiencing a similar issue with dynamic blocks and nested for_each usage inside them.
example:
resource "aws_s3_bucket" "retention_bucket" {
  count         = local.condition-3 ? 1 : 0
  bucket        = var.bucket_name
  tags          = var.tags
  acl           = var.acl
  force_destroy = var.force_destroy
  versioning {
    enabled = var.enable_versioning
  }
  dynamic "lifecycle_rule" {
    for_each = concat(local.lifecycle, var.custom_lifecycle_rules)
    content {
      abort_incomplete_multipart_upload_days = lookup(lifecycle_rule.value, "abort_incomplete_multipart_upload_days", null)
      enabled                                = lifecycle_rule.value.enabled
      id                                     = lifecycle_rule.value["id"]
      tags                                   = lookup(lifecycle_rule.value, "tags", null)
      dynamic "expiration" {
        for_each = lifecycle_rule.value["expiration"]
        content {
          days = expiration.value["days"]
        }
      }
      ...
Has it been fixed?
I can't see any related changelog entry, but apply time for my high-cardinality setup of 10k resources dropped by more than 10x after upgrading from 0.14.7 to 1.0.0.
Hi @kostyaplis, there were some more performance improvements to the underlying graph in that time period, which you are probably noticing. Unfortunately resources like this can still cause performance issues, but it's good to hear that improvements elsewhere have reduced the impact caused by configurations like this!
Is this issue still being worked on? We have several workspaces with very large for_each sets of resources and they take so very long to plan (vs count resources).
+1 to this. We have a deployment with over 3500 resources, and it takes 18 minutes to finish a plan that only refreshes (no changes). 1500 of these are in a for_each set, for which we are currently working on an alternative that will hopefully help.
To give more details about our environment: we are using a private build agent on a VM (we made sure it has enough RAM, its CPU never reaches 100%, and it has accelerated networking), with the state saved in a Premium storage account for low latency. Everything is in the same region.
This is part of the module that we have here for azure firewall policies. When we started to use this code the plan execution was taking 3 minutes or so. Now it's taking 15 minutes. We started with less than 10 rules and now we have approximately 40 rules.
I checked with Azure support and there is no issue on the Azure side; that is, ARM (the Azure Resource Manager API) is responding quickly.
resource "azurerm_firewall_policy_rule_collection_group" "rules" {
  name               = "${var.name}_rules"
  firewall_policy_id = azurerm_firewall_policy.policy.id
  priority           = var.priority

  dynamic "application_rule_collection" {
    for_each = var.app_rules
    iterator = app_rules
    content {
      name     = app_rules.key
      priority = app_rules.value.priority
      action   = lookup(app_rules.value, "action", "Allow")
      dynamic "rule" {
        for_each = app_rules.value.rules
        content {
          name                  = rule.key
          destination_fqdn_tags = lookup(rule.value, "destination_fqdns_tags", [])
          destination_fqdns     = lookup(rule.value, "destination_fqdns", [])
          source_addresses      = lookup(rule.value, "source_addresses", [])
          source_ip_groups      = lookup(rule.value, "source_ip_groups", [])
          dynamic "protocols" {
            for_each = lookup(rule.value, "protocols", lookup(rule.value, "protocol", {}))
            content {
              port = lookup(protocols.value, "port", 443)
              type = lookup(protocols.value, "type", "Https")
            }
          }
        }
      }
    }
  }

  dynamic "network_rule_collection" {
    for_each = var.net_rules
    iterator = net_rules
    content {
      name     = net_rules.key
      priority = net_rules.value.priority
      action   = lookup(net_rules.value, "action", "Allow")
      dynamic "rule" {
        for_each = net_rules.value.rules
        content {
          name                  = rule.key
          destination_addresses = lookup(rule.value, "destination_addresses", [])
          destination_fqdns     = lookup(rule.value, "destination_fqdns", [])
          destination_ip_groups = lookup(rule.value, "destination_ip_groups", [])
          destination_ports     = lookup(rule.value, "destination_ports", [])
          source_addresses      = lookup(rule.value, "source_addresses", [])
          source_ip_groups      = lookup(rule.value, "source_ip_groups", [])
          protocols             = lookup(rule.value, "protocols", ["Any"])
        }
      }
    }
  }

  dynamic "nat_rule_collection" {
    for_each = var.nat_rules
    iterator = nat_rules
    content {
      name     = nat_rules.key
      priority = nat_rules.value.priority
      action   = "Dnat"
      dynamic "rule" {
        for_each = nat_rules.value.rules
        content {
          name                = rule.key
          destination_address = lookup(rule.value, "destination_address", "")
          destination_ports   = lookup(rule.value, "destination_ports", [])
          source_addresses    = lookup(rule.value, "source_addresses", [])
          source_ip_groups    = lookup(rule.value, "source_ip_groups", [])
          protocols           = lookup(rule.value, "protocols", [])
          translated_address  = lookup(rule.value, "translated_address", "")
          translated_port     = lookup(rule.value, "translated_port", "")
        }
      }
    }
  }
}
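For context, here is a guess at the shape of input the module above reads, inferred only from the keys it looks up. The collection name, rule name, addresses, and FQDN are hypothetical, and the real variable definitions may differ:

```hcl
# Hypothetical tfvars fragment; only keys referenced by the module are shown.
app_rules = {
  "allow-web" = {              # becomes application_rule_collection.name
    priority = 100
    action   = "Allow"
    rules = {
      "to-example" = {         # becomes rule.name
        destination_fqdns = ["example.com"]
        source_addresses  = ["10.0.0.0/24"]
        protocols = {
          https = { port = 443, type = "Https" }
        }
      }
    }
  }
}
```

Each level of this structure feeds one nested dynamic block, so the number of generated blocks grows with collections × rules × protocols.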
We have half a dozen workspaces now that take 30-60 minutes to plan because of this issue. It would be great if it got some attention.
+1. I have a use case where we'd like to update firewall configuration objects. for_each works fine for a few hundred address objects, but when we scale up by an order of magnitude or more, performance degrades. We have close to 29k objects we'd like to generate, and the plan hangs after 48 hours. I've put the TF client in DEBUG mode and still don't have a clue as to why it is hanging on the plan.
This may sound rather naive, but I would imagine partitioning the array being fed to a for_each clause would help improve performance, e.g. launch a new thread for every N objects I'd like to create and maintain the state in a concurrent hash table or some other atomic composite type.
I'll try to fish around the core code to see if there are any clues about how to improve performance. I'm kind of a newbie here, so forgive me if this comes across as axiomatic. I was also considering ditching the TF client altogether and just building my own tools to provision infrastructure with an easily configurable thread pool and workers.
Reproduced in v1.5.7
Terraform needs to look up resources as complete objects containing all instances in order to be able to index them in expressions. When the number of instances is large, this can make expression evaluation take considerably longer.
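To illustrate, a small sketch under stated assumptions: `var.user_names` and the `aws_iam_user.all` resource are hypothetical names chosen for the example. The single-map output at the end is one way to reference the resource once instead of once per instance; whether it helps a given configuration is not confirmed anywhere in this thread.

```hcl
# Hypothetical input: a large set of user names.
variable "user_names" {
  type = set(string)
}

resource "aws_iam_user" "all" {
  for_each = var.user_names
  name     = each.key
}

# Per-instance reference: to evaluate the index expression, Terraform first
# looks up aws_iam_user.all as one object holding every instance. Repeating
# this pattern for N instances repeats that lookup N times.
output "alice_arn" {
  value = aws_iam_user.all["alice"].arn # assumes "alice" is in var.user_names
}

# Possible alternative (an assumption, not a confirmed fix): reference the
# whole resource a single time and expose all ARNs as one map.
output "user_arns" {
  value = { for name, user in aws_iam_user.all : name => user.arn }
}
```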
Original comment below:
@eriksw Thanks for the detailed notes and the test case! I encountered a similar issue while trying to create a large number of users (~1k) in AWS. After some experiments I ended up with another test case demonstrating a performance issue with certain usage of for_each: https://github.com/mlosev/tf-for_each-performance-test
The issue manifests in both Terraform 0.12.29 and 0.13.0: it takes Terraform an order of magnitude longer (i.e. 10x+) to plan the "slow" configuration than the "fast" configuration for the same number of users and groups (~1000) and an empty local state (i.e. Terraform wants to create all resources). The key difference between the configurations is
On my laptop, the difference in plan time is as follows:
Plan: 2100 to add, 0 to change, 0 to destroy.