hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Performance issues when referencing high cardinality resources #26355

Open jbardin opened 4 years ago

jbardin commented 4 years ago

Terraform needs to look up resources as complete objects, containing all of their instances, in order to index them in expressions. When the number of instances is large, this can make expression evaluation take considerably longer.

locals {
    users = yamldecode(file("../users.yaml"))
}

resource "null_resource" "user" {
    for_each = local.users

    triggers = {
        name = each.key
        groups = join(",", each.value)
    }
}

resource "null_resource" "user_group" {
    for_each = local.users

    triggers = {
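        # Indexing into null_resource.user below forces Terraform to build
        # the complete object with all user instances before it can evaluate
        # the expression for any individual instance.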
        name = null_resource.user[each.key].triggers.name
        groups = join(",", each.value)
    }
}

Original comment below:

This issue seems to be exacerbated to an extreme degree by the use of for_each.

@eriksw Thanks for the detailed notes and the test case! I've encountered a similar issue while trying to create a large number of users (~1k) in AWS. After some experiments I ended up with another test case demonstrating the performance issue with a certain usage of for_each: https://github.com/mlosev/tf-for_each-performance-test

The issue manifests in both Terraform 0.12.29 and 0.13.0: it takes Terraform an order of magnitude longer (i.e. 10x+) to plan the "slow" configuration versus the "fast" configuration for the same number of users and groups (~1000) and an empty local state (i.e. Terraform wants to create all resources). The key difference between the configurations is:

--- users-0.13-fast/main.tf     2020-08-26 11:06:16.000000000 +0100
+++ users-0.13-slow/main.tf     2020-08-26 11:06:26.000000000 +0100
@@ -15,7 +15,7 @@
     for_each = local.users

     triggers = {
-        name = each.key
+        name = null_resource.user[each.key].triggers.name
         groups = join(",", each.value)
     }
 }

On my laptop, both configurations produce the same plan result (Plan: 2100 to add, 0 to change, 0 to destroy.), but the plan times differ as follows:

- "slow" configuration:

  300.20 real       467.68 user        19.38 sys

- "fast" configuration:

    2.71 real         7.77 user         1.66 sys

This is quite unexpected behaviour for Terraform, especially considering that there are no API calls involved with `null_resource` in this case...

PS. I also tried `-parallelism=100` (just in case) for the "slow" configuration, and it didn't change anything time-wise.

_Originally posted by @mlosev in https://github.com/hashicorp/terraform/issues/18981#issuecomment-680805188_
mhvelplund commented 4 years ago

I made a test where I created 1000 buckets in two ways: with 1000 static resource blocks, and with a for_each loop over a set with 1000 entries. Run time was essentially the same.

Then I extended the test with a thousand output statements, one for each bucket. When referencing the static resources, the runtime was a few seconds longer; but when referencing the ones created with for_each, the runtime was 30 times as long.

For fun I also tried looping with count, and when referencing those, the runtime was 40 times higher!

Test code here: https://github.com/mhvelplund/tf_speed_test
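
A minimal sketch of the kind of comparison described above (hypothetical bucket names; the linked repository contains the actual test code):

# Static resource: the output references one fixed object, which is
# cheap to evaluate.
resource "aws_s3_bucket" "static_0" {
  bucket = "speed-test-static-0"
}

output "static_0_arn" {
  value = aws_s3_bucket.static_0.arn
}

# for_each resource: indexing into it requires Terraform to assemble
# the complete 1000-instance object before evaluating each expression.
resource "aws_s3_bucket" "looped" {
  for_each = toset([for i in range(1000) : "speed-test-looped-${i}"])
  bucket   = each.key
}

output "looped_arn_0" {
  value = aws_s3_bucket.looped["speed-test-looped-0"].arn
}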

dimisjim commented 3 years ago

I am experiencing a similar issue with dynamic blocks and nested for_each usage inside them.

example:

resource "aws_s3_bucket" "retention_bucket" {
  count         = local.condition-3 ? 1 : 0
  bucket        = var.bucket_name
  tags          = var.tags
  acl           = var.acl
  force_destroy = var.force_destroy

  versioning {
    enabled = var.enable_versioning
  }

  dynamic "lifecycle_rule" {
    for_each = concat(local.lifecycle, var.custom_lifecycle_rules)

    content {
      abort_incomplete_multipart_upload_days = lookup(lifecycle_rule.value, "abort_incomplete_multipart_upload_days", null)
      enabled                                = lifecycle_rule.value.enabled
      id                                     = lifecycle_rule.value["id"]
      tags                                   = lookup(lifecycle_rule.value, "tags", null)

      dynamic "expiration" {
        for_each = lifecycle_rule.value["expiration"]

        content {
          days = expiration.value["days"]
        }
      }
...
kostyaplis commented 3 years ago

Has this been fixed? I can't see any related changelog entry, but apply time for my high-cardinality setup of 10k resources dropped more than 10x after upgrading from 0.14.7 to 1.0.0.

jbardin commented 3 years ago

Hi @kostyaplis, there were some more performance improvements to the underlying graph in that time period, which is probably what you are noticing. Unfortunately, resources like this can still cause performance issues, but it's good to hear that improvements elsewhere have reduced the impact caused by configurations like this!

devopsrick commented 3 years ago

Is this issue still being worked on? We have several workspaces with very large for_each sets of resources, and they take a very long time to plan (versus count resources).

raptor75 commented 3 years ago

+1 to this. We have a deployment with over 3500 resources, and it takes 18 minutes to finish a plan just to refresh (no changes). 1500 of these are in a for_each set, for which we are currently working on an alternative that will hopefully help.

To give more details about our environment: we are using a private build agent on a VM (we made sure it has enough RAM, its CPU never reaches 100%, and it has accelerated networking), with the state saved in a Premium storage account for low latency. Everything is in the same region.

danilomnds commented 2 years ago

This is part of the module we use for Azure firewall policies. When we started using this code, plan execution took around 3 minutes. Now it takes 15 minutes. We started with fewer than 10 rules and now have approximately 40.

I checked with Azure support and there is no issue on the Azure side; the ARM (Azure Resource Manager) API is responding fast.

resource "azurerm_firewall_policy_rule_collection_group" "rules" { name = "${var.name}_rules" firewall_policy_id = azurerm_firewall_policy.policy.id priority = var.priority

dynamic "application_rule_collection" { for_each = var.app_rules iterator = app_rules content { name = app_rules.key priority = app_rules.value.priority action = lookup(app_rules.value, "action", "Allow") dynamic "rule" { for_each = app_rules.value.rules content { name = rule.key destination_fqdn_tags = lookup(rule.value, "destination_fqdns_tags", []) destination_fqdns = lookup(rule.value, "destination_fqdns", []) source_addresses = lookup(rule.value, "source_addresses", []) source_ip_groups = lookup(rule.value, "source_ip_groups", []) dynamic "protocols" { for_each = lookup(rule.value, "protocols",lookup(rule.value, "protocol", {})) content { port = lookup(protocols.value, "port", 443) type = lookup(protocols.value, "type", "Https") } } } } } } dynamic "network_rule_collection" { for_each = var.net_rules iterator = net_rules content { name = net_rules.key priority = net_rules.value.priority action = lookup(net_rules.value, "action", "Allow") dynamic "rule" { for_each = net_rules.value.rules content { name = rule.key destination_addresses = lookup(rule.value, "destination_addresses", []) destination_fqdns = lookup(rule.value, "destination_fqdns", []) destination_ip_groups = lookup(rule.value, "destination_ip_groups", []) destination_ports = lookup(rule.value, "destination_ports", []) source_addresses = lookup(rule.value, "source_addresses", []) source_ip_groups = lookup(rule.value, "source_ip_groups", []) protocols = lookup(rule.value, "protocols", ["Any"]) } } } } dynamic "nat_rule_collection" { for_each = var.nat_rules iterator = nat_rules content { name = nat_rules.key priority = nat_rules.value.priority action = "Dnat" dynamic "rule" { for_each = nat_rules.value.rules content { name = rule.key destination_address = lookup(rule.value, "destination_address", "") destination_ports = lookup(rule.value, "destination_ports", []) source_addresses = lookup(rule.value, "source_addresses", []) source_ip_groups = lookup(rule.value, "source_ip_groups", []) protocols = lookup(rule.value, "protocols", []) translated_address = lookup(rule.value, "translated_address", "") translated_port = lookup(rule.value, "translated_port", "") } } } } }

devopsrick commented 2 years ago

We have half a dozen workspaces now that take 30-60 minutes to plan because of this issue. It would be great if it got some attention.

patalwell commented 2 years ago

+1. I have a use case where we'd like to update firewall configuration objects. The for_each works fine for a few hundred address objects, but when we scale up by an order of magnitude we lose performance. We have close to 29k objects we'd like to generate, and the plan still hangs after 48 hours. I've run the TF client in DEBUG mode and still don't have a clue as to why it hangs during the plan.

This may sound rather naive, but I would imagine partitioning the array being fed to a for_each clause would help improve performance, e.g. launch a new thread for every N objects to be created and maintain the state in a concurrent hash table or some other atomic composite type.
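
A minimal sketch of the partitioning idea in plain Terraform, assuming a hypothetical var.address_objects map; this only splits one large for_each into smaller ones and is not a confirmed mitigation:

variable "address_objects" {
  type = map(string)
}

locals {
  chunk_count = 10

  # Deterministically assign every key to one of chunk_count buckets,
  # so each resource block below only carries a fraction of the objects.
  chunks = {
    for idx in range(local.chunk_count) : tostring(idx) => {
      for i, k in sort(keys(var.address_objects)) :
      k => var.address_objects[k] if i % local.chunk_count == idx
    }
  }
}

# One resource block per chunk (repeated for "1" through "9", or wrapped
# in a module) keeps any single for_each map small.
resource "null_resource" "address_chunk_0" {
  for_each = local.chunks["0"]

  triggers = {
    value = each.value
  }
}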

I'll try to fish around the core code to see if there are any clues on how to improve performance. Kind of a newbie here, so forgive me if this comes across as obvious. I was also considering ditching the TF client altogether and just building my own tooling to provision infrastructure, with an easily configurable thread pool and workers.

matttrach commented 2 months ago

Reproduced in v1.5.7

jbardin commented 2 months ago

#35558 should go a long way towards mitigating the performance issues when a large number of resource instances needs evaluation.