gruntwork-io / terragrunt

Terragrunt is a flexible orchestration tool that allows Infrastructure as Code written in OpenTofu/Terraform to scale.
https://terragrunt.gruntwork.io/
MIT License
7.94k stars 965 forks source link

Terragrunt hangs at 100% CPU for 5 minutes #2154

Closed magnetik closed 2 years ago

magnetik commented 2 years ago

Hi,

Recently (after upgrating both terraform & terragrunt to latest), terragrunt apply hangs for 5 minutes with 100% CPU.

When doing htop, I've noticed the subprocesses:

  1. terraform init that takes some seconds
  2. terraform outputs that takes some seconds too
  3. terragrunt apply hanging for some long minutes. Each subprocess taking 100% CPU one after another for 4 minutes.

I've tried to add --terragrunt-debug but it only generated the terragrunt-debug.tfvars.json after the 5 minutes and it's some super helpful.

What can I do?

Thanks!

denis256 commented 2 years ago

Hi, running --terragrunt-log-level debug may show more information on performed steps

magnetik commented 2 years ago

Hi,

Thanks for your answer !

So after starting terragrunt, I have

  1. 43 seconds since the start for the terraform init/output
  2. 1 minute after the log "Included config mergin in (shallow)
  3. 2 minutes of "Downloading Terraform from file://"

My terraform files are on a NFS to share with a virtual machine. The performance might not be good enough for terragrunt.

Not sure if there is anything that terragrunt can do.

denis256 commented 2 years ago

Maybe will help set terragrunt-download-dir on a local path to avoid additional requests over NFS creation of .terragrunt-cache

https://terragrunt.gruntwork.io/docs/reference/cli-options/#terragrunt-download-dir

Sinjo commented 2 years ago

Hi, I ran into this myself today and think I might be able to help narrow it down!

In terms of my setup:

First off, I bisected recent releases, and it looks like the issue was introduced in v0.37.3. It seems that the additional hashing work introduced by #2006 can end up adding a lot of overhead on large module trees.

I built a custom binary from that tag and included pprof in it, then used its CPU profiling functionality to generate a flame graph (using 30s of data sampled from when my terragrunt plan invocation was wedged for around 5 minutes at 100% CPU):

Flame graph described above

I'm not familiar enough with this part of Terragrunt to say what the right answer is here, but I'm hoping that this data can help point you in the right direction for a fix.

I'm happy to collect additional data for you, and also try building off a branch if you do have a proposed fix.


As an aside, would you be open to the addition of a flag to Terragrunt that starts a web server with pprof enabled? I ended up compiling a custom version to do this, and I figure it could be a useful debugging flag to have built in. Happy to send a PR if you think it's a good idea!

magnetik commented 2 years ago

Maybe will help set terragrunt-download-dir on a local path to avoid additional requests over NFS creation of .terragrunt-cache

https://terragrunt.gruntwork.io/docs/reference/cli-options/#terragrunt-download-dir

I'm already using the env variable $TERRAGRUNT_DOWNLOAD with the value /tmp/terragrunt-cache but it do not change much.

magnetik commented 2 years ago

I'm confirming that 0.37.2 is not affected and exactly the same code with the same configuration (with $TERRAGRUNT_DOWNLOAD on /tmp) takes 48 seconds

denis256 commented 2 years ago

Hi, good findings, I think it confirms that directory hash calculation consume CPU, I think can be improved by changing hash function

Sinjo commented 2 years ago

So I spent a bit more of last night reading through the code involved, and I just want to confirm that I've understood the aim of #2006 right.

Previously, Terragrunt would always treat local modules as something that needed to be downloaded (well, copied, but go-getter handles either) on every invocation. Unlike remote modules, there was no caching of them.

2006 removed the condition in alreadyHaveLatestCode that always returned false for local modules, which was causing them to be copied into the Terragrunt cache folder on every invocation even if they hadn't changed. It introduced a caching mechanism based on calculating the hash of a module by recursively hashing the file contents of the module's directory tree.

Assuming I've understood that right it seems unlikely that that approach will yield a performance benefit. By the time you've opened a file and calculated its hash, you might as well have copied it (unless the cache directory you're writing to is on extremely slow storage).

I was thinking a little more about this today, and another option might be to calculate a hash of the mtimes of all the files in the module. It would mean that the entire hash could be calculated by recursively running stat on the module's directory tree, rather than having to read each file (the next most expensive call in the flame graph) and put its entire contents through the (comparatively expensive) hash function.

We could also try swapping in something like MD5 instead of SHA256 (this usecase doesn't need the extra collision resistence - we're not trying to fend off malicious actors in our own source trees), but I suspect it would still be quite slow overall - definitely much slower than hashing the <file_path>:<mtime> pairs.

If my understanding above is right, and you think that's a decent path to take, I'm happy to give it a go!

denis256 commented 2 years ago

Hi, above findings are correct, improvement in the identification of changes that will use fewer resources will definitely help

denis256 commented 2 years ago

Improvement released in https://github.com/gruntwork-io/terragrunt/releases/tag/v0.38.4

Sinjo commented 2 years ago

Thanks for all the review feedback @denis256. Awesome to see this released!