gruntwork-io / terragrunt

Terragrunt is a flexible orchestration tool that allows Infrastructure as Code written in OpenTofu/Terraform to scale.
https://terragrunt.gruntwork.io/
MIT License

Terragrunt destroy hangs forever on component with dependency #2484

Closed tkflexys closed 1 year ago

tkflexys commented 1 year ago

terragrunt version: 0.42.8 (also tested with the latest release and with versions as old as 0.39.1)
terraform version: 1.3.5

Structure

dependency_inputs.hcl
merged_inputs.hcl  
terragrunt.hcl

dependency_inputs.hcl

dependency "owner_account" {
  config_path = "${get_terragrunt_dir()}/../owner_account/"
}
inputs = {
  "sa-email": "${dependency.owner_account.outputs.sa-email}"
}

merged_inputs.hcl

include "dependency_inputs" {
  path =  "./dependency_inputs.hcl"
  merge_strategy = "deep"
}

terragrunt.hcl

locals {
  merged_inputs = read_terragrunt_config("./merged_inputs.hcl")
  state_config = {
    ....omitted
  }
}

# This is how local.merged_inputs is accessed.
# It is not really relevant, as Terragrunt still hangs if I remove this.
inputs = merge(
  local.merged_inputs.inputs,
  ...omitted
)

I ran it with debug logging enabled; these are the last few lines of the output:

DEBU[0002] Found locals block: evaluating the expressions.  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] Did not find any locals block: skipping evaluation.  prefix=[/home/src/terragrunt/terragrunt-platform/generators] 
DEBU[0002] Evaluated 3 locals (remaining 0): input_mapping, input_mapping_array, root_provider_versions  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] [Partial] Included config /home/src/terragrunt/terragrunt-platform/components/owner_account/module.hcl has strategy shallow merge: merging config in (shallow).  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] Found locals block: evaluating the expressions.  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] Evaluated 2 locals (remaining 0): generate_environment_script, gen_env_noop  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] [Partial] Included config /home/src/terragrunt/terragrunt-platform/generators/hooks.hcl has strategy shallow merge: merging config in (shallow).  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 

For context, owner_account does not have any dependencies defined, and I can run terragrunt destroy just fine inside that component folder. The run inside the project folder, which depends on owner_account, is the one that hangs.

As soon as I comment out the line that calls read_terragrunt_config, execution continues, so that call seems to be the cause. I can run other commands such as init, plan, and apply just fine with this setup.

tkflexys commented 1 year ago

Further investigation: if I make the following change, the command doesn't hang, but I am obviously now prompted for the input that the dependency used to provide.

dependency_inputs.hcl

dependency "owner_account" {
  config_path = "${get_terragrunt_dir()}/../owner_account/"
  skip_outputs = true
}
inputs = {
}

DEBU[0012] Running command: terraform destroy            prefix=[/home/src/terragrunt/environments/dev/pentest/components/project] 
var.sa-email

tkflexys commented 1 year ago

last lines of strace where the program hangs

tgkill(1383189, 1386309, SIGURG)        = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1383189, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1383189, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 20873377
futex(0x20383f0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x20382f8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x20382f8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2035aa8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000100948, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1383189, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 824643629056
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1383189, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 824643629056
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1383189, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 824643629056
futex(0xc000100548, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1383189, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 5
futex(0x20383f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x20382f8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x20383c8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x20382f8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2035aa8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2035aa8, FUTEX_WAIT_PRIVATE, 0, NULL

tkflexys commented 1 year ago

strace with the -f flag shows a bit more

[pid 2399902] ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0
[pid 2399902] ioctl(1, TIOCGWINSZ, {ws_row=48, ws_col=123, ws_xpixel=0, ws_ypixel=0}) = 0
[pid 2399902] newfstatat(AT_FDCWD, "/home/src/terragrunt/environments/dev/pentest/components/project/../owner_account/", {st_mode=S_IFDIR|0775, st_size=4096, ...}, 0) = 0
[pid 2399902] newfstatat(AT_FDCWD, "/home/src/terragrunt/environments/dev/pentest/components/owner_account/terragrunt.hcl", {st_mode=S_IFREG|0664, st_size=1888, ...}, 0) = 0
[pid 2399902] newfstatat(AT_FDCWD, "/home/src/terragrunt/environments/dev/pentest/components/owner_account/terragrunt.hcl", {st_mode=S_IFREG|0664, st_size=1888, ...}, 0) = 0
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=979887146} <unfinished ...>
[pid 2399902] <... epoll_pwait resumed>[], 128, 986, NULL, 20485767901306) = 0
[pid 2399886] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399886] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=2, tv_nsec=463901066}) = -1 ETIMEDOUT (Connection timed out)
[pid 2399886] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2399902] <... epoll_pwait resumed>[], 128, 2474, NULL, 20488242055626) = 0
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=988660601}) = -1 ETIMEDOUT (Connection timed out)
[pid 2399886] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2399902] <... epoll_pwait resumed>[], 128, 9996, NULL, 20498242055626) = 0
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=989009073}) = -1 ETIMEDOUT (Connection timed out)
[pid 2399886] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2399902] <... epoll_pwait resumed>[], 128, 9990, NULL, 20508242055626) = 0
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=988959570}) = -1 ETIMEDOUT (Connection timed out)
[pid 2399886] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2399902] <... epoll_pwait resumed>[], 128, 9989, NULL, 20518242055626) = 0
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=989032403}) = -1 ETIMEDOUT (Connection timed out)
[pid 2399886] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2399902] <... epoll_pwait resumed>[], 128, 9989, NULL, 20528242055626) = 0
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=989701179}) = -1 ETIMEDOUT (Connection timed out)
[pid 2399886] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2399902] <... epoll_pwait resumed>[], 128, 9992, NULL, 20538242055626) = 0
[pid 2399902] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2399902] epoll_pwait(4,  <unfinished ...>
[pid 2399886] <... nanosleep resumed>NULL) = 0
[pid 2399886] futex(0x2816fd8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=988813689}^C <unfinished ...>
[pid 2399885] <... futex resumed>)      = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
strace: Process 2399885 detached
strace: Process 2399886 detached
strace: Process 2399887 detached
strace: Process 2399888 detached
strace: Process 2399889 detached
strace: Process 2400305 detached
strace: Process 2399902 detached
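
A long `strace -f` log like the one above is easier to read when reduced to the last line recorded for each pid, since that is where each thread was sitting when the trace was interrupted. The following is a minimal sketch, not part of the original report: `last_line_per_pid` is a hypothetical helper name, and it assumes the trace was saved to a file in the `[pid N] ...` format shown above.

```shell
# last_line_per_pid FILE
# For `strace -f` output, print the final line logged for each pid,
# sorted by pid. Lines without a "[pid N]" prefix are ignored.
last_line_per_pid() {
  awk '
    /^\[pid +[0-9]+\]/ {
      pid = $2                 # e.g. "2399902]"
      sub(/\]$/, "", pid)      # strip the trailing bracket
      last[pid] = $0           # remember the most recent line for this pid
    }
    END {
      for (pid in last) print last[pid]
    }
  ' "$1" | sort -k2,2n
}

# Hypothetical usage, assuming the trace was written to trace.log:
#   last_line_per_pid trace.log
```

Threads whose last line ends in `<unfinished ...>` were blocked inside that system call when the trace stopped.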

tkflexys commented 1 year ago

After running and investigating the code myself, I found the offending line: https://github.com/gruntwork-io/terragrunt/blob/68120e20c750c6761d6f961cf8eae3399ce1d361/config/dependency.go#L435

Nothing is executed after the Lock() call on line 435

tkflexys commented 1 year ago

I did some more debugging: if I disable CheckDependentModules, the program no longer hangs. https://github.com/gruntwork-io/terragrunt/blob/68120e20c750c6761d6f961cf8eae3399ce1d361/cli/cli_app.go#L373

tkflexys commented 1 year ago

After further investigation, I found that the program freezes on this line: https://github.com/gruntwork-io/terragrunt/blob/68120e20c750c6761d6f961cf8eae3399ce1d361/config/locals.go#L172

This happens while Terragrunt checks for dependent modules, and it seems to hang when checking sub-dependencies, i.e. some_module > project > owner_account. It stops while processing some_module.
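
To see which configurations sit above a component in a chain like some_module > project > owner_account, a plain-text search for dependency blocks can help. The sketch below is only an approximation: it greps `.hcl` files for `config_path` lines that mention a name, which is not how Terragrunt itself resolves dependent modules. The function name `find_dependents` is hypothetical, and skipping `.terragrunt-cache` is an assumption about where generated copies live.

```shell
# find_dependents ROOT NAME
# List .hcl files under ROOT that contain a config_path assignment
# mentioning NAME. This is a text search, not an HCL evaluation.
find_dependents() {
  root=$1
  name=$2
  grep -rlE --include='*.hcl' --exclude-dir=.terragrunt-cache \
    "config_path[[:space:]]*=.*${name}" "$root" | sort
}

# Hypothetical usage from the repository root:
#   find_dependents . owner_account
#   find_dependents . project
```

Running it once per level of the chain shows which files would have to be parsed when the dependent-module check walks upward.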

tkflexys commented 1 year ago

I've been trying to debug this all day and found some strange behavior.

At the root of our project we have a .git folder. If I rename it to something else, such as .notgit, the problem almost goes away: the hang then occurs only about once in every 10 runs of terragrunt destroy.

The repo itself isn't very large:

→ du -sh .git
4.5M    .git

→ find .git/ -type f | wc -l
805
→ mv .git .notgit
→ cd /home/src/terragrunt/environments/dev/pentest/components/project

Run 1 Success

○ → terragrunt destroy
WARN[0004] No double-slash (//) found in source URL /module-noop.git. Relative paths in downloaded Terraform code may not work. 
module.......: Refreshing state... [id=.....]

Run 2 Hang

DEBU[0002] [Partial] Included config /home/src/terragrunt/terragrunt-platform/components/owner_account/module.hcl has strategy shallow merge: merging config in (shallow).  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] Found locals block: evaluating the expressions.  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] Evaluated 2 locals (remaining 0): generate_environment_script, gen_env_noop  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 
DEBU[0002] [Partial] Included config /home/src/terragrunt/terragrunt-platform/generators/hooks.hcl has strategy shallow merge: merging config in (shallow).  prefix=[/home/src/terragrunt/environments/dev/pentest/components/owner_account] 

Run 3 Success

Plan: 0 to add, 0 to change, 37 to destroy.

Changes to Outputs:
.....

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: 

Run 4 Hang
Run 5 Success
Run 6 Success
Run 7 Success
Run 8 Success
Run 9 Success
Run 10 Success
Run 11 Success
Run 12 Success
Run 13 Hang
Run 14 Success
Run 15 Success
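
Counting hangs by hand over many runs is tedious, so the tally above could be automated roughly as follows. This is a sketch under stated assumptions, not what was actually run: `count_hangs` is a hypothetical helper, it relies on the GNU coreutils `timeout` command (which exits with status 124 when the time limit is hit), and it redirects stdin from /dev/null on the assumption that a run reaching the interactive confirmation prompt then fails quickly instead of waiting and being miscounted as a hang.

```shell
# count_hangs RUNS SECONDS COMMAND [ARGS...]
# Run COMMAND RUNS times, each limited to SECONDS by `timeout`, and print
# how many runs were stopped by the time limit (exit status 124).
count_hangs() {
  runs=$1
  limit=$2
  shift 2
  hangs=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    i=$((i + 1))
    status=0
    # Assumption: with no stdin, a prompt errors out rather than blocking.
    timeout "$limit" "$@" </dev/null >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
      hangs=$((hangs + 1))
    fi
  done
  echo "$hangs"
}

# Hypothetical usage from the component directory (120 s is an arbitrary limit):
#   count_hangs 15 120 terragrunt destroy --terragrunt-log-level=debug
```

The time limit has to be comfortably longer than a normal run up to the prompt, otherwise slow but healthy runs are counted as hangs.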

tkflexys commented 1 year ago

I've created a project with which I can reproduce the problem 100% of the time: https://github.com/tkflexys/nothing

To get this to work, you have to modify two files to configure a GCS project and bucket. Replace "redacted" with a valid project and bucket in:
https://github.com/tkflexys/nothing/blob/master/environments/dev/pentest/components/owner_account/terragrunt.hcl#L18
https://github.com/tkflexys/nothing/blob/master/environments/dev/pentest/components/project/terragrunt.hcl#L18

cd environments/dev/pentest/components/owner_account
terragrunt init
terragrunt apply
cd ../project
terragrunt init
terragrunt apply
terragrunt destroy --terragrunt-log-level=trace

This will hang, so you can interrupt it after a few minutes.

DEBU[0000] Did not find any locals block: skipping evaluation.  prefix=[/home/src/nothing/environments/dev/pentest/components/owner_account] 
DEBU[0000] [Partial] Included config /home/src/nothing/platform/modules/owner_account_module/module.hcl has strategy shallow merge: merging config in (shallow).  prefix=[/home/src/nothing/environments/dev/pentest/components/owner_account]

from the root of the repo

mv .git .notgit
cd environments/dev/pentest/components/project

This should work now, though it hung for me on the first run:

terragrunt destroy --terragrunt-log-level=trace

denis256 commented 1 year ago

Hello, thanks for the investigation. I will look into a fix.

dod38fr commented 1 year ago

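Here's a workaround:

* dump the inputs of the dependencies with `terragrunt render-json`
* extract the input values: `cat terragrunt_rendered.json | jq '.inputs' > inputs.json`
* copy this file `cp inputs.json .terragrunt-cache/...`
* go where the terraform files are generated: `cd .terragrunt-cache/...`
* run destroy `terraform destroy -var-file=inputs.json`

Hope this helps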

tkflexys commented 1 year ago

Here's a workaround:

* dump the inputs of the dependencies with `terragrunt render-json`

* extract the input values: `cat terragrunt_rendered.json | jq '.inputs' > inputs.json`

* copy this file `cp inputs.json .terragrunt-cache/...`

* go where the terraform files are generated: `cd .terragrunt-cache/...`

* run destroy `terraform destroy -var-file=inputs.json`

Hope this helps
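
The workaround steps can be collected into one small function. This is a sketch only: `destroy_with_rendered_inputs` is a hypothetical name, and because the thread elides the actual path under `.terragrunt-cache`, the generated module directory has to be passed in by the caller rather than guessed.

```shell
# destroy_with_rendered_inputs CACHE_DIR
# CACHE_DIR stands in for the directory under .terragrunt-cache where the
# Terraform files are generated; the real path is not given in the thread.
destroy_with_rendered_inputs() {
  cache_dir=$1
  # Render the resolved configuration to terragrunt_rendered.json.
  terragrunt render-json || return 1
  # Keep only the inputs map so it can serve as a JSON variables file.
  jq '.inputs' terragrunt_rendered.json > inputs.json || return 1
  cp inputs.json "$cache_dir"/ || return 1
  # Call Terraform directly, in a subshell so the caller's directory is kept.
  (cd "$cache_dir" && terraform destroy -var-file=inputs.json)
}

# Hypothetical usage from the component directory:
#   destroy_with_rendered_inputs .terragrunt-cache/<path-to-generated-module>
```

Because Terraform is invoked directly in the last step, Terragrunt's own handling of the destroy command is not involved in that step.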

Thanks a lot that was very helpful.

For anyone else who comes across this: we also had to flatten our list of inputs, and opted to pass them via TF_VAR_ environment variables instead.

# Extract the rendered inputs.
jq '.inputs' terragrunt_rendered.json > "inputs.json"
# Scalar values: shell-quote the value as-is.
jq -r 'to_entries | .[] | select(.value | type != "object" and type != "array") | "TF_VAR_\(.key)=\(.value | @sh)"' "inputs.json" > env-vars
# Objects and arrays: JSON-encode first, then shell-quote.
jq -r 'to_entries | .[] | select(.value | type == "object" or type == "array") | "TF_VAR_\(.key)=\(.value | @json | @sh)"' "inputs.json" >> env-vars

# Export every variable assigned while sourcing env-vars.
set -a
source env-vars
set +a

denis256 commented 1 year ago

Hi, in my tests it started to work with the fix from https://github.com/gruntwork-io/terragrunt/releases/tag/v0.46.1

tkflexys commented 1 year ago

Tested this locally and it works! Thanks for the fix