gruntwork-io / terragrunt

Terragrunt is a flexible orchestration tool that allows Infrastructure as Code written in OpenTofu/Terraform to scale.
https://terragrunt.gruntwork.io/
MIT License
8.01k stars 971 forks source link

interleaving in `output-all -json` results in malformed JSON output #458

Open mrtyler opened 6 years ago

mrtyler commented 6 years ago

I've been having intermittent failures in my CI pipeline, wherein terragrunt output-all -json produces invalid JSON.

I'm using terragrunt 0.14.0 and terraform 0.11.3. Both are a little out-of-date but I don't see anything in the release notes for either project discussing this type of problem.

This is the normal result of output-all in a directory containing two simple terraform modules for AWS resources, one for a Route 53 zone, the other for some EBS volumes:

{
  "cluster_dns_zone_id": {
    "sensitive": false,
    "type": "string",
    "value": "Z3VJB5SSD3OAIA"
  },
  "couchdb_us-east-2a_volume_id": {
    "sensitive": false,
    "type": "string",
    "value": "vol-0ff1954f00be78317"
  },
  "couchdb_us-east-2b_volume_id": {
    "sensitive": false,
    "type": "string",
    "value": "vol-086760a15bb8721b9"
  },
  "couchdb_us-east-2c_volume_id": {
    "sensitive": false,
    "type": "string",
    "value": "vol-061504ced17c2063a"
  },
  "prometheus_us-east-2b_volume_id": {
    "sensitive": false,
    "type": "string",
    "value": "vol-00f0b7af4b1f5b2ad"
  },
  "prometheus_us-east-2c_volume_id": {
    "sensitive": false,
    "type": "string",
    "value": "vol-0c3f62cf34df57f00"
  }
}

Yet, about 5% of the time, the output looks something like this instead:

{
    "couchdb_us-east-2a_volume_id": {
        "sensitive": false,
        "type": "string",
        "value": "vol-0ff1954f00be78317"
    },
    "couchdb_us-east-2b_volume_id": {
        "sensitive": false,
{
        "type": "string",
    "cluster_dns_zone_id": {
        "value": "vol-086760a15bb8721b9"
        "sensitive": false,
    },
        "type": "string",
    "couchdb_us-east-2c_volume_id": {
        "value": "Z3VJB5SSD3OAIA"
    }
        "sensitive": false,
}
        "type": "string",
        "value": "vol-061504ced17c2063a"
    },
    "prometheus_us-east-2b_volume_id": {

I'm able to reproduce the problem (~5% of the time, i.e. after ~20 iterations) with this dumb one-liner:

count=1 ; rc=0 ; while [ "$rc" == "0" ] ; do output=$(cd ../prereqs/stg/ && TMPDIR=/tmp/tyler/rake-tmp/stg-prereqs/ terragrunt output-all --terragrunt-non-interactive -json --terragrunt-ignore-dependency-errors | tee tee.json | jq -s add) ; rc=$? ; echo "FINISHED - COUNT = $count RC = $rc OUTPUT = $output" ; count=$((count + 1)) ; done

Kinda ugly but I wanted to include the whole thing in case I'm doing something that is causing shell IO to freak out and trigger this problem. (The real problem happens in a slightly more complex environment, where terragrunt is invoked in ruby code via backticks.)

I realize this is a tricky type of problem to pin down and I haven't drilled down to a minimum repro case, but I thought I'd start with a bug report :).

brikis98 commented 6 years ago

I think this is pretty straight forward actually: Terragrunt does not make any guarantees about the order of the output for xxx-all commands. It runs terraform output in parallel across all your modules, so the output will definitely end up interweaving from time to time. Doing things in parallel makes perfect sense for apply-all, but perhaps not for other commands.

There are a few options:

  1. Save log output from xxx-all commands to file (#74).
  2. Reformat the output so you get all the output for one module at a time (#78, ish). The downside is that you wouldn't see the output from modules that complete early for a long time.
  3. Run output-all and perhaps plan-all serially, rather than in parallel.
mrtyler commented 6 years ago

Thanks for the explanation, Yevgeniy.

Doing things in parallel makes perfect sense for apply-all, but perhaps not for other commands.

It definitely violates my expectations to say "output-all -json may return something that is not valid JSON" :).

I think your proposed solutions are all reasonable. 3 seems the simplest, but may be slow in practice for users with a lot of modules. 1 feels a little clunky; instead of terragrunt output-all | whatever I have to go hunting for log files, though the file-based approached has other advantages as noted in #74.

For my use case, 2 looks best and the downside doesn't hurt much because I'm sending all the output to something that processes JSON anyway.

FWIW I'm working around this problem by re-trying when the result of output-all is invalid. I don't know how well this approach will scale, but it works for me for now.