gruntwork-io / terragrunt

Terragrunt is a flexible orchestration tool that allows Infrastructure as Code written in OpenTofu/Terraform to scale.
https://terragrunt.gruntwork.io/
MIT License
8.09k stars 981 forks

Slow Network Performance with S3 Remote State when in Docker container on IAM Role-attached host #3458

Closed ngearhart closed 1 month ago

ngearhart commented 1 month ago

## Describe the bug

When Terragrunt is configured with S3 Remote State, it needs to authenticate to AWS S3 directly (not via the underlying terraform). On an EC2 instance that has an IAM Role attached (not access keys), Terragrunt obtains credentials from the EC2 Metadata API via the underlying AWS Go SDK. Inside a Docker container on such a host, this results in very poor performance during the remote state initialization process. On AWS GovCloud us-gov-west-1, the remote state initialization takes >10 seconds in a Docker container, whereas it takes <1 second natively.

## Steps To Reproduce

  1. Use S3 Remote State.

    remote_state {
      backend = "s3"

      generate = {
        path      = "backend.tf"
        if_exists = "overwrite"
      }

      config = {
        encrypt = true
        key     = format("data/%s/terraform.tfstate", path_relative_to_include())
        bucket  = ...
        region  = ...
        skip_bucket_public_access_blocking = true
        dynamodb_table = ...
      }
    }
  2. Create a Docker image with Terragrunt.

    FROM alpine:3.20.1 AS builder

    # Install curl and the AWS CLI
    RUN apk add --no-cache curl aws-cli

    # Define the OpenTofu and Terragrunt versions to download
    ARG TOFU_VERSION=1.8.3
    ARG TERRAGRUNT_VERSION=0.67.16

    # Download Tofu
    RUN curl -LO https://github.com/opentofu/opentofu/releases/download/v${TOFU_VERSION}/tofu_${TOFU_VERSION}_amd64.apk && \
        mv tofu_${TOFU_VERSION}_amd64.apk /usr/local/bin/tofu.apk && \
        apk add --allow-untrusted /usr/local/bin/tofu.apk

    # Download Terragrunt
    RUN curl -LO https://github.com/gruntwork-io/terragrunt/releases/download/v${TERRAGRUNT_VERSION}/terragrunt_linux_amd64 && \
        mv terragrunt_linux_amd64 /usr/local/bin/terragrunt

    # Make tofu and terragrunt executable
    RUN chmod +x /usr/bin/tofu ; chmod +x /usr/local/bin/terragrunt

    # Environment variables
    ENV TERRAGRUNT_TFPATH="tofu"
    ENV TERRAGRUNT_NON_INTERACTIVE="false"
    ENV TERRAGRUNT_PROVIDER_CACHE=0
    ENV TERRAGRUNT_PARALLELISM=1

    # Set default entrypoint to bash
    ENTRYPOINT ["/bin/bash"]

  3. Create an EC2 instance with an IAM role attached that has the necessary permissions.
  4. Start a container from the image on the EC2 instance with `docker run -it ... bash`.
  5. Inside the Docker container, run `terragrunt init` (or `terragrunt plan`, `terragrunt apply`, or any other command that uses remote state).
  6. Notice that it takes significant time before the underlying `terraform init` runs.

This "significant time" is at least 10x as long as it would be outside the Docker container. In fact, in a certain environment I operate in, it is 4-6 minutes, which is unbearably long for each Terragrunt operation. I can provide more details about this environment privately.

## Expected behavior

The command takes up to a few seconds before actually running the underlying terraform command.

## Logs

Here is an example of debug logs (sanitized for privacy).

    $ terragrunt init --terragrunt-log-level debug --terragrunt-debug
    21:22:45.713 DEBUG Terragrunt Version: 0.67.1
    21:22:45.725 DEBUG Did not find any locals block: skipping evaluation.
    21:22:45.731 DEBUG Found locals block: evaluating the expressions.
    21:22:45.741 DEBUG Evaluated 2 locals (remaining 0): env, terraform_cache_dir
    ... env logs ...
    21:22:49.344 DEBUG Running command: tofu --version
    21:22:49.420 DEBUG tofu version: 1.8.1
    21:22:49.420 DEBUG Reading Terragrunt config file at terragrunt.hcl
    21:22:49.421 DEBUG Did not find any locals block: skipping evaluation.
    21:22:49.424 DEBUG Found locals block: evaluating the expressions.
    21:22:49.431 DEBUG Evaluated 2 locals (remaining 0): env, terraform_cache_dir
    ... env logs ...
    21:22:49.464 DEBUG Getting output of dependency .. for config terragrunt.hcl
    ... dependency logs ...
    21:23:06.924 DEBUG Found locals block: evaluating the expressions.
    21:23:06.931 DEBUG Evaluated 2 locals (remaining 0): env, terraform_cache_dir
    21:23:06.936 DEBUG Found locals block: evaluating the expressions.
    21:23:06.937 DEBUG Evaluated 2 locals (remaining 0): env, terraform_cache_dir
    21:23:06.940 DEBUG Included config ../../../terragrunt.hcl has strategy shallow merge: merging config in (shallow).
    21:23:06.947 DEBUG Found locals block: evaluating the expressions.
    21:23:06.949 DEBUG Evaluated 1 locals (remaining 0): env
    21:23:06.953 DEBUG Found locals block: evaluating the expressions.
    21:23:06.961 DEBUG Evaluated 1 locals (remaining 0): env
    21:23:06.970 DEBUG Included config ../../../_env/emr.hcl has strategy shallow merge: merging config in (shallow).
    21:23:06.970 DEBUG Detected 1 Hooks
    21:23:06.970 INFO Downloading Terraform configurations from ...
    21:23:07.022 DEBUG Detected 1 Hooks
    21:23:07.024 DEBUG Copying files from...
    21:23:07.027 DEBUG Setting working directory to ...
    21:23:07.028 DEBUG Generated file .terragrunt-cache/w_zPDJwXr8fxnrUd-w10tIHl8HM/Xz4P-Jhavj4obcO3eEDRzJIDlyI/providers.tf.
    21:23:07.028 DEBUG Generated file .terragrunt-cache/w_zPDJwXr8fxnrUd-w10tIHl8HM/Xz4P-Jhavj4obcO3eEDRzJIDlyI/backend.tf.
    21:23:07.028 INFO Debug mode requested: generating debug file terragrunt-debug.tfvars.json in working dir ...
    21:23:07.071 DEBUG The following variables were detected in the terraform module:
    21:23:07.071 DEBUG [...]
    21:23:07.071 DEBUG WARN: The variable ssl_certificate was omitted because it is not defined in the terraform module.
    21:23:07.071 DEBUG WARN: The variable immtua_endpoint was omitted because it is not defined in the terraform module.
    21:23:07.071 DEBUG WARN: The variable custom_logging_filename was omitted because it is not defined in the terraform module.
    21:23:07.071 DEBUG WARN: The variable cert_private_key was omitted because it is not defined in the terraform module.
    21:23:07.071 DEBUG Variables passed to terraform are located in "sanitized"
    21:23:07.071 DEBUG Run this command to replicate how terraform was invoked:
    21:23:07.071 DEBUG terraform -chdir="sanitized" init -var-file="sanitized"
    21:23:07.072 DEBUG Initializing remote state for the s3 backend
    21:23:13.330 DEBUG Verifying AWS S3 Bucket Versioning
    21:23:13.337 DEBUG Checking if SSE is enabled for AWS S3 bucket
    21:23:13.358 DEBUG Checking if bucket is have root access
    21:23:13.366 DEBUG Policy for RootAccess already exists for bucket
    21:23:13.366 DEBUG Checking if bucket is enforced with TLS
    21:23:13.374 DEBUG Policy for EnforcedTLS already exists for bucket
    21:23:13.374 DEBUG S3 bucket is already up to date
    21:23:13.374 DEBUG Verifying AWS S3 Bucket Versioning
    21:23:19.665 DEBUG Running command: tofu init
    21:23:19.750 STDOUT tofu: Initializing the backend...
    21:23:23.378 STDOUT tofu:
    21:23:23.378 STDOUT tofu: Successfully configured the backend "s3"! OpenTofu will automatically
    21:23:23.378 STDOUT tofu: use this backend unless the backend configuration changes.
    21:23:23.467 STDOUT tofu: Initializing provider plugins...
    21:23:23.468 STDOUT tofu: - Finding hashicorp/random versions matching "3.5.1"...
    21:23:23.470 STDOUT tofu: - Finding hashicorp/null versions matching "3.2.1"...
    ... other providers ...
    21:23:31.429 STDOUT tofu:
    21:23:31.429 STDOUT tofu: OpenTofu has been successfully initialized!
    21:23:31.429 STDOUT tofu:
    21:23:31.429 STDOUT tofu: You may now begin working with OpenTofu. Try running "tofu plan" to see
    21:23:31.429 STDOUT tofu: any changes that are required for your infrastructure. All OpenTofu commands
    21:23:31.429 STDOUT tofu: should now work.
    21:23:31.429 STDOUT tofu: If you ever set or change modules or backend configuration for OpenTofu,
    21:23:31.429 STDOUT tofu: rerun this command to reinitialize your working directory. If you forget, other
    21:23:31.429 STDOUT tofu: commands will detect it and remind you to do so if necessary.


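Notice the gap between "Initializing remote state for the s3 backend" (21:23:07.072) and the next line (21:23:13.330): about 6 seconds. A similar gap of about 6 seconds follows the second "Verifying AWS S3 Bucket Versioning" line (21:23:13.374) before `tofu init` runs (21:23:19.665). That does not seem that bad on its own, but it is much worse than outside of the Docker container.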
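
To spot such pauses without reading the log by eye, a small filter can compare consecutive timestamps. This is only a sketch: `find_log_gaps` is a hypothetical helper (not part of Terragrunt), it assumes each log line starts with an `HH:MM:SS.mmm` timestamp as in the excerpt above, and it does not handle runs that cross midnight.

```shell
#!/bin/sh
# find_log_gaps: hypothetical helper, not part of Terragrunt.
# Reads log lines on stdin and prints every line that arrives more than
# THRESHOLD seconds (first argument, default 5) after the previous
# timestamped line. Lines without a leading timestamp are ignored.
find_log_gaps() {
    threshold="${1:-5}"
    awk -v threshold="$threshold" '
        # Only consider lines whose first field looks like HH:MM:SS.mmm
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]+$/ {
            split($1, t, ":")
            # Convert the timestamp to seconds since midnight
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (seen && now - prev > threshold) {
                printf "%.3fs gap before: %s\n", now - prev, $0
            }
            prev = now
            seen = 1
        }
    '
}
```

Assuming the debug output has been saved to a file (the name `terragrunt-debug.log` is a placeholder), it could be used as `find_log_gaps 5 < terragrunt-debug.log`.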

## Versions

- Terragrunt version: 0.67.16
- OpenTofu version: 1.8.3
- Environment details: AWS EC2 instance with IAM role attached, inside Docker container

## Workaround

I found a workaround: run the Docker container with host networking (`docker run --network host -it ... bash`).

## Additional context

I believe this is related to the AWS SDK calling the Instance Metadata service. When I run `netstat`, I see tons of calls to the `.internal` DNS name for the Instance Metadata service (169.254.169.254). My theory is that something is funny with the networking and it leads to slowness but not timeouts/errors.
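
One way to check this theory is to time a single request to the Instance Metadata service from inside the container and again on the host (or with `--network host`). The sketch below assumes `curl` is available (the Dockerfile above installs it) and times an IMDSv2 session-token request. `imds_token_time` is a hypothetical helper name, and the 10-second cap is an arbitrary choice.

```shell
#!/bin/sh
# imds_token_time: hypothetical helper for timing one IMDSv2 token request.
# Arguments (both optional):
#   $1  base URL of the metadata service (default: http://169.254.169.254)
#   $2  maximum number of seconds to wait (default: 10)
# Prints curl's total request time in seconds. The time is printed even if
# the request fails or times out; the exit status is curl's exit status.
imds_token_time() {
    endpoint="${1:-http://169.254.169.254}"
    max_time="${2:-10}"
    curl -s -o /dev/null -w '%{time_total}\n' \
        -X PUT \
        -H 'X-aws-ec2-metadata-token-ttl-seconds: 60' \
        --max-time "$max_time" \
        "$endpoint/latest/api/token"
}
```

Running `imds_token_time` in both places and comparing the two numbers would show whether the metadata call itself is slow in the container, independent of Terragrunt.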

Admittedly, this might be a problem with the underlying AWS Golang SDK, but I think that is unlikely.

yhakbar commented 1 month ago

Hey @ngearhart ,

I believe that reaching out to instance metadata is one of the first steps in all AWS SDK implementations.

I think a more direct fix for your issue is to take advantage of the `disable_bucket_update = true` configuration, which will prevent all attempts to update your S3 + DynamoDB backend, avoiding the attempt to authenticate with AWS at all.
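
As a rough illustration of that suggestion, the snippet below writes the `remote_state` block from the report with `disable_bucket_update = true` added inside `config`. It is a sketch under assumptions: `write_remote_state_sketch`, the default output file name, and the bucket, region, and table values are all hypothetical placeholders (the report elides the real ones).

```shell
#!/bin/sh
# write_remote_state_sketch: hypothetical helper that writes an example
# remote_state block to the file given as $1 (default: remote_state.sketch.hcl).
# All bucket/region/table values below are placeholders, not from the report.
write_remote_state_sketch() {
    out="${1:-remote_state.sketch.hcl}"
    cat > "$out" <<'EOF'
remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }

  config = {
    encrypt        = true
    key            = format("data/%s/terraform.tfstate", path_relative_to_include())
    bucket         = "example-state-bucket" # placeholder
    region         = "example-region"       # placeholder
    dynamodb_table = "example-lock-table"   # placeholder

    skip_bucket_public_access_blocking = true

    # Setting suggested in the comment above: do not update the bucket.
    disable_bucket_update = true
  }
}
EOF
}
```

The generated block would then be merged by hand into the real `terragrunt.hcl`, keeping the existing bucket, region, and table values.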

Long term, the CLI shouldn't attempt to automatically make any adjustments to backend resources without explicit opt-in. I've shared a proposal to address that here: #3445

Closing this issue, as it's not really something that can be addressed with a change to how Terragrunt works.

ngearhart commented 1 month ago

@yhakbar Understood. Thanks for walking me through that! I'm comfortable with closing this too, and happy to have this record so if anyone else runs into this, they know the workaround and context. Have a great day!