EFS file system creation hangs indefinitely when the service is not available in the selected region

duckie commented 6 years ago

Hi

When trying to create an EFS filesystem in a region where the service is not available, Terraform waits indefinitely for the resource to create, though I know from the AWS Console it cannot succeed.

Terraform Version

Terraform v0.11.2-dev

Affected Resource(s)

aws_efs_file_system

Terraform Configuration Files

provider "aws" {
  region = "us-west-1"
}
resource "aws_efs_file_system" "my-efs" {
}

Expected Behavior

Exit with an error stating that the file system cannot be created.

Actual Behavior

Loops indefinitely on:

aws_efs_file_system.my-efs: Still creating... (XmXXs elapsed)

Steps to Reproduce

Apply the given configuration.

apparentlymart commented 6 years ago

Hi @duckie!

I expect what's going on here is that there's some retry behavior attempting to work around an "eventual consistency" behavior in the API, and it's being too liberal in which error codes it considers retryable.

Unfortunately this isn't always fixable: AWS APIs don't always return error codes specific enough for us to distinguish cases where we must retry from cases where we shouldn't. If there is a way to tell them apart, though, we can hopefully update the retry handling code to make that distinction and return a real error, as you suggest.
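As a rough illustration of the distinction described above, here is a minimal Go sketch of a retry loop that stops as soon as it sees an error it does not consider retryable. It is not the provider's actual retry code; the error values `errNotYetConsistent` and `errUnsupportedRegion` and the helpers `isRetryable` and `retry` are hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical error values standing in for API error codes. They are
// illustrative only and are not names from the AWS SDK or the provider.
var (
	errNotYetConsistent  = errors.New("resource not yet visible")       // transient
	errUnsupportedRegion = errors.New("service not offered in region") // permanent
)

// isRetryable reports whether err belongs to the (deliberately narrow)
// set of errors that are worth retrying.
func isRetryable(err error) bool {
	return errors.Is(err, errNotYetConsistent)
}

// retry calls op up to maxAttempts times. It stops at the first success or
// at the first non-retryable error, and returns the number of attempts made
// together with the last error.
func retry(maxAttempts int, delay time.Duration, op func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if err == nil || !isRetryable(err) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return maxAttempts, err
}

func main() {
	// A permanent error is returned after a single attempt instead of
	// being retried until the attempt budget runs out.
	attempts, err := retry(5, 10*time.Millisecond, func() error {
		return errUnsupportedRegion
	})
	fmt.Println(attempts, err)
}
```

The hard part in practice is the one named above: this only works when the API returns an error that can be classified reliably.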

bflad commented 5 years ago

Hi @duckie! 👋 Sorry for the delayed answer here. It turns out that a few releases ago we improved the default behavior to retry fewer times when the provider detects a more "permanent" network/DNS issue.

The problem here is that some error conditions surface more quickly than others. It seems AWS is publishing DNS records even for regions where the service is not active. When the AWS Go SDK reaches out in this scenario, it waits for a full 30-second I/O timeout instead of getting a quick DNS failure.
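To make the difference between these failure modes concrete, here is a small Go sketch that classifies a network error as a DNS failure, a timeout, or a refused connection using only the standard library. It is an illustration under assumptions, not the detection logic the provider or the SDK actually uses; the `classify` function, its labels, and the sample errors are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// classify returns an illustrative label for the kind of network failure.
// A DNS failure is typically reported almost immediately, while a dial
// timeout is only reported once the full timeout has elapsed.
func classify(err error) string {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return "dns failure"
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "timeout"
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "connection refused"
	}
	return "other"
}

// Sample errors built by hand so the sketch runs without network access.
var (
	sampleDNSErr     error = &net.DNSError{Err: "no such host", Name: "example.invalid", IsNotFound: true}
	sampleTimeoutErr error = &net.OpError{Op: "dial", Net: "tcp", Err: os.ErrDeadlineExceeded}
	sampleRefusedErr error = &net.OpError{Op: "dial", Net: "tcp", Err: os.NewSyscallError("connect", syscall.ECONNREFUSED)}
)

func main() {
	for _, err := range []error{sampleDNSErr, sampleTimeoutErr, sampleRefusedErr} {
		fmt.Printf("%s: %v\n", classify(err), err)
	}
}
```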

Using this currently incorrect region for EFS, the behavior is now reduced to roughly 6 minutes with a default provider configuration (much improved from the likely hour or so it would have taken before):

terraform {
  required_version = "0.11.10"
}

provider "aws" {
  region  = "ca-central-1"
  version = "1.42.0"
}

resource "aws_efs_file_system" "my-efs" {}

(With debug logging enabled)

aws_efs_file_system.my-efs: Still creating... (5m30s elapsed)
aws_efs_file_system.my-efs: Still creating... (5m40s elapsed)
aws_efs_file_system.my-efs: Still creating... (5m50s elapsed)
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: 2018/11/06 19:56:25 [DEBUG] [aws-sdk-go] DEBUG: Send Request elasticfilesystem/CreateFileSystem failed, will retry, error RequestError: send request failed
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: caused by: Post https://elasticfilesystem.ca-central-1.amazonaws.com/2015-02-01/file-systems: dial tcp 92.242.140.21:443: i/o timeout
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: 2018/11/06 19:56:25 [DEBUG] [aws-sdk-go] DEBUG: Retrying Request elasticfilesystem/CreateFileSystem, attempt 10
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: 2018/11/06 19:56:25 [DEBUG] [aws-sdk-go] DEBUG: Request elasticfilesystem/CreateFileSystem Details:
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: ---[ REQUEST POST-SIGN ]-----------------------------
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: POST /2015-02-01/file-systems HTTP/1.1
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: Host: elasticfilesystem.ca-central-1.amazonaws.com
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: User-Agent: aws-sdk-go/1.15.64 (go1.11.1; darwin; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.9-beta1
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: Content-Length: 84
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: Authorization: AWS4-HMAC-SHA256 Credential=AKIAIXP7QHO656Q4VL7Q/20181107/ca-central-1/elasticfilesystem/aws4_request, SignedHeaders=content-length;host;x-amz-date, Signature=2e3a17663ab5cf78a521c8b935cc9c77fcc8361d8dde4d513ea7e403a4c9e00c
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: X-Amz-Date: 20181107T005625Z
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: Accept-Encoding: gzip
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4:
2018-11-06T19:56:25.791-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: -----------------------------------------------------
aws_efs_file_system.my-efs: Still creating... (6m0s elapsed)
2018-11-06T19:56:36.940-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: 2018/11/06 19:56:36 [WARN] Disabling retries after next request due to networking issue
2018-11-06T19:56:36.940-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: 2018/11/06 19:56:36 [DEBUG] [aws-sdk-go] DEBUG: Send Request elasticfilesystem/CreateFileSystem failed, not retrying, error RequestError: send request failed
2018-11-06T19:56:36.940-0500 [DEBUG] plugin.terraform-provider-aws_v1.42.0_x4: caused by: Post https://elasticfilesystem.ca-central-1.amazonaws.com/2015-02-01/file-systems: dial tcp 92.242.140.21:443: connect: connection refused

The [WARN] Disabling retries after next request due to networking issue log line is our networking detection kicking in to turn off retries early.

It's important to note that this behavior is also tunable via the max_retries argument for the provider, if desired or acceptable in your environment. For example:

provider "aws" {
  max_retries = 2
  region      = "ca-central-1"
  version     = "1.42.0"
}

Yields the much faster:

aws_efs_file_system.my-efs: Still creating... (1m20s elapsed)
aws_efs_file_system.my-efs: Still creating... (1m30s elapsed)

Error: Error applying plan:

1 error(s) occurred:

* aws_efs_file_system.my-efs: 1 error(s) occurred:

* aws_efs_file_system.my-efs: Error creating EFS file system: RequestError: send request failed
caused by: Post https://elasticfilesystem.ca-central-1.amazonaws.com/2015-02-01/file-systems: dial tcp 92.242.140.21:443: i/o timeout

Hope this helps! If there is some specific case we're missing here, please reach out.

duckie commented 5 years ago

Hi there

Thank you for the answer. It is rather dubious of AWS to actually expose non-working endpoints, but I get why this is a problem on the Terraform side.

Boto manages this by maintaining a list of endpoints and the features they implement, and a specific configuration key must be set to "force" it to try anyway (useful when Boto has not yet registered an endpoint that is actually available). I can see why you don't want to maintain such a reference. Or maybe you could use the one published by Boto and make it an opt-in option? This is just an idea. This reference is a JSON file.
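A minimal Go sketch of that idea, assuming a simplified reference that maps each service name to the regions where it is listed. The JSON layout, the sample data, and the names `sampleReference`, `loadReference` and `checkEndpoint` are all made up for illustration; this is not the schema of Boto's file, and the `force` flag only mimics the "try anyway" override described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleReference is made-up illustrative data. It is NOT the contents or
// the schema of the endpoint reference published by Boto.
const sampleReference = `{"elasticfilesystem": ["eu-west-1", "us-east-1"]}`

// regionsByService maps a service name to the regions where it is listed.
type regionsByService map[string][]string

// loadReference parses the simplified JSON reference.
func loadReference(data []byte) (regionsByService, error) {
	var ref regionsByService
	if err := json.Unmarshal(data, &ref); err != nil {
		return nil, fmt.Errorf("parsing endpoint reference: %w", err)
	}
	return ref, nil
}

// checkEndpoint returns an error when service is not listed for region.
// Setting force skips the check, for the case where the reference is stale.
func checkEndpoint(ref regionsByService, service, region string, force bool) error {
	if force {
		return nil
	}
	for _, r := range ref[service] {
		if r == region {
			return nil
		}
	}
	return fmt.Errorf("%s is not listed as available in %s", service, region)
}

func main() {
	ref, err := loadReference([]byte(sampleReference))
	if err != nil {
		panic(err)
	}
	// Fails immediately instead of waiting on network timeouts.
	fmt.Println(checkEndpoint(ref, "elasticfilesystem", "ca-central-1", false))
	// The opt-in override lets the request go ahead anyway.
	fmt.Println(checkEndpoint(ref, "elasticfilesystem", "ca-central-1", true))
}
```

The trade-off is the one already raised: a check like this fails fast, but it is only as accurate as the reference it reads, which is why the override matters.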

Yes, 6 minutes is definitely an improvement.

hashicorp / terraform-provider-aws