hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

aws_api_gateway_deployment + Lambda Race Condition HTTP500, 5s pause workaround #17604

Open namachieli opened 3 years ago

namachieli commented 3 years ago

Terraform CLI and Terraform AWS Provider Version

$ terraform -v
Terraform v0.14.6
+ provider registry.terraform.io/hashicorp/aws v3.27.0
+ provider registry.terraform.io/hashicorp/null v3.0.0
+ provider registry.terraform.io/hashicorp/time v0.6.0
+ provider registry.terraform.io/integrations/github v4.4.0

Affected Resource(s)

Terraform Configuration Files

Since funkiness with aws_api_gateway_deployment is well known, I'm just adding the relevant bit.

resource "aws_api_gateway_integration" "POST" {
  cache_key_parameters    = []
  connection_type         = "INTERNET"
  content_handling        = "CONVERT_TO_TEXT"
  http_method             = "POST"
  integration_http_method = "POST"
  passthrough_behavior    = "WHEN_NO_TEMPLATES"
  request_parameters      = {}
  resource_id             = aws_api_gateway_resource.test.id
  rest_api_id             = aws_api_gateway_rest_api.test.id
  timeout_milliseconds    = 29000
  type                    = "AWS"
  uri                     = aws_lambda_function.test.invoke_arn
  request_templates = {
    "application/json" = <<-EOT
     <...>
    EOT
  }
}

<...>

resource "aws_api_gateway_deployment" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id
  triggers = {
    redeployment = sha1(jsonencode([
      aws_api_gateway_resource.test.id,
      aws_api_gateway_method.POST.id,
      aws_api_gateway_method.OPTIONS.id,
      aws_api_gateway_integration.POST.id,
      aws_api_gateway_integration.OPTIONS.id
    ]))
  }
  lifecycle {
    create_before_destroy = true
  }
  depends_on = [
    aws_api_gateway_integration.POST,
    aws_api_gateway_integration.OPTIONS,
    aws_api_gateway_method.POST,
    aws_api_gateway_method.OPTIONS,
    aws_api_gateway_integration_response.POST-200,
    aws_api_gateway_integration_response.OPTIONS-200,
  ]
}

resource "aws_api_gateway_stage" "api" {
  cache_cluster_enabled = false
  deployment_id         = aws_api_gateway_deployment.api.id
  rest_api_id           = aws_api_gateway_rest_api.test.id
  stage_name            = "api"
  xray_tracing_enabled  = false
}

resource "aws_api_gateway_method_settings" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id
  stage_name  = aws_api_gateway_stage.api.stage_name
  method_path = "*/*"
  settings {
    throttling_burst_limit = 5000
    throttling_rate_limit  = 10000
    metrics_enabled        = true
  }
}

Expected Behavior

API Gateway should deploy the stage, and the invoke URL should trigger the backend Lambda and return an HTTP 200 to the client.

Output from workaround

body='{"id":"80...22","token":"aW5...kc5","type":1,"user":{"avatar":"ea...b6","discriminator":"2551","id":"24...93","public_flags":0,"username":"Na..."},"version":1}'

edsig='322...40f'
ts='161...'
invoke_url='https://7...0.execute-api.us-west-2.amazonaws.com/{stage}/{resource}'

curl -i -X POST \
> -H 'accept: */*' \
> -H "Content-Type: application/json" \
> -H "x-signature-ed25519: ${edsig}" \
> -H "x-signature-timestamp: ${ts}" \
> -d ${body} ${invoke_url}
HTTP/2 200
date: Fri, 12 Feb 2021 21:10:38 GMT
content-type: application/json
content-length: 11
x-amzn-requestid: e47...403
x-amz-apigw-id: app...A=
x-amzn-trace-id: Root=1-6...1b;Sampled=0

{"type": 1}

Actual Behavior

Invoking the stage's invoke URL correctly passes the request body to the Lambda, which processes it correctly (as evidenced by CloudWatch logs and Lambda outputs). However, the invoking client receives an HTTP 500.

Output before workaround

body='{"id":"80...22","token":"aW5...kc5","type":1,"user":{"avatar":"ea...b6","discriminator":"2551","id":"24...93","public_flags":0,"username":"Na..."},"version":1}'

edsig='322...40f'
ts='161...'
invoke_url='https://7...0.execute-api.us-west-2.amazonaws.com/{stage}/{resource}'

curl -i -X POST \
> -H 'accept: */*' \
> -H "Content-Type: application/json" \
> -H "x-signature-ed25519: ${edsig}" \
> -H "x-signature-timestamp: ${ts}" \
> -d ${body} ${invoke_url}
HTTP/2 500
date: Fri, 12 Feb 2021 21:10:36 GMT
content-type: application/json
content-length: 36
x-amzn-requestid: be...992
x-amzn-errortype: InternalServerErrorException
x-amz-apigw-id: app...2Q=

{"message": "Internal server error"}

Steps to Reproduce

You can easily toggle between the deployment created by TF and a manual deployment under API > Stages > Deployment History.

Why is this a race condition?

The problem isn't about what Terraform attempts to create, it's WHEN it attempts to create it. By manually deploying after terraform apply, you are doing the same thing TF did, except that every resource has already been fully built and linked internally within AWS.

A workaround is to simply add:

resource "time_sleep" "wait" {
  create_duration = "5s"
  depends_on = [
    aws_api_gateway_integration.xxx,
    aws_api_gateway_method.xxx,
    aws_api_gateway_integration_response.xxx,
  ]
}

resource "aws_api_gateway_deployment" "api" {
<...>
  depends_on = [
    time_sleep.wait
  ]
}

This 5s pause gives something on the AWS backend time to finish being created before the deployment is built. There are likely a lot of other background issues contributing to this, but it's easy to call it a race condition since it's solvable with a pause.
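One caveat with the workaround above: time_sleep only pauses when it is created, so later redeployments would not wait again. The sketch below is one possible way to re-run the pause, assuming the time provider's `triggers` argument on time_sleep and reusing the same hash the deployment already computes. The local name `api_redeployment_hash` is hypothetical, and whether the pause is even needed on later redeployments is not something this issue establishes.

```hcl
# Sketch only: resource names mirror the config above; the triggers wiring
# and the local name are assumptions, not part of the original workaround.
locals {
  # Same hash the deployment already uses for its own triggers.
  api_redeployment_hash = sha1(jsonencode([
    aws_api_gateway_resource.test.id,
    aws_api_gateway_method.POST.id,
    aws_api_gateway_method.OPTIONS.id,
    aws_api_gateway_integration.POST.id,
    aws_api_gateway_integration.OPTIONS.id,
  ]))
}

resource "time_sleep" "wait" {
  create_duration = "5s"

  # A changed trigger value replaces time_sleep, so the pause runs again
  # instead of only on the first apply.
  triggers = {
    redeployment = local.api_redeployment_hash
  }

  depends_on = [
    aws_api_gateway_integration.POST,
    aws_api_gateway_integration.OPTIONS,
    aws_api_gateway_method.POST,
    aws_api_gateway_method.OPTIONS,
    aws_api_gateway_integration_response.POST-200,
    aws_api_gateway_integration_response.OPTIONS-200,
  ]
}

resource "aws_api_gateway_deployment" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id

  triggers = {
    redeployment = local.api_redeployment_hash
  }

  lifecycle {
    create_before_destroy = true
  }

  # The deployment is only built after the pause has elapsed.
  depends_on = [time_sleep.wait]
}
```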

I also tried moving the dependency logic so that the deployment happens normally and the time delay gates the aws_api_gateway_stage resource instead. This always results in the race condition failure, so I strongly believe it's tied to aws_api_gateway_deployment.
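For reference, here is a rough sketch of one reading of that stage-gated variant, the one that still fails. The exact wiring used in that experiment isn't shown in this issue, so making time_sleep depend on the deployment is an assumption.

```hcl
# Sketch of the variant that did NOT fix the problem (wiring is assumed).

# The deployment is created without waiting on the pause.
resource "aws_api_gateway_deployment" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id
  # triggers, lifecycle and the original depends_on list stay as in the
  # configuration at the top of this issue.
}

# The pause now runs after the deployment rather than before it.
resource "time_sleep" "wait" {
  create_duration = "5s"
  depends_on      = [aws_api_gateway_deployment.api]
}

# Only the stage is held back by the pause.
resource "aws_api_gateway_stage" "api" {
  cache_cluster_enabled = false
  deployment_id         = aws_api_gateway_deployment.api.id
  rest_api_id           = aws_api_gateway_rest_api.test.id
  stage_name            = "api"
  xray_tracing_enabled  = false

  depends_on = [time_sleep.wait]
}
```

Since delaying only the stage does not help, the timing-sensitive step appears to be the creation of the deployment itself.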

References

This Lambda and the effective TF config are based on the POC from https://oozio.medium.com/serverless-discord-bot-55f95f26f743.

I am willing to provide a sanitized complete TF if required.

github-actions[bot] commented 10 months ago

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

namachieli commented 9 months ago

Unless this has been solved, I think this issue should stay open.