hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: Lambda does not take into account possible IAM eventual consistency #29828

Open ascopes opened 1 year ago

ascopes commented 1 year ago

Terraform Core Version

1.3.7, 1.5.4, 1.5.6

AWS Provider Version

4.55.0, 5.1.0, 5.11.0, 5.13.1, 5.x

Affected Resource(s)

aws_lambda_function

Expected Behavior

When I create a lambda, it should successfully complete.

Actual Behavior

Sometimes IAM has not fully propagated the permissions for the Lambda's execution role, which results in a transient permission error. Rerunning the apply may fix the issue, but sometimes the state becomes corrupted and the next run immediately fails with a resource conflict because the Lambda already exists.

Relevant Error/Panic Output Snippet

[2023-03-07T03:15:43.355Z] aws_iam_role.stub_lambda_role: Creating...
[2023-03-07T03:15:43.355Z] aws_iam_role.stub_lambda_role: Creation complete after 0s [id=...]
[2023-03-07T03:15:43.355Z] aws_iam_role_policy_attachment.stub_lamba_exec_role_eni: Creating...
[2023-03-07T03:15:44.522Z] aws_iam_role_policy_attachment.stub_lamba_exec_role_eni: Creation complete after 1s [id=...]
[2023-03-07T03:15:49.646Z] aws_lambda_function.stub_lambda: Creating...
[2023-03-07T03:15:59.885Z] aws_lambda_function.stub_lambda: Still creating... [10s elapsed]
[2023-03-07T03:16:19.480Z] │ Error: creating Lambda Function (...): waiting for completion: unexpected state 'Failed', wanted target 'Active'. last error: InsufficientRolePermissions: The function's execution role doesn't have permission to perform this operation.
[2023-03-07T03:16:19.480Z] │ 
[2023-03-07T03:16:19.480Z] │   with aws_lambda_function.stub_lambda,
[2023-03-07T03:16:19.480Z] │   on lambda.tf line 123, in resource "aws_lambda_function" "stub_lambda":
[2023-03-07T03:16:19.480Z] │    123: resource "aws_lambda_function" "stub_lambda" {
[2023-03-07T03:16:19.480Z] │

Terraform Configuration Files

I've had to omit some information regarding the nature of what this is used for under company policy, but I don't believe this issue is related to config on our side (unless I am missing some timeout setting somewhere), so hopefully this isn't too important.

resource "aws_iam_role" "stub_lambda_role" {
  name               = "StubLambdaRole"
  assume_role_policy = jsonencode({
    Version   = "2012-10-17",
    Statement = [
      {
        Action    = "sts:AssumeRole",
        Principal = {
          "Service" = ["lambda.amazonaws.com"]
        },
        Effect = "Allow"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "stub_lambda_exec_role_eni" {
  role       = aws_iam_role.stub_lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_lambda_permission" "stub_lambda" {
  statement_id  = "AllowExecutionFromAPIGateway"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.stub_lambda.function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${...}/*/*"
}

resource "aws_lambda_function" "stub_lambda" {
  function_name = ...
  description   = ...
  role          = aws_iam_role.stub_lambda_role.arn
  handler       = ...
  runtime       = ...
  s3_bucket     = ...
  s3_key        = ...
  timeout       = ...
  memory_size   = ...
  layers        = [...]
  publish       = true

  vpc_config {
    subnet_ids         = ...
    security_group_ids = ...
  }

  replace_security_groups_on_destroy = true
}

There is honestly nothing special about this. It is just a regular Lambda with a regular IAM role and policy. This works 99.9% of the time, but occasionally we see errors like this that are immediately fixed by rerunning the Terraform apply jobs.

Region is eu-west-1 (Ireland), if that matters. I am unable to try this out in other regions, but given how rarely this occurs, it isn't easy to reproduce anyway. My guess is that a timeout needs adjusting somewhere, or that something needs retrying.

Steps to Reproduce

Not entirely sure, this is totally random.

My guess is the Lambda is waiting for a permission to become available and is only waiting for a small amount of time for something to become available in IAM.

https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html specifies that IAM is eventually consistent, so this may well be related to that.

As a vague guess, I think it is something occurring in these lines of code: https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/lambda/function.go#L542-L548

Debug Output

Unavailable

Panic Output

Unavailable

Important Factoids

No response

References

No response

Would you like to implement a fix?

No

bergingwer07 commented 1 year ago

Hi, we have the same problem. I would agree with @ascopes that this error occurs completely randomly. Rebuilding the apply job without any changes fixes the error. Unfortunately, it is very annoying for us as we have to rebuild our development environment every day and manually trigger the job again at least once a week.

TF core version: 1.3.9
TF provider version: v5.0.1
AWS region: eu-west-1

Are there any updates on this?

ascopes commented 1 year ago

Just seen this bug again. We have a built-in retry mechanism in our CI that deals with a bug with ElastiCache in the AWS provider not dealing with eventual consistency correctly during tagging... turns out that the state does not get modified correctly after this failure, leading to a resource conflict on the next retry.

Run 1:

│Error: creating Lambda Function (xxx-lambda): waiting for completion: unexpected state 'Failed', wanted target 'Active'. last error: InsufficientRolePermissions: The function's execution role doesn't have permission to perform this operation.

Run 2:

│Error: creating Lambda Function (xxx-lambda): operation error Lambda: CreateFunction, https response error StatusCode: 409, RequestID: xxx, ResourceConflictException: Function already exist: xxx-lambda

This means the state is becoming corrupted when this issue occurs.

Is there any chance of this being prioritised? This issue makes it very difficult to automate builds with Terraform since they regularly need human intervention, somewhat defeating the purpose of automating it in the first place...
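
As a possible stopgap for the second failure, and only if the function really is missing from state while still existing in AWS, an import block (available from Terraform 1.5) can bring it back under management. This is a sketch under those assumptions, not a confirmed fix for this issue: the resource address is the one from the original report, and `xxx-lambda` is the redacted function name from the error above. A function left in the `Failed` state may still need to be deleted and recreated instead.

```hcl
# Assumes Terraform >= 1.5 and that the function exists in AWS but not in state.
# "xxx-lambda" stands in for the real (redacted) function name.
import {
  to = aws_lambda_function.stub_lambda
  id = "xxx-lambda"
}
```

The block can be removed again once an apply has completed successfully.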

jurajseffer commented 1 year ago

Typically, your aws_lambda_function would have

depends_on = [aws_iam_role_policy_attachment.stub_lambda_exec_role_eni]

This makes sure the IAM permissions are in place before Terraform attempts to create the function. The implicit dependency on the role inside the function is not enough, because it leaves a race condition between the function and the policy attachment.
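
Applied to the configuration from the original report, the suggestion would look roughly like the fragment below. The resource names come from that report; the remaining arguments are elided and assumed unchanged, so this is not a complete resource definition. A later comment in this thread reports that the error can still occur with this in place.

```hcl
resource "aws_lambda_function" "stub_lambda" {
  # Remaining arguments unchanged from the original configuration.
  role = aws_iam_role.stub_lambda_role.arn

  # Referencing the role only orders the function after the role itself.
  # The explicit dependency also orders it after the policy attachment.
  depends_on = [aws_iam_role_policy_attachment.stub_lambda_exec_role_eni]
}
```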

ascopes commented 1 year ago

I think the recreation of the Lambda is also a bug that needs considering. It would appear the state is not correctly updated when it fails.

jurajseffer commented 1 year ago

@ascopes I guess the state corruption is specific to your deployment. I have seen this error every time an IAM role trust policy is misconfigured, and it has never resulted in Terraform losing track of the failed Lambda creation. It always wanted to recreate it.

ascopes commented 1 year ago

@jurajseffer that is strange, we don't do anything fancy but it just occasionally does this and then fails to track the state. We have an automatic retry mechanism in our builds due to other bugs within the AWS provider that will occasionally fail to deploy consistently, so when this problem occurs, the pipeline will retry and will always give us an error because the function already exists.

All I can think is that it is a bug with how Terraform orders resource creations that it thinks can be run in parallel. E.g. with the missing depends_on you mentioned, Terraform may think the role policy attachment and the lambda itself can be created at the same time since both depend on the IAM role itself.

Purely speculating, but I believe Golang under the hood will randomise the order of iteration across unordered collections like maps, so this might be an artifact of that.

I have been tempted to try this with the awscc provider just to see if it has the same behaviour.

jurajseffer commented 1 year ago

That is precisely what is happening - you're having a race condition where permissions aren't attached and/or loaded up by IAM when you try to create the function because both are being created in parallel. Adding explicit depends_on solves it. I cannot explain your state problems though.

ascopes commented 1 year ago

Can confirm that this still intermittently occurs even with @jurajseffer's suggested changes unfortunately.

resource "aws_iam_role" "role" {
  name = "TestRole"
}

resource "aws_iam_role_policy_attachment" "basic_access" {
  role       = aws_iam_role.role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_lambda_function" "function" {
  depends_on    = [aws_iam_role_policy_attachment.basic_access]
  role          = aws_iam_role.role.arn
  function_name = "TestFunction"
  ...
}

[2023-08-25T11:45:45.661Z] aws_iam_role.role: Creation complete after 0s [id=TestRole]
[2023-08-25T11:45:45.661Z] aws_iam_role_policy_attachment.basic_access: Creating...
[2023-08-25T11:45:45.661Z] aws_iam_role.role: Creation complete after 0s [id=TestRole]
[2023-08-25T11:45:45.661Z] aws_iam_role_policy_attachment.basic_access: Creation complete after 1s [id=TestRole-xxxx]
...
[2023-08-25T11:45:54.333Z] aws_lambda_function.function: Creating...
[2023-08-25T11:46:04.309Z] aws_lambda_function.function: Still creating... [10s elapsed]
...
╷
│ Error: creating Lambda Function (TestFunction): waiting for completion: unexpected state 'Failed', wanted target 'Active'. last error: InsufficientRolePermissions: The function's execution role doesn't have permission to perform this operation.
│ 
│   with aws_lambda_function.function,
│   on lambda.tf line 10, in resource "aws_lambda_function" "function":
│    10: resource "aws_lambda_function" "function" {
│ 
╵

My guess is that IAM is not immediately propagating the role policy globally before the Lambda gets created, so the provider needs to back off and retry creation for up to a couple of minutes when it encounters this specific error, to rule that out.

JonathanManass commented 1 year ago

I encountered the exact same issue. I only managed to fix it by adding a one-minute sleep between them, like this:

resource "time_sleep" "wait_1_minute" {
  depends_on = [
    aws_iam_role_policy_attachment.basic_access
  ]
  create_duration = "1m"

  lifecycle {
    ignore_changes = all
  }
}

resource "aws_lambda_function" "function" {
  role          = aws_iam_role.role.arn
  function_name = "TestFunction"
  ...
  depends_on = [
    aws_iam_role_policy_attachment.basic_access,
    time_sleep.wait_1_minute
  ]
}

NB: you don't need to list aws_iam_role_policy_attachment.basic_access again in the function's depends_on, since it's implicit through the sleep dependency, but it will be needed if this is ever fixed and the sleep can be removed entirely.

prashant0085 commented 5 months ago

I am facing the same issue when trying to add the VPC config to a Lambda. I tried adding time_sleep as suggested by @JonathanManass, as well as a dependency on the role attachment for the Lambda, but every time it taints the Lambda, which is then recreated, causing the error below to occur again:

Error: creating Lambda Function (example-lambda-prod-eu-west-1): waiting for completion: unexpected state 'Failed', wanted target 'Active'. last error: InsufficientRolePermissions: The function's execution role doesn't have permission to perform this operation.

Also, after Terraform fails, the IAM role has been created and attached to the Lambda, and the VPC config that was part of the plan can be seen in the Lambda's configuration.

It is only Terraform that is failing.

Terraform Version: 1.3.6 Provider Version: hashicorp/aws v5.44.0

ascopes commented 5 months ago

@prashant0085 can you show your config for this? Unless the role itself is being recreated, nothing should be tainted.

resource "aws_iam_role" "lambda" {
  ...
}

resource "aws_iam_role_policy_attachment" "basic_access" {
  role = aws_iam_role.lambda.name
  // basic access role if not attached
  // to a VPC, or VPC access role if using
  // a VPC.
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "time_sleep" "wait" {
  depends_on = [aws_iam_role_policy_attachment.basic_access]
  create_duration = "1m"
}

resource "aws_lambda_function" "lambda" {
  depends_on = [time_sleep.wait]
  ...
  role = aws_iam_role.lambda.arn
  ...
}

I assume your structure matches this?

prashant0085 commented 5 months ago

@ascopes Below is my config:

resource "aws_iam_role" "iam_for_lambda" {
  name = substr("lambda_${var.name}_${var.env}_${var.region}", 0, 64)

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": "LambdaFunctionAssumeRolePermission",
      "Condition": {
        "StringEquals": {
          "aws:SourceArn": "${local.lambda_arn}"
        }
      }
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "vpc_integration" {
  count      = local.vpc_integration_count
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
  role       = aws_iam_role.iam_for_lambda.name
}

resource "aws_iam_role_policy_attachment" "policy_attachment" {
  role       = aws_iam_role.iam_for_lambda.name
  policy_arn = aws_iam_policy.lambda_policy.arn
}

resource "aws_iam_role_policy_attachment" "custom_policies" {
  count      = length(var.custom_policies)
  policy_arn = var.custom_policies[count.index]
  role       = aws_iam_role.iam_for_lambda.name
}

resource "time_sleep" "wait_1_minute" {
  depends_on = [
    aws_iam_role_policy_attachment.custom_policies,
    aws_iam_role_policy_attachment.policy_attachment,
    aws_iam_role_policy_attachment.vpc_integration
  ]
  create_duration = "1m"

  lifecycle {
    ignore_changes = all
  }
}

resource "aws_lambda_function" "lambda" {
  depends_on = [
    aws_cloudwatch_log_group.lambda_log_group,
    time_sleep.wait_1_minute
  ]
  ...
  role          = aws_iam_role.iam_for_lambda.arn
  ...

  dynamic "vpc_config" {
    for_each = var.vpc_integration == null ? [] : [1]
    content {
      security_group_ids = concat(
        var.vpc_integration.sg_ids,
        values(aws_security_group.lambda)[*].id
      )
      subnet_ids = var.vpc_integration.subnet_ids
    }
  }
}

When I run the job, I get the plan below:

  # module.rds-mysql.module.backup-lambda["diamant"].aws_lambda_function.lambda is tainted, so must be replaced
-/+ resource "aws_lambda_function" "lambda" {
      ~ architectures                  = [
          - "x86_64",
        ] -> (known after apply)
      ~ arn                            = "arn:aws:lambda:eu-west-1:xxxxxxxxxxxx:function:example-eu-west-1" -> (known after apply)
      ~ id                             = "example-eu-west-1" -> (known after apply)
      ~ invoke_arn                     = "arn:aws:apigateway:eu-west-1:lambda:path/xxxx/functions/arn:aws:lambda:eu-west-1:xxxxxxxxxxxxxxx:function:example-eu-west-1" -> (known after apply)
      ~ last_modified                  = "2024-05-03T14:45:50.633+0000" -> (known after apply)
      ~ qualified_arn                  = "arn:aws:lambda:eu-west-1:xxxxxxxx:function:example-lambda:$LATEST" -> (known after apply)
      ~ qualified_invoke_arn           = "arn:aws:apigateway:eu-west-1:lambda:path/2015-03-31/functions/arn:aws:lambda:eu-west-1:xxxxxxxxxxxxxxx:function:example-eu-west-1:$LATEST/invocations" -> (known after apply)
      + signing_job_arn                = (known after apply)
      + signing_profile_version_arn    = (known after apply)
      ~ source_code_size               = 814 -> (known after apply)

      ~ version                        = "$LATEST" -> (known after apply)
        # (16 unchanged attributes hidden)

      ~ ephemeral_storage {
          ~ size = 512 -> (known after apply)
        }

      ~ logging_config {
          + application_log_level = (known after apply)
          ~ log_format            = "Text" -> (known after apply)
          ~ log_group             = "/aws/lambda/exmaple-lmabda" -> (known after apply)
          + system_log_level      = (known after apply)
        }

      ~ tracing_config {
          ~ mode = "PassThrough" -> (known after apply)
        }

      ~ vpc_config {
          ~ vpc_id                      = "vpc-xxxxxxxxxxxxxxxx" -> (known after apply)
            # (3 unchanged attributes hidden)
        }

        # (1 unchanged block hidden)
    }

  # module.rds-mysql.module.backup-lambda["exmaple"].aws_lambda_permission.allow_cloudwatch_to_call_check_foo[0] will be created
  + resource "aws_lambda_permission" "allow_cloudwatch_to_call_check_foo" {
      + action              = "lambda:InvokeFunction"
      + function_name       = "exmaple-function"
      + id                  = (known after apply)
      + principal           = "events.amazonaws.com"
      + source_arn          = "arn:aws:events:eu-west-1:XXXXXXXX:rule/exmaple-eu-west-1"
      + statement_id        = "AllowExecutionFromCloudWatch"
      + statement_id_prefix = (known after apply)
    }

  # module.rds-mysql.module.backup-lambda["exmaple"].time_sleep.wait_1_minute will be created
  + resource "time_sleep" "wait_1_minute" {
      + create_duration = "1m"
      + id              = (known after apply)
    }

And this is the terraform apply output:

module.rds-mysql.module.backup-lambda["example"].aws_lambda_function.lambda: Destroying... [id=example]
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Creating...
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Still creating... [10s elapsed]
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Still creating... [20s elapsed]
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Still creating... [30s elapsed]
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Still creating... [40s elapsed]
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Still creating... [50s elapsed]
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Still creating... [1m0s elapsed]
module.rds-mysql.module.backup-lambda["omaccounts"].time_sleep.wait_1_minute: Creation complete after 1m0s [id=2024-05-03T15:03:07Z]
module.rds-mysql.module.backup-lambda["example"].aws_lambda_function.lambda: Creating...
╵
╷
│ Error: creating Lambda Function (example-lmabda): waiting for completion: unexpected state 'Failed', wanted target 'Active'. last error: InsufficientRolePermissions: The function's execution role doesn't have permission to perform this operation.
│ 
│   with module.rds-mysql.module.backup-lambda["examle"].aws_lambda_function.lambda,
│   on .terraform/modules/rds-mysql/lambda/main.tf line 20, in resource "aws_lambda_function" "lambda":
│   20: resource "aws_lambda_function" "lambda" {
│

FYI: only the Lambda is being recreated; the IAM role and attachments are not. I also tried deleting the Lambda and the IAM role and using depends_on without the time_sleep, and kept getting the same error.

Before deleting the Lambda, I was getting the same error about the Lambda not having enough permissions to add the VPC config.

ascopes commented 5 months ago

@prashant0085 if the vpc integration count is 0, are you adding the regular policy attachment that you need?

https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

It looks like if your Lambda is created with VPC integration on and you then apply with it off, you remove the policy attachment for VPC access but don't add the basic execution role, whose permissions are a subset of it. You're also ignoring all changes on the time_sleep, so I think it won't retrigger the sleep once the policy changes again?

Unless I am mistaken (which I often am!)
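
Both points could be sketched as follows. This is an illustration under assumptions rather than a verified fix: the `basic_execution` resource is hypothetical, the other resource names are taken from the config posted above, and it relies on the `triggers` argument of `time_sleep`, which reruns the delay when any of its values change. The `lifecycle { ignore_changes = all }` block is left out on purpose, since it would also ignore changes to `triggers`.

```hcl
# Hypothetical: attach the basic execution policy only when VPC integration is off.
resource "aws_iam_role_policy_attachment" "basic_execution" {
  count      = local.vpc_integration_count == 0 ? 1 : 0
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  role       = aws_iam_role.iam_for_lambda.name
}

resource "time_sleep" "wait_1_minute" {
  create_duration = "1m"

  # When any attachment is created or replaced, its ID changes, the sleep is
  # replaced, and the delay runs again. These references also create the
  # dependencies, so no separate depends_on is needed here.
  triggers = {
    vpc_integration   = join(",", aws_iam_role_policy_attachment.vpc_integration[*].id)
    basic_execution   = join(",", aws_iam_role_policy_attachment.basic_execution[*].id)
    policy_attachment = aws_iam_role_policy_attachment.policy_attachment.id
    custom_policies   = join(",", aws_iam_role_policy_attachment.custom_policies[*].id)
  }
}
```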

prashant0085 commented 5 months ago

@ascopes Yes, I am adding all the actions present in the regular policy (AWSLambdaBasicExecutionRole) to the Lambda. As you can see, there are three policy attachment resources: two are conditional and one is always attached. The one that is always attached carries the regular policy, shown below:

  dynamic "statement" {
    for_each = var.enable_cloudwatch_logs ? [1] : []
    content {
      effect = "Allow"

      actions = [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ]

      resources = [
        "arn:aws:logs:*:*:*"
      ]
    }
  }

where var.enable_cloudwatch_logs is always true.

As I have said, Terraform fails, but the Lambda is still created with the required VPC integration and with the IAM role and all of its policies.

ascopes commented 5 months ago

@prashant0085 I was under the impression AWS Lambda requires CloudWatch to work, although I may be wrong.

Not entirely sure what benefit you get from disabling logs for a Lambda entirely though, as you'd lose any way of monitoring it, including metrics via EMF, Lambda insights, etc?

Even if it is always true, I'd try and make a more minimal working example and remove things you don't need, like the dynamic block... get something as minimal as possible that produces the same issue.
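
A minimal reproduction along those lines might look like the sketch below. Every name and value in it is a placeholder rather than something taken from the configs in this thread: the two variables, the `lambda.zip` package, and the handler and runtime are all assumptions. It drops the trust policy condition, the `count` expressions and the dynamic block so that only the role, one attachment and the function remain.

```hcl
# Hypothetical minimal reproduction; all names and values are placeholders.
variable "subnet_ids" {
  type = list(string)
}

variable "security_group_ids" {
  type = list(string)
}

resource "aws_iam_role" "repro" {
  name = "LambdaReproRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "repro_vpc" {
  role       = aws_iam_role.repro.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_lambda_function" "repro" {
  function_name = "lambda-repro"
  role          = aws_iam_role.repro.arn
  filename      = "lambda.zip" # assumed to exist next to this file
  handler       = "index.handler"
  runtime       = "python3.12"

  vpc_config {
    subnet_ids         = var.subnet_ids
    security_group_ids = var.security_group_ids
  }

  depends_on = [aws_iam_role_policy_attachment.repro_vpc]
}
```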

It sounds like this is possibly a logic error, so debugging Terraform to see the API calls will help spot what is changing internally as well.