hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: update: Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) eventual consistency #34697

Open ribbonhood opened 9 months ago

ribbonhood commented 9 months ago

Terraform Core Version

1.6.5

AWS Provider Version

5.29.0

Affected Resource(s)

aws_sfn_state_machine

Expected Behavior

State machine version is updated and pointed to the new alias

Actual Behavior

State machine update times out and fails.

Relevant Error/Panic Output Snippet

╷
│ Error: waiting for Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) update: Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) eventual consistency
│ 
│   with aws_sfn_state_machine.state_machine_1,
│   on test_sf.tf line 10, in resource "aws_sfn_state_machine" "state_machine_1":
│   10: resource "aws_sfn_state_machine" "state_machine_1" {
│ 
╵

Terraform Configuration Files

data "template_file" "sf_template" {
  template = file("${path.module}/definition.json.tpl")
}

resource "aws_iam_role" "step-functions-role" {
  name               = "test_sf_1"
  assume_role_policy = file("${path.module}/step-functions-role.json")
}

resource "aws_sfn_state_machine" "state_machine_1" {
  name     = "test_state_machine_1"
  role_arn = aws_iam_role.step-functions-role.arn
  publish  = true

  logging_configuration {
    include_execution_data = false
  }
  definition = data.template_file.sf_template.rendered

  /*lifecycle {
    replace_triggered_by = [value]
  }*/
  timeouts {
    #create = "5m"
    update = "2m"
  }
}

data "aws_sfn_state_machine_versions" "state_machine_1_versions" {
  statemachine_arn = aws_sfn_state_machine.state_machine_1.arn
}

resource "aws_sfn_alias" "sfn_active_alias" {
  name = "test_state_machine_1_active"

  routing_configuration {
    state_machine_version_arn = element(data.aws_sfn_state_machine_versions.state_machine_1_versions.statemachine_versions, length(data.aws_sfn_state_machine_versions.state_machine_1_versions.statemachine_versions)-1)
    weight                    = 100
  }

  depends_on = [time_sleep.wait_for_step_function]
}

resource "time_sleep" "wait_for_step_function" {
  create_duration = "30s"
  triggers = {
    role = aws_sfn_state_machine.state_machine_1.arn
  }
}

DEFINITION

{
  "Comment": "Test SM",
  "StartAt": "Step 1",
  "States": {
    "Step 1": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "Func1",
        "Payload": {
          "payload.$": "$",
          "token.$": "$$.Task.Token"
        }
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException",
            "Lambda.TooManyRequestsException"
          ],
          "IntervalSeconds": 5,
          "MaxAttempts":2,
          "BackoffRate":2
        }
      ],
      "Next": "Step 2",
      "ResultSelector": {
        "request.$": "$$.Execution.Input",
        "result.$": "$"
      }
    },
    "Step 2": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "Func2"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException",
            "Lambda.TooManyRequestsException"
          ],
          "IntervalSeconds": 4,
          "MaxAttempts":1,
          "BackoffRate":1
        }
      ],
      "End": true,
      "TimeoutSeconds": 28800
    }
  },
  "TimeoutSeconds": 86430
}

ROLE

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": [
          "states.amazonaws.com"
        ]
      },
      "Effect": "Allow"
    }
  ]
}

Steps to Reproduce

1. Run terraform apply to create the resources.
2. Run terraform apply again, even without making any changes; the update fails.

Debug Output


  http.response.body=
  | {"creationDate":1.701436609239E9,"definition":"{\n  \"Comment\": \"Test SM\",\n  \"StartAt\": \"Step 1\",\n  \"States\": {\n    \"Step 1\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::lambda:invoke.waitForTaskToken\",\n      \"Parameters\": {\n        \"FunctionName\": \"Func1\",\n        \"Payload\": {\n          \"payload.$\": \"$\",\n          \"token.$\": \"$$.Task.Token\"\n        }\n      },\n      \"Retry\": [\n        {\n          \"ErrorEquals\": [\n            \"Lambda.ServiceException\",\n            \"Lambda.AWSLambdaException\",\n            \"Lambda.SdkClientException\",\n            \"Lambda.TooManyRequestsException\"\n          ],\n          \"IntervalSeconds\": 5,\n          \"MaxAttempts\":2,\n          \"BackoffRate\":2\n        }\n      ],\n      \"Next\": \"Step 2\",\n      \"ResultSelector\": {\n        \"request.$\": \"$$.Execution.Input\",\n        \"result.$\": \"$\"\n      }\n    },\n    \"Step 2\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::lambda:invoke\",\n      \"Parameters\": {\n        \"Payload.$\": \"$\",\n        \"FunctionName\": \"Func2\"\n      },\n      \"Retry\": [\n        {\n          \"ErrorEquals\": [\n            \"Lambda.ServiceException\",\n            \"Lambda.AWSLambdaException\",\n            \"Lambda.SdkClientException\",\n            \"Lambda.TooManyRequestsException\"\n          ],\n          \"IntervalSeconds\": 4,\n          \"MaxAttempts\":1,\n          \"BackoffRate\":1\n        }\n      ],\n      \"End\": true,\n      \"TimeoutSeconds\": 28800\n    }\n  },\n  \"TimeoutSeconds\": 
86430\n}","loggingConfiguration":{"__type":"com.amazonaws.swf.base.model#LoggingConfiguration","includeExecutionData":false,"level":"OFF"},"name":"test_state_machine_1","revisionId":"72aa6bea-68f1-4a29-8b0a-c193390a4f96","roleArn":"arn:aws:iam::XXXX:role/test_sf_1","stateMachineArn":"arn:aws:states:XXXX:XXXX:stateMachine:test_state_machine_1","status":"ACTIVE","tracingConfiguration":{"__type":"com.amazonaws.swf.base.model#TracingConfiguration","enabled":false},"type":"STANDARD"}

Panic Output

No response

Important Factoids

No

References

No response

Would you like to implement a fix?

None

ribbonhood commented 8 months ago

After some tinkering, it appears the issue is related to having a logging_configuration block in which level is not explicitly set:

logging_configuration {
    include_execution_data = false
}

When level is left unset, there's a bug that tries to recreate the state machine, which in turn produces this error. Explicitly setting level = "OFF" doesn't recreate the state machine, and updates work as expected:

logging_configuration {
    level = "OFF"
    include_execution_data = false
}

I'll leave this open as it may be an actual bug that needs to be looked into.
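
For reference, the workaround applied to the full resource from the original report might look like the sketch below. It only adds level = "OFF" to the existing logging_configuration block; everything else is copied from the configuration in the report, and it has not been verified beyond what is described in this thread.

```hcl
resource "aws_sfn_state_machine" "state_machine_1" {
  name       = "test_state_machine_1"
  role_arn   = aws_iam_role.step-functions-role.arn
  publish    = true
  definition = data.template_file.sf_template.rendered

  logging_configuration {
    # Set level explicitly so the configured value matches what the API
    # returns ("level":"OFF" in the debug output) instead of leaving it unset.
    level                  = "OFF"
    include_execution_data = false
  }

  timeouts {
    update = "2m"
  }
}
```

With level set, a second terraform apply with no other changes should, per the comment above, no longer attempt to change the state machine.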

brainsiq commented 3 weeks ago

I've had a similar issue, which seemed to be caused by not setting kms_data_key_reuse_period_seconds in encryption_configuration.

Every apply would do an in-place update to set the value from 300 (the default) to null and, more often than not, would produce the same eventual consistency error. It was also updating the version (with publish = true), which stopped happening after adding the encryption setting.
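
The configuration behind this second workaround was not posted, so the following is only a sketch of what setting the value explicitly might look like. The resource name, the two input variables, the definition file path, and the choice of a customer managed KMS key are all assumptions made for illustration; only the attribute being set and the value 300 come from the comment above.

```hcl
# Hypothetical inputs; not taken from the thread.
variable "role_arn" {
  type        = string
  description = "ARN of the IAM role the state machine assumes"
}

variable "kms_key_arn" {
  type        = string
  description = "ARN of a customer managed KMS key"
}

resource "aws_sfn_state_machine" "example" {
  name       = "example_state_machine" # hypothetical name
  role_arn   = var.role_arn
  publish    = true
  definition = file("${path.module}/definition.json") # hypothetical path

  encryption_configuration {
    type       = "CUSTOMER_MANAGED_KMS_KEY" # assumed key type
    kms_key_id = var.kms_key_arn

    # Pin the reuse period to the reported default so the plan no longer
    # shows an in-place change from 300 to null on every apply.
    kms_data_key_reuse_period_seconds = 300
  }
}
```

Pinning the value to what the API already reports keeps the plan empty, which is the same idea as setting level = "OFF" in the logging workaround.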