ribbonhood commented 9 months ago

Terraform Core Version

1.6.5

AWS Provider Version

5.29.0

Affected Resource(s)

aws_sfn_state_machine

Expected Behavior

State machine version is updated and pointed to the new alias

Actual Behavior

State machine update times out and fails.

Relevant Error/Panic Output Snippet

╷
│ Error: waiting for Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) update: Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) eventual consistency
│ 
│   with aws_sfn_state_machine.state_machine_1,
│   on test_sf.tf line 10, in resource "aws_sfn_state_machine" "state_machine_1":
│   10: resource "aws_sfn_state_machine" "state_machine_1" {
│ 
╵

Terraform Configuration Files

data "template_file" "sf_template" {
  template = file("${path.module}/definition.json.tpl")
}

resource "aws_iam_role" "step-functions-role" {
  name = "test_sf_1"
  assume_role_policy = file("${path.module}/step-functions-role.json")
}

resource "aws_sfn_state_machine" "state_machine_1" {
  name     = "test_state_machine_1"
  role_arn = aws_iam_role.step-functions-role.arn
  publish  = true

  logging_configuration {
    include_execution_data = false
  }
  definition = data.template_file.sf_template.rendered

  /*lifecycle {
    replace_triggered_by = [value]
  }*/
  timeouts {
    #create = "5m"
    update = "2m"
  }
}

data "aws_sfn_state_machine_versions" "state_machine_1_versions" {
  statemachine_arn = aws_sfn_state_machine.state_machine_1.arn
}

resource "aws_sfn_alias" "sfn_active_alias" {
  name = "test_state_machine_1_active"

  routing_configuration {
    state_machine_version_arn = element(data.aws_sfn_state_machine_versions.state_machine_1_versions.statemachine_versions, length(data.aws_sfn_state_machine_versions.state_machine_1_versions.statemachine_versions)-1)
    weight                    = 100
  }

  depends_on = [time_sleep.wait_for_step_function]
}

resource "time_sleep" "wait_for_step_function" {
  create_duration = "30s"
  triggers = {
    role = aws_sfn_state_machine.state_machine_1.arn
  }
}

DEFINITION

{
  "Comment": "Test SM",
  "StartAt": "Step 1",
  "States": {
    "Step 1": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "Func1",
        "Payload": {
          "payload.$": "$",
          "token.$": "$$.Task.Token"
        }
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException",
            "Lambda.TooManyRequestsException"
          ],
          "IntervalSeconds": 5,
          "MaxAttempts":2,
          "BackoffRate":2
        }
      ],
      "Next": "Step 2",
      "ResultSelector": {
        "request.$": "$$.Execution.Input",
        "result.$": "$"
      }
    },
    "Step 2": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "Func2"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException",
            "Lambda.TooManyRequestsException"
          ],
          "IntervalSeconds": 4,
          "MaxAttempts":1,
          "BackoffRate":1
        }
      ],
      "End": true,
      "TimeoutSeconds": 28800
    }
  },
  "TimeoutSeconds": 86430
}

ROLE

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": [
          "states.amazonaws.com"
        ]
      },
      "Effect": "Allow"
    }
  ]
}

Steps to Reproduce

Run terraform apply to create the resources.
Run terraform apply again, even without making any changes. The update fails.

Debug Output


  http.response.body=
  | {"creationDate":1.701436609239E9,"definition":"{\n  \"Comment\": \"Test SM\",\n  \"StartAt\": \"Step 1\",\n  \"States\": {\n    \"Step 1\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::lambda:invoke.waitForTaskToken\",\n      \"Parameters\": {\n        \"FunctionName\": \"Func1\",\n        \"Payload\": {\n          \"payload.$\": \"$\",\n          \"token.$\": \"$$.Task.Token\"\n        }\n      },\n      \"Retry\": [\n        {\n          \"ErrorEquals\": [\n            \"Lambda.ServiceException\",\n            \"Lambda.AWSLambdaException\",\n            \"Lambda.SdkClientException\",\n            \"Lambda.TooManyRequestsException\"\n          ],\n          \"IntervalSeconds\": 5,\n          \"MaxAttempts\":2,\n          \"BackoffRate\":2\n        }\n      ],\n      \"Next\": \"Step 2\",\n      \"ResultSelector\": {\n        \"request.$\": \"$$.Execution.Input\",\n        \"result.$\": \"$\"\n      }\n    },\n    \"Step 2\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::lambda:invoke\",\n      \"Parameters\": {\n        \"Payload.$\": \"$\",\n        \"FunctionName\": \"Func2\"\n      },\n      \"Retry\": [\n        {\n          \"ErrorEquals\": [\n            \"Lambda.ServiceException\",\n            \"Lambda.AWSLambdaException\",\n            \"Lambda.SdkClientException\",\n            \"Lambda.TooManyRequestsException\"\n          ],\n          \"IntervalSeconds\": 4,\n          \"MaxAttempts\":1,\n          \"BackoffRate\":1\n        }\n      ],\n      \"End\": true,\n      \"TimeoutSeconds\": 28800\n    }\n  },\n  \"TimeoutSeconds\": 
86430\n}","loggingConfiguration":{"__type":"com.amazonaws.swf.base.model#LoggingConfiguration","includeExecutionData":false,"level":"OFF"},"name":"test_state_machine_1","revisionId":"72aa6bea-68f1-4a29-8b0a-c193390a4f96","roleArn":"arn:aws:iam::XXXX:role/test_sf_1","stateMachineArn":"arn:aws:states:XXXX:XXXX:stateMachine:test_state_machine_1","status":"ACTIVE","tracingConfiguration":{"__type":"com.amazonaws.swf.base.model#TracingConfiguration","enabled":false},"type":"STANDARD"}

Panic Output

No response

Important Factoids

No

References

No response

Would you like to implement a fix?

None

ribbonhood commented 8 months ago

After some tinkering, it appears the issue is caused by a logging_configuration block in which level is not set explicitly:

logging_configuration {
    include_execution_data = false
}

When level is not set explicitly, a bug appears to make Terraform try to recreate the state machine, which in turn produces this error. With level = "OFF" added explicitly, the state machine is not recreated and updates work as expected.

logging_configuration {
    level = "OFF"
    include_execution_data = false
}

I'll leave this open as it may be an actual bug that needs to be looked into.

brainsiq commented 3 weeks ago

I've had a similar issue, which seemed to be caused by not setting kms_data_key_reuse_period_seconds in encryption_configuration.

Every apply performed an in-place update to change the value from 300 (the default) to null, and more often than not it produced the same eventual consistency error. With publish = true, each apply also published a new version; that stopped after I added the encryption setting.
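
For anyone landing here, a minimal sketch of what pinning both values explicitly might look like. This is not a confirmed fix, only the two workarounds from this thread combined. The aws_kms_key.sfn reference is a hypothetical placeholder that does not appear in the thread, the CUSTOMER_MANAGED_KMS_KEY type is an assumption about the setup described above, and 300 is simply the default value mentioned in the comment.

```hcl
resource "aws_sfn_state_machine" "state_machine_1" {
  name       = "test_state_machine_1"
  role_arn   = aws_iam_role.step-functions-role.arn
  publish    = true
  definition = data.template_file.sf_template.rendered

  logging_configuration {
    # Set level explicitly instead of relying on the implicit default.
    level                  = "OFF"
    include_execution_data = false
  }

  encryption_configuration {
    # Assumption: a customer managed key is in use.
    type       = "CUSTOMER_MANAGED_KMS_KEY"
    kms_key_id = aws_kms_key.sfn.arn # hypothetical key resource

    # Pin the reuse period so the plan stops showing 300 -> null.
    kms_data_key_reuse_period_seconds = 300
  }
}
```

After a change like this, a second terraform plan with no other edits should show whether the in-place update on the state machine has gone away.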

hashicorp / terraform-provider-aws

[Bug]: update: Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) eventual consistency #34697