Closed SteveLinden closed 9 months ago
The error to be fixed:
Error: Failed to save state
Error saving state: failed to upload state: operation error S3: PutObject,
failed to rewind transport stream for retry, request stream is not seekable
Error: Failed to persist state to backend
The error shown above has prevented Terraform from writing the updated state
to the configured backend. To allow for recovery, the state has been written
to the file "errored.tfstate" in the current working directory.
Running "terraform apply" again at this point will create a forked state,
making it harder to recover.
To retry writing this state, use the following command:
terraform state push errored.tfstate
The fix has been temporarily deployed (see the Slack thread) to the scheduled baseline pipeline only, and it has already worked for the happy path and for a failure on apply. It still needs evidence of an errored state failure on apply followed by a successful state push. This will take some time, but if no pipeline fails due to an errored state for a week or two, that is probably a good enough test.
Leaving this issue open until there is more evidence, and then to roll the fix out to all other pipelines.
Putting it into the blocked column (or feel free to put it back into the backlog, if easier).
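For anyone following along, here is a minimal sketch of what a "push the errored state" step could look like. It is not the workflow code that was deployed: the function name and the `TF_BIN`, `MAX_PUSH_ATTEMPTS` and `PUSH_RETRY_DELAY` variables are hypothetical, and returning success after a successful push is an assumption about the desired pipeline behaviour.

```shell
#!/usr/bin/env bash
# Sketch only: run an apply and, if Terraform left an errored.tfstate behind,
# retry the upload with `terraform state push`.
# TF_BIN, MAX_PUSH_ATTEMPTS, PUSH_RETRY_DELAY and the function name are
# hypothetical; they are not taken from the actual workflow.

TF_BIN="${TF_BIN:-terraform}"
MAX_PUSH_ATTEMPTS="${MAX_PUSH_ATTEMPTS:-3}"
PUSH_RETRY_DELAY="${PUSH_RETRY_DELAY:-5}"

apply_with_state_recovery() {
  apply_status=0
  "$TF_BIN" apply -auto-approve -input=false || apply_status=$?

  # No errored.tfstate means there is nothing to recover:
  # report the apply result unchanged.
  if [ ! -f errored.tfstate ]; then
    return "$apply_status"
  fi

  # Terraform wrote errored.tfstate, so try to push it to the backend.
  attempt=1
  while [ "$attempt" -le "$MAX_PUSH_ATTEMPTS" ]; do
    if "$TF_BIN" state push errored.tfstate; then
      echo "state push succeeded on attempt $attempt"
      return 0  # assumption: a recovered state counts as a successful run
    fi
    attempt=$((attempt + 1))
    sleep "$PUSH_RETRY_DELAY"
  done

  echo "state push failed after $MAX_PUSH_ATTEMPTS attempts" >&2
  return 1
}
```

Calling `apply_with_state_recovery` in place of a bare `terraform apply` keeps the happy path unchanged and only adds work when `errored.tfstate` exists. Whether a recovered run should be reported as success or still flagged is a judgment call for each pipeline.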
https://mojdt.slack.com/archives/C015UBQ78MR/p1702984257007459 << You can see a short Slack conversation with our AWS TAM here where we were given some guidance / linked to the S3 performance design considerations whitepaper.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
Thanks David, but according to the doc, we are not hitting the limit in a single workflow run. FYI, the implemented solution does not look into the limits; it pushes the errored state.
It appears that the error may be a Terraform bug: https://github.com/hashicorp/terraform/issues/34528
We see the same issue in v1.6.6. Terraform trace for more insights:
2024-01-16T14:27:51.0906051Z 2024-01-16T14:27:50.918Z [DEBUG] states/remote: state read lineage is: e29388d3-e6cf-9115-e169-5f1d9f976c58; lineage is: e29388d3-e6cf-9115-e169-5f1d9f976c58
2024-01-16T14:27:51.0907551Z 2024-01-16T14:27:50.934Z [INFO] backend-s3: Uploading remote state: tf_backend.operation=Put tf_backend.req_id=23d0d7ed-07da-d344-e6e7-c8ba9a9e84cd tf_backend.s3.bucket=*** tf_backend.s3.path=***
2024-01-16T14:27:51.0918815Z 2024-01-16T14:27:50.941Z [DEBUG] backend-s3: HTTP Request Sent: aws.region=eu-west-2 aws.s3.bucket=*** aws.s3.key=*** rpc.method=PutObject rpc.service=S3 rpc.system=aws-api tf_aws.sdk=aws-sdk-go-v2 tf_aws.signing_region="" tf_backend.operation=Put tf_backend.req_id=23d0d7ed-07da-d344-e6e7-c8ba9a9e84cd tf_backend.s3.bucket=*** tf_backend.s3.path=*** http.request.header.x_amz_decoded_content_length=129597 http.request.header.authorization="AWS4-HMAC-SHA256 Credential=ASIA************KEUA/20240116/eu-west-2/s3/aws4_request, SignedHeaders=accept-encoding;amz-sdk-invocation-id;content-encoding;content-length;content-type;host;x-amz-acl;x-amz-content-sha256;x-amz-date;x-amz-decoded-content-length;x-amz-sdk-checksum-algorithm;x-amz-security-token;x-amz-server-side-encryption;x-amz-trailer, Signature=*****" http.request.header.x_amz_date=20240116T142750Z http.request.header.x_amz_content_sha256=STREAMING-UNSIGNED-PAYLOAD-TRAILER http.request.header.x_amz_trailer=x-amz-checksum-sha256 net.peer.name=***.s3.eu-west-2.amazonaws.com http.request_content_length=129679 http.request.body="[Redacted: 126.6 KB (129,679 bytes), Type: application/json]" http.url=https://***.s3.eu-west-2.amazonaws.com/***?x-id=PutObject http.request.header.x_amz_security_token="*****" http.request.header.x_amz_acl=bucket-owner-full-control http.request.header.amz_sdk_request="attempt=1; max=5" http.request.header.accept_encoding=identity http.request.header.x_amz_server_side_encryption=AES256 http.request.header.content_encoding=aws-chunked http.request.header.x_amz_sdk_checksum_algorithm=SHA256 http.method=PUT http.user_agent="APN/1.0 HashiCorp/1.0 Terraform/1.6.6 (+https://www.terraform.io) aws-sdk-go-v2/1.24.0 os/linux lang/go#1.21.5 md/GOOS#linux md/GOARCH#amd64 api/s3#1.47.5 ft/s3-transfer" http.request.header.amz_sdk_invocation_id=1fae9720-5924-45a9-832c-2bce77cca664 http.request.header.content_type=application/json
2024-01-16T14:27:51.5915142Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: reading initial snapshot from errored.tfstate
2024-01-16T14:27:51.5916619Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: snapshot file has nil snapshot, but that's okay
2024-01-16T14:27:51.5917812Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: read nil snapshot
2024-01-16T14:27:51.5919595Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: Importing snapshot with lineage "e29388d3-e6cf-9115-e169-5f1d9f976c58" serial 251 as the initial state snapshot at errored.tfstate
2024-01-16T14:27:51.5921167Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: preparing to manage state snapshots at errored.tfstate
2024-01-16T14:27:51.5922242Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: no previously-stored snapshot exists
2024-01-16T14:27:51.5923195Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: state file backups are disabled
2024-01-16T14:27:51.5924429Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: forcing lineage "e29388d3-e6cf-9115-e169-5f1d9f976c58" serial 251 for migration/import
2024-01-16T14:27:51.5925663Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: writing snapshot at errored.tfstate
2024-01-16T14:27:51.5940360Z
2024-01-16T14:27:51.5940616Z Error: Failed to save state
2024-01-16T14:27:51.5940990Z
2024-01-16T14:27:51.5941454Z Error saving state: failed to upload state: operation error S3: PutObject,
2024-01-16T14:27:51.5942382Z failed to rewind transport stream for retry, request stream is not seekable
2024-01-16T14:27:51.5942793Z
2024-01-16T14:27:51.5942939Z Error: Failed to persist state to backend
2024-01-16T14:27:51.5943194Z
2024-01-16T14:27:51.5943479Z The error shown above has prevented Terraform from writing the updated state
2024-01-16T14:27:51.5944159Z to the configured backend. To allow for recovery, the state has been written
2024-01-16T14:27:51.5944977Z to the file "errored.tfstate" in the current working directory.
2024-01-16T14:27:51.5945328Z
2024-01-16T14:27:51.5945594Z Running "terraform apply" again at this point will create a forked state,
2024-01-16T14:27:51.5946094Z making it harder to recover.
2024-01-16T14:27:51.5946293Z
2024-01-16T14:27:51.5946481Z To retry writing this state, use the following command:
2024-01-16T14:27:51.5946923Z terraform state push errored.tfstate
2024-01-16T14:27:51.5947279Z
2024-01-16T14:27:51.6009041Z data.aws_iam_policy_document.assume_role_policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6010040Z data.aws_iam_roles.github_actions_role: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6010940Z module.member-access[0].data.aws_iam_policy_document.assume-role-policy: Reading...
2024-01-16T14:27:51.6011803Z module.member-access-us-east[0].data.aws_iam_policy_document.assume-role-policy: Reading...
2024-01-16T14:27:51.6012988Z module.member-access-eu-central[0].data.aws_iam_policy_document.assume-role-policy: Reading...
2024-01-16T14:27:51.6013994Z module.member-access[0].data.aws_iam_policy_document.assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6015084Z module.member-access-us-east[0].data.aws_iam_policy_document.assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6016230Z module.member-access-eu-central[0].data.aws_iam_policy_document.assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6017237Z module.member-access[0].data.aws_iam_policy_document.combined-assume-role-policy: Reading...
2024-01-16T14:27:51.6018331Z module.member-access-us-east[0].data.aws_iam_policy_document.combined-assume-role-policy: Reading...
2024-01-16T14:27:51.6019555Z module.member-access-us-east[0].data.aws_iam_policy_document.combined-assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6021455Z module.member-access-eu-central[0].data.aws_iam_policy_document.combined-assume-role-policy: Reading...
2024-01-16T14:27:51.6022749Z module.member-access[0].data.aws_iam_policy_document.combined-assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6023985Z module.member-access-eu-central[0].data.aws_iam_policy_document.combined-assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6025019Z module.member-access[0].aws_iam_role.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6025868Z module.member-access-eu-central[0].aws_iam_role.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6026729Z module.member-access-us-east[0].aws_iam_role.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6027463Z data.aws_iam_session_context.whoami: Read complete after 1s [id=<REDACTED>]
2024-01-16T14:27:51.6028171Z data.aws_organizations_organization.root_account: Read complete after 1s [id=<REDACTED>]
2024-01-16T14:27:51.6029100Z module.ssm-cross-account-access.aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6030158Z module.instance-scheduler-access[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6031179Z module.member-access-us-east[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6032185Z module.member-access-eu-central[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6033154Z module.member-access[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
NOTE: the S3 bucket and tf backend values were further redacted with ***.
Additionally, CloudTrail does not show any errors for the above HTTP request:
2024-01-16T14:26:09.295+00:00 {"eventVersion":"1.09","userIdentity":{"type":"AssumedRole","principalId":"AROA5YRRXHENV4XBVCHPR:s3-replication","arn":"arn:aws:sts::946070829339:assumed-role/AWSS3BucketReplication-terraform-state/s3-replication","accountId":"946070829339","accessKeyId":"ASIA5YRRXHENUFNQVHN6","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"AROA5YRRXHENV4XBVCHPR","arn":"arn:aws:iam::946070829339:role/AWSS3BucketReplication-terraform-state","accountId":"946070829339","userName":"AWSS3BucketReplication-terraform-state"},"attributes":{"creationDate":"2024-01-16T14:20:19Z","mfaAuthenticated":"false"}},"invokedBy":"s3.amazonaws.com"},"eventTime":"2024-01-16T14:24:29Z","eventSource":"s3.amazonaws.com","eventName":"PutObject","awsRegion":"eu-west-1","sourceIPAddress":"s3.amazonaws.com","userAgent":"s3.amazonaws.com","requestParameters":{"bucketName":"modernisation-platform-terraform-state-replication","accessControlList":{"x-amz-grant-full-control":"id=\"22f81ca85d0d968a6c79bd16b75cca751d35a70b9a567245a376350149977d4b\", id=\"d26d93f7a0f00df8f4a8d63e50c3a9fa259e7ad1d02069bf40d84dd999c8c41f\""},"Host":"s3.eu-west-1.amazonaws.com","x-amz-server-side-encryption":"AES256","x-amz-version-id":"Oj4l.k7Sjr25T7.fcaheWHxPEjEI7R3Q","key":"environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate","x-amz-storage-class":"STANDARD"},"responseElements":{"x-amz-server-side-encryption":"AES256","x-amz-expiration":"expiry-date=\"Fri, 16 Jan 2026 00:00:00 GMT\", rule-id=\"main\"","x-amz-version-id":"Oj4l.k7Sjr25T7.fcaheWHxPEjEI7R3Q"},"additionalEventData":{"SignatureVersion":"SigV4","aclRequired":"Yes","CipherSuite":"ECDHE-RSA-AES128-GCM-SHA256","bytesTransferredIn":484292,"SSEApplied":"SSE_S3","AuthenticationMethod":"AuthHeader","x-amz-id-2":"mbUoyGDeBaax91fI2c5fehUOd40GS/S9kjLtR/TQL45BbvbU9cDpGhQYGKVqmEZ33L0E9fbGEQ0=","bytesTransferredOut":0},"requestID":"P8ZXMGC84BM13J4T","eventID":"b1a78289-1ca6-42b0-8d35-974357919dc0","readOnly":false,"resources":[{"type":"AWS::S3::Object","ARN":"arn:aws:s3:::modernisation-platform-terraform-state-replication/environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate"},{"accountId":"946070829339","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::modernisation-platform-terraform-state-replication"}],"eventType":"AwsApiCall","managementEvent":false,"recipientAccountId":"946070829339","eventCategory":"Data"}
2024-01-16T14:26:55.698+00:00 {"eventVersion":"1.09","userIdentity":{"type":"AWSAccount","principalId":"AROAUJX7QETDMJM3NOGOH:githubactionsrolesession","accountId":"295814833350"},"eventTime":"2024-01-16T14:24:07Z","eventSource":"s3.amazonaws.com","eventName":"PutObject","awsRegion":"eu-west-2","sourceIPAddress":"20.75.95.33","userAgent":"[APN/1.0 HashiCorp/1.0 Terraform/1.6.6 (+https://www.terraform.io) aws-sdk-go-v2/1.24.0 os/linux lang/go#1.21.5 md/GOOS#linux md/GOARCH#amd64 api/s3#1.47.5 ft/s3-transfer]","requestParameters":{"bucketName":"modernisation-platform-terraform-state","Host":"modernisation-platform-terraform-state.s3.eu-west-2.amazonaws.com","x-amz-acl":"bucket-owner-full-control","x-amz-server-side-encryption":"AES256","key":"environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate","x-id":"PutObject"},"responseElements":{"x-amz-server-side-encryption":"AES256","x-amz-expiration":"expiry-date=\"Fri, 16 Jan 2026 00:00:00 GMT\", rule-id=\"main\"","x-amz-version-id":"Oj4l.k7Sjr25T7.fcaheWHxPEjEI7R3Q"},"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-GCM-SHA256","bytesTransferredIn":484374,"SSEApplied":"SSE_S3","AuthenticationMethod":"AuthHeader","x-amz-id-2":"9DbayMa6lMrO5IKUZQVuCvTzxQ5co5+8IEXD80mEs7gvw8k1YBZTeyOBTXG2YlGAPRQcM682K2EBSo9uylAyyQ==","bytesTransferredOut":0},"requestID":"R9BZEZ499V7M2F1V","eventID":"affb50fc-5365-482e-8551-611f4b8cef94","readOnly":false,"resources":[{"type":"AWS::S3::Object","ARN":"arn:aws:s3:::modernisation-platform-terraform-state/environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate"},{"accountId":"946070829339","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::modernisation-platform-terraform-state"}],"eventType":"AwsApiCall","managementEvent":false,"recipientAccountId":"946070829339","sharedEventID":"06d62d60-ce1d-4507-a9c3-e53f90e69753","eventCategory":"Data","tlsDetails":{"tlsVersion":"TLSv1.2","cipherSuite":"ECDHE-RSA-AES128-GCM-SHA256","clientProvidedHostHeader":"modernisation-platform-terraform-state.s3.eu-west-2.amazonaws.com"}}
This is a good indication that the problem lies with Terraform: CloudTrail shows no issue and the state was actually saved in this instance, yet Terraform still fails.
The state push fix for the state persistence failure is now rolled out to the scheduled baseline workflow, with Slack alerts temporarily suppressed when the state push is successful. There will be separate issues to track the fix rollout to other workflows. Also, once https://github.com/hashicorp/terraform/issues/34528 is fixed, the alerting should be re-enabled.
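As a rough way to repeat that check on a larger export, the sketch below counts events that carry an `errorCode` field, which CloudTrail adds to a record when the request failed. It assumes one JSON event per line on stdin; the function name is hypothetical, and a plain `grep` is a crude stand-in for a proper JSON parser.

```shell
#!/usr/bin/env bash
# Sketch only: count CloudTrail events that recorded an error.
# Assumes one JSON event per line on stdin; count_failed_events is a
# hypothetical helper name.

count_failed_events() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # that number is 0, so `|| true` keeps the function from failing.
  grep -c '"errorCode"' || true
}
```

A result of `0` for the `PutObject` events around the failure time is consistent with S3 having accepted the upload.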
To roll out the fix to other repos/workflows see this issue: https://github.com/ministryofjustice/modernisation-platform/issues/6038
Expected Behavior
The state should save without issues.
Actual Behavior
We get the above error, as seen for example in https://github.com/ministryofjustice/modernisation-platform/actions/runs/7285740796/job/19855252914#step:7:113
Steps to Reproduce the Problem
Run a full release that amends everything, e.g. adding a role access change. The number of failures is not consistent, but they have been happening more often recently.
Version
Example is the run for PR #5840
Modules
modernisation-platform
Account
No response