cludden opened this issue 4 years ago
@cludden I'm not aware of any reason why running `state pull` would fail regularly. I added some additional logging around that command here. That change is present in the `ljfranklin/terraform-resource:latest` and `ljfranklin/terraform-resource:0.12.24` images. Try running with that change for a bit to see if you get a more informative error message.
After much head-scratching, I believe this is due to an S3 race condition and happens very intermittently (roughly 1 in 500 builds). Would you accept a PR that adds some retry logic around this step?
@cludden the S3 backend already retries 5 times by default: https://www.terraform.io/docs/backends/types/s3.html#max_retries. Try checking whether the Terraform code treats your error as retryable (a 404 might not be), or whether the sleep between retries is too short. In any case, if at all possible I'd rather any retry logic live in Terraform itself so that all Terraform users get the benefit.
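For reference, that retry count can be raised in the backend block itself. A minimal sketch, assuming an S3 backend; the bucket, key, and region values below are placeholders, not taken from this thread:

```hcl
terraform {
  backend "s3" {
    bucket      = "my-state-bucket"        # placeholder
    key         = "env/terraform.tfstate"  # placeholder
    region      = "us-east-1"              # placeholder
    max_retries = 10                       # default is 5
  }
}
```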
I think the race condition is not in Terraform but in the resource itself, due to the S3 consistency model: the resource calls `state pull` very shortly after a successful apply pushes an updated state file. Some details from the Amazon S3 data consistency model section of the S3 docs:
> However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:
> - A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
> - A process replaces an existing object and immediately tries to read it. Until the change is fully propagated, Amazon S3 might return the previous data.
> - A process deletes an existing object and immediately tries to read it. Until the deletion is fully propagated, Amazon S3 might return the deleted data.
> - A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.
Here is an updated error with the additional logging you added (thanks again, btw)!
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
<redacted>
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Outputs:
<redacted>
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
Failed To Run Terraform Apply!
2020/06/18 22:36:21 Apply Error: Error running `state pull`: exit status 1, Output: Failed to refresh state: state data in S3 does not have the expected content.
This may be caused by unusually long delays in S3 processing a previous state
update. Please wait for a minute or two and try again. If this problem
persists, and neither S3 nor DynamoDB are experiencing an outage, you may need
to manually verify the remote state and update the Digest value stored in the
DynamoDB table to the following value: 575f3c723db817133af25135a8afa327
@cludden Terraform is already retrying for 10 seconds before returning that error: https://github.com/hashicorp/terraform/blob/10d94fb764dd7762f3e8343fb7d987056fe9c830/backend/remote-state/s3/client.go#L57-L95. Maybe that hardcoded 10-second value should be bumped or made configurable. I'd still suggest opening an issue/PR on Terraform itself. The goal is that users can run `terraform state pull` and it just works, without needing to roll their own retry wrappers.
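In the meantime, a retry wrapper in the pipeline itself could paper over the replication delay. A minimal sketch, not part of the resource or Terraform; the function name, attempt count, and delay are all illustrative:

```shell
#!/bin/sh
# Hypothetical retry wrapper around `terraform state pull`: retries on any
# non-zero exit, sleeping between attempts to give S3 replication time to
# catch up. Note this retries ALL failures, not just consistency errors.
retry_state_pull() {
  attempts=${1:-5}  # max attempts (illustrative default)
  delay=${2:-10}    # seconds between attempts (illustrative default)
  i=1
  while ! terraform state pull; do
    if [ "$i" -ge "$attempts" ]; then
      echo "state pull failed after $attempts attempts" >&2
      return 1
    fi
    sleep "$delay"
    i=$((i + 1))
  done
}

# Example usage: up to 5 attempts, 10s apart, saving the state locally.
# retry_state_pull 5 10 > terraform.tfstate
```

This is deliberately coarse; a smarter version might grep the output for the "state data in S3 does not have the expected content" message and only retry on that.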
We often encounter this intermittent error, which "fails" the step, when running many parallel plans or applies against a single resource but with different workspaces.