ljfranklin / terraform-resource

A concourse resource to create infrastructure via Terraform
MIT License

intermittent errors pulling state after successful plan/apply #119

Open cludden opened 4 years ago

cludden commented 4 years ago

When running many parallel plans or applies against a single resource, each with a different workspace, we often encounter an intermittent error that fails the step:

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼

# ..

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

Outputs:

  # ..

▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲

Failed To Run Terraform Apply!

2020/05/08 19:15:08 Apply Error: Error running `state pull`: exit status 1, Output: 

ljfranklin commented 4 years ago

@cludden I'm not aware of any reasons why running state pull would fail regularly. I added some additional logging around that command here. That change is present in the ljfranklin/terraform-resource:latest and ljfranklin/terraform-resource:0.12.24 images. Try running for a bit with that change to see if you get a more informative error message.

cludden commented 4 years ago

After much head scratching, I believe this is due to an S3 race condition that happens very intermittently (roughly 1 in 500 builds). Would you accept a PR that adds some retry logic around this step?
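
To make the proposal concrete, here's a minimal sketch of the kind of wrapper I'm imagining (the function name, attempt count, and delay are hypothetical, not the resource's actual code):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// statePullWithRetry shells out to `terraform state pull`, retrying with a
// fixed sleep to ride out transient S3 read-after-write lag. The attempt
// count and delay are illustrative, not tuned values.
func statePullWithRetry(attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("terraform", "state", "pull").CombinedOutput()
		if err == nil {
			return out, nil
		}
		lastErr = fmt.Errorf("state pull attempt %d: %w, output: %s", i+1, err, out)
		time.Sleep(delay)
	}
	return nil, lastErr
}

func main() {
	state, err := statePullWithRetry(5, 3*time.Second)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Printf("pulled %d bytes of state\n", len(state))
}
```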

ljfranklin commented 4 years ago

@cludden the S3 backend already retries 5 times by default: https://www.terraform.io/docs/backends/types/s3.html#max_retries. Try checking whether Terraform treats your error as retryable; a 404, for example, might not be. Or maybe the sleep between retries is too short. In any case, if at all possible I'd rather any retry logic live in Terraform itself so that all Terraform users get the benefit.

cludden commented 4 years ago

I think the race condition is not in Terraform itself but in the resource: a successful apply pushes an updated state file, and the resource calls `state pull` so quickly afterwards that S3's consistency model can return stale data (see the sketch after the quoted excerpt below). Some details from the Amazon S3 data consistency model section of the S3 docs:

However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:

A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.

A process replaces an existing object and immediately tries to read it. Until the change is fully propagated, Amazon S3 might return the previous data.

A process deletes an existing object and immediately tries to read it. Until the deletion is fully propagated, Amazon S3 might return the deleted data.

A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.
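
The second behavior above is exactly the window we seem to hit: the apply overwrites the state object, and the immediate `state pull` can read back the previous version. Here's a minimal sketch with the AWS SDK for Go showing the overwrite-then-read pattern (the bucket and key names are placeholders):

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder bucket/key; the resource stores state under a per-workspace key.
	bucket, key := "my-state-bucket", "env:/my-workspace/terraform.tfstate"

	svc := s3.New(session.Must(session.NewSession()))

	// Overwrite the existing state object...
	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   bytes.NewReader([]byte("new state")),
	})
	if err != nil {
		panic(err)
	}

	// ...then read it back immediately. Under the consistency model quoted
	// above, this read can still return the previous object's contents.
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		panic(err)
	}
	body, _ := ioutil.ReadAll(out.Body)
	fmt.Printf("read %q (may be stale right after an overwrite)\n", body)
}
```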

Here is an updated error with the additional logging you added (thanks again, btw)!

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼

<redacted>

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

<redacted>

▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲

Failed To Run Terraform Apply!

2020/06/18 22:36:21 Apply Error: Error running `state pull`: exit status 1, Output: Failed to refresh state: state data in S3 does not have the expected content.

This may be caused by unusually long delays in S3 processing a previous state
update.  Please wait for a minute or two and try again. If this problem
persists, and neither S3 nor DynamoDB are experiencing an outage, you may need
to manually verify the remote state and update the Digest value stored in the
DynamoDB table to the following value: 575f3c723db817133af25135a8afa327

ljfranklin commented 4 years ago

@cludden Terraform is already retrying for 10 seconds before returning that error: https://github.com/hashicorp/terraform/blob/10d94fb764dd7762f3e8343fb7d987056fe9c830/backend/remote-state/s3/client.go#L57-L95. Maybe that hardcoded 10-second value should be bumped or made configurable, but I'd still suggest opening an issue/PR on Terraform itself. The goal is that users can run `terraform state pull` and it just works, without needing to roll their own retry wrappers.
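
For reference, a simplified sketch of the deadline-based retry pattern that client uses (the constants, names, and simulated fetch here are illustrative, not the actual Terraform source):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative constants; Terraform's S3 client hardcodes a ~10s window.
const (
	consistencyRetryTimeout      = 10 * time.Second
	consistencyRetryPollInterval = 2 * time.Second
)

// errBadChecksum stands in for the digest-mismatch case: the state object
// read from S3 doesn't match the digest recorded in DynamoDB.
var errBadChecksum = errors.New("state data in S3 does not have the expected content")

// getStateWithDeadline retries a stale read until the deadline expires,
// then gives up and surfaces the error to the caller.
func getStateWithDeadline(fetch func() ([]byte, error)) ([]byte, error) {
	deadline := time.Now().Add(consistencyRetryTimeout)
	for {
		data, err := fetch()
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, errBadChecksum) || time.Now().After(deadline) {
			return nil, err
		}
		time.Sleep(consistencyRetryPollInterval)
	}
}

func main() {
	attempts := 0
	// Simulated fetch that stays stale for the first couple of reads.
	fetch := func() ([]byte, error) {
		attempts++
		if attempts < 3 {
			return nil, errBadChecksum
		}
		return []byte("fresh state"), nil
	}
	data, err := getStateWithDeadline(fetch)
	fmt.Println(string(data), err, "after", attempts, "attempts")
}
```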