Retry unlocking state when it fails

nitzanm commented 2 years ago

Current Terraform Version

Terraform v0.12.9

Use-cases

When working in a distributed team or automated pipeline, storing state remotely (say in Azure Blob Storage) has huge advantages. However, there is a persistent issue that makes this much more cumbersome: state that stays locked when it shouldn't.

The internet being what it is, sometimes the request to unlock the state file fails. In this case, the pipeline can't run and nobody can apply anything until a human manually unlocks the state. Working out whether the state is locked because someone else is still running a long apply, or whether it's locked in error, means manually polling every team member.

This ends up being one of the most time-consuming parts of using Terraform for us. It happens to us around once a week.

Releasing state lock. This may take a few moments...

Error releasing the state lock!

Error message: failed to retrieve lock info: Get https://xxxx.blob.core.windows.net/xxxxx/terraform.tfstate?comp=metadata: read tcp 192.168.x.x:xxx->52.179.144.64:443: read: connection reset by peer

Attempted Solutions

The only solution we've found is to manually unlock the state file via the Azure portal, after manually checking to make sure nobody else is legitimately holding the lock.

Proposal

Given how damaging a state file that remains locked is, I would like to propose that Terraform makes multiple attempts to unlock the state file if the first one fails - say perhaps 3 attempts with 5 second waits in between them. It would be even better if these parameters were configurable.

I'd also like to propose that Terraform return a non-zero exit code if unlocking the state file fails. The way I see it, a run that leaves the state file locked in error is not a successful run. Failing the run in an automated pipeline would page someone who could see the problem and manually unlock the state file.

References

N/A

rjpearce commented 2 years ago

@nitzanm can you share the common causes that result in your state file becoming locked? Sorry if this seems obvious, but if you can address the common causes of the state file becoming locked, then you would be less reliant on terraform unlock.

nitzanm commented 2 years ago

@rjpearce Every time our state file has stayed locked, it was because Terraform failed to unlock it. My best understanding is that this is due to a temporary network issue, as evidenced by the error message:

Releasing state lock. This may take a few moments...

Error releasing the state lock!

Error message: failed to retrieve lock info: Get https://xxxx.blob.core.windows.net/xxxxx/terraform.tfstate?comp=metadata: read tcp 192.168.x.x:xxx->52.179.144.64:443: read: connection reset by peer

We are running Terraform on Azure and it's accessing Azure APIs - so I think we are as "close" as possible to the APIs and I'm not sure there's much we can do to eliminate these temporary issues (though I'm open to suggestions!). For what it's worth, our Terraform runs also fail sometimes because of errors accessing various other Azure APIs (storage, load balancing, networking, etc.) - but those can all be retried because Terraform is idempotent. In this instance, a single failure means every future run will fail until we manually intervene.

Did I understand your question correctly?

jbardin commented 2 years ago

Thanks for filing the issue @nitzanm,

Ideally this logic would be implemented by the individual remote state implementations, since Terraform does not know what might constitute a fatal error as opposed to a retryable one. Either ensuring that implementations retry as needed or devising a retry mechanism is something to consider as we design a new interface for remote state storage.

Thanks!

nitzanm commented 2 years ago

@jbardin I agree that it would be nice to make this logic part of the remote state implementation. Correct me if I'm wrong - those implementations are part of Terraform core (not part of a provider for example), so these changes would still need to be against Terraform core, right?

jbardin commented 2 years ago

The remote state implementations currently live within this codebase, but are generally owned by other teams or external contributors (see https://github.com/hashicorp/terraform/blob/main/CODEOWNERS). The prevailing idea for a new model is a plugin-based system similar to how providers currently work, which will give implementors more independence in creating remote storage solutions.
