ialidzhikov opened this issue 1 year ago
For the `network already exists` error, it looks like that could be cleaned up on Pantheon.
For the first error, have you tried increasing the timeouts? I've run similar tests in the past and it can take more than 5 minutes to create/update.
I don't think it is related to google_compute_network timeouts at all. It is properly described in the issue that it is about an interrupt (SIGTERM, Ctrl + C) during google_compute_network creation: terraform-provider-gcp cannot handle it and leaks terraform state (the resource is created in the cloud provider but not present in the terraform state). According to our testing, when the terraform process is interrupted, it exits right away with:
^C
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
Stopping operation...
google_compute_network.network: Still creating... [10s elapsed]
╷
│ Error: execution halted
│
│
╵
╷
│ Error: execution halted
│
│
╵
╷
│ Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled
│
│ with google_compute_network.network,
│ on main.tf line 34, in resource "google_compute_network" "network":
│ 34: resource "google_compute_network" "network" {
│
╵
I don't really see how google_compute_network timeouts are involved here. google_compute_network already has high timeouts: https://github.com/hashicorp/terraform-provider-google/blob/133a9d72f749a7af65607e6cf7002a086d23d1b6/google/resource_compute_network.go#L46-L50
As I already described, when the terraform process is interrupted (SIGTERM), it exits right away with the above error and leaks the state/resource.
My assumption for the issue, without checking the code: I know that there is an Operation API in GCP and terraform-provider-gcp should do the following: it first creates the resource and then polls the Operation API to check whether the resource is created. If the resource is added to the state only when the operation is successful, then this might explain this issue. As you see above from the error, it fails to retrieve the operation for the network creation because the context is already cancelled (because of the SIGTERM).
UPDATE: An issue from the past that I opened where failing to get the operation (rate limit exceeded) was leaking the resource - ref https://github.com/hashicorp/terraform-provider-google/issues/8655.
I think from the code the culprit is more or less here (L263):
The provider basically unsets the ID of the resource if the polling operation fails. In case of context cancellation the operation is considered failed, which can be seen from the `Error waiting for Creating Network` log message.
I think that always unsetting the ID of the resource is not helpful: aside from context cancellation, e.g. network problems may also cause the operation request to fail. In general, unsetting the ID only makes sense if, in all the cases where the polling operation fails, you are certain that the resource was not created, which is not the case here.
This is also a generic problem it seems. From the few resources that I inspected the same logic is implemented everywhere.
Also, I am not sure what is gained exactly from unsetting the ID. A subsequent terraform run will validate the existence of the resource anyway. Isn't it more beneficial to account for the "worst-case" scenario, i.e. that the resource was created, and allow the subsequent runs to manage it without requiring operator intervention?
Hi all! I normally work on Terraform Core, but I'm here today because of an issue opened in the main Terraform Core repository about this bug.
I just wanted to confirm that the previous comment seems correct to me: if the request is cancelled then the provider must return a representation of state that is accurate enough for a subsequent plan and apply to succeed, so returning a null object (which is what the SDK does if you set the id to an empty string) should be done only if you are totally sure that no object has been created in the remote system.
In this case it seems like a situation where the remote API requires some polling after the object has been created to wait until it's ready to use. If that's true then a strategy I might suggest is to represent the status of the object as a computed attribute and then if you get cancelled during polling return a new state with the status set to something representing an error.
Then on the next plan you can check during refresh if the object became ready in the meantime anyway, and update the status attribute if so. If the object remains not ready during the planning phase then you could report that the status needs to change to "ready" and mark that change as "Force new" (in the SDK's terminology) so that the provider will have an opportunity to destroy and recreate the object during the next apply.
Other designs are possible too, so I'm sharing the above only as one possible approach to avoid this problem. The general form of this idea is that if an object has been created in the remote system then you should always return an object representing it from the "apply" step. You can optionally return an error at the same time if the object is in an unsalvageable state, in which case Terraform will automatically mark it as "tainted" so it'll automatically get planned for replacement on the next plan. Or you can handle it more surgically by treating cancellation as a success with the object somehow marked as incomplete, so that the next plan and apply can deal with getting it into a working state.
Renamed this to cover resources generally.
We're always removing the resource from state in generated resources if we get an error when polling, even if the issue is with the polling itself. We should do a better job of identifying cases where the resource may have been created, or consider tainting more often (although that behaviour change across many resources would be too large for a minor version).
Thinking out loud: maybe we could call read even when we get an error, rather than returning immediately; that's probably our most reliable way of telling whether we succeeded. However, post-creates and similar may pose an issue.
Community Note
If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.
Terraform Version
Affected Resource(s)
Terraform Configuration Files
Debug Output
N/A
Panic Output
N/A
Expected Behavior
terraform/terraform-provider-gcp should be resilient to interrupts and should not leak terraform state when an interrupt is received. We run terraform in quite an automated manner without human interaction. Every time state leaks, a human operator has to analyse it and fix it manually.
Actual Behavior
terraform/terraform-provider-gcp leaks state when the first terraform apply (that creates the resources) is interrupted.
Steps to Reproduce
terraform init
terraform apply -auto-approve
After the apply starts, interrupt it within 3-5 seconds. Logs:
Note that the network creation fails with Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled. The network is created in GCP but not saved in the terraform state. A subsequent terraform apply -auto-approve fails because the network already exists. The issue is always reproducible.
Important Factoids
References