hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

google provider: Terraform state leaks on interrupt #14782

Open ialidzhikov opened 1 year ago

ialidzhikov commented 1 year ago

Community Note

Terraform Version

% ./terraform --version
Terraform v1.3.9
on darwin_amd64
+ provider registry.terraform.io/hashicorp/google v4.53.1
+ provider registry.terraform.io/hashicorp/google-beta v4.53.1
+ provider registry.terraform.io/hashicorp/null v3.2.1

Affected Resource(s)

Terraform Configuration Files

terraform {
  required_providers {
    google = {
      version = "4.53.1"
    }
    google-beta = {
      version = "4.53.1"
    }
    null = {
      version = "3.2.1"
    }
  }
}

provider "google" {
  credentials = file("serviceaccount.json")
  project     = "project-foo"
  region      = "europe-west1"
}

//=====================================================================
//= Service Account
//=====================================================================

resource "google_service_account" "serviceaccount" {
  account_id   = "shoot--local--foo"
  display_name = "shoot--local--foo"
}

//=====================================================================
//= Networks
//=====================================================================

resource "google_compute_network" "network" {
  name                    = "shoot--local--foo"
  auto_create_subnetworks = false

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_subnetwork" "subnetwork-nodes" {
  name          = "shoot--local--foo-nodes"
  ip_cidr_range = "10.250.0.0/24"
  network       = google_compute_network.network.name
  region        = "europe-west1"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_router" "router" {
  name    = "shoot--local--foo-cloud-router"
  region  = "europe-west1"
  network = google_compute_network.network.name

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_router_nat" "nat" {
  name                               = "shoot--local--foo-cloud-nat"
  router                             = google_compute_router.router.name
  region                             = "europe-west1"
  nat_ip_allocate_option             = "AUTO_ONLY"

  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"
  subnetwork {
    name                    = google_compute_subnetwork.subnetwork-nodes.self_link
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
  min_ports_per_vm                    = 2048
  enable_endpoint_independent_mapping = false

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

//=====================================================================
//= Firewall
//=====================================================================

// Allow traffic within internal network range.
resource "google_compute_firewall" "rule-allow-internal-access" {
  name          = "shoot--local--foo-allow-internal-access"
  network       = google_compute_network.network.name
  source_ranges = ["10.250.0.0/24", "100.96.0.0/16"]
  allow {
    protocol = "icmp"
  }

  allow {
    protocol = "ipip"
  }

  allow {
    protocol = "tcp"
    ports    = ["1-65535"]
  }

  allow {
    protocol = "udp"
    ports    = ["1-65535"]
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_firewall" "rule-allow-external-access" {
  name          = "shoot--local--foo-allow-external-access"
  network       = google_compute_network.network.name
  source_ranges = ["0.0.0.0/0"]

  allow {
    protocol = "tcp"
    ports    = ["443"] // Allow ingress
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

// Required to allow Google to perform health checks on our instances.
// https://cloud.google.com/compute/docs/load-balancing/internal/
// https://cloud.google.com/compute/docs/load-balancing/network/
resource "google_compute_firewall" "rule-allow-health-checks" {
  name          = "shoot--local--foo-allow-health-checks"
  network       = google_compute_network.network.name
  source_ranges = [
    "35.191.0.0/16",
    "209.85.204.0/22",
    "209.85.152.0/22",
    "130.211.0.0/22",
  ]

  allow {
    protocol = "tcp"
    ports    = ["30000-32767"]
  }

  allow {
    protocol = "udp"
    ports    = ["30000-32767"]
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

// We have introduced new output variables. However, they are not applied for
// existing clusters as Terraform won't detect a diff when we run `terraform plan`.
// Workaround: Providing a null-resource for letting Terraform think that there are
// differences, enabling the Gardener to start an actual `terraform apply` job.
resource "null_resource" "outputs" {
  triggers = {
    recompute = "outputs"
  }
}

//=====================================================================
//= Output variables
//=====================================================================

output "vpc_name" {
  value = google_compute_network.network.name
}

output "cloud_router" {
  value = google_compute_router.router.name

}

output "cloud_nat" {
  value = google_compute_router_nat.nat.name
}

output "service_account_email" {
  value = google_service_account.serviceaccount.email
}

output "subnet_nodes" {
  value = google_compute_subnetwork.subnetwork-nodes.name
}

Debug Output

N/A

Panic Output

N/A

Expected Behavior

terraform/terraform-provider-gcp should be resilient to interrupts and should not leak Terraform state when an interrupt is received. We run Terraform in a highly automated manner without human interaction. Every time state leaks, a human operator has to analyse the state and fix it manually.

Actual Behavior

terraform/terraform-provider-gcp leaks state when the first terraform apply (the one that creates the resources) is interrupted.

Steps to Reproduce

  1. terraform init

  2. terraform apply -auto-approve

After the apply starts, interrupt it within 3-5 seconds. Logs:

% terraform apply -auto-approve

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

<omitted>

null_resource.outputs: Creating...
null_resource.outputs: Creation complete after 0s [id=6097200675366732611]
google_service_account.serviceaccount: Creating...
google_compute_network.network: Creating...
google_service_account.serviceaccount: Creation complete after 2s [id=projects/<omitted>/serviceAccounts/shoot--local--foo@<omitted>.iam.gserviceaccount.com]
^C
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

Stopping operation...
google_compute_network.network: Still creating... [10s elapsed]
β•·
β”‚ Error: execution halted
β”‚
β”‚
β•΅
β•·
β”‚ Error: execution halted
β”‚
β”‚
β•΅
β•·
β”‚ Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled
β”‚
β”‚   with google_compute_network.network,
β”‚   on main.tf line 34, in resource "google_compute_network" "network":
β”‚   34: resource "google_compute_network" "network" {
β”‚
β•΅

Note that the network creation fails with Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled. The network is created in GCP but not saved in the Terraform state. A subsequent terraform apply -auto-approve then fails because the network already exists:

% terraform apply -auto-approve
null_resource.outputs: Refreshing state... [id=6097200675366732611]
google_service_account.serviceaccount: Refreshing state... [id=projects/<omitted>/serviceAccounts/shoot--local--foo@<omitted>.iam.gserviceaccount.com]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

<omitted>

Plan: 7 to add, 0 to change, 0 to destroy.
google_compute_network.network: Creating...
β•·
β”‚ Error: Error creating Network: googleapi: Error 409: The resource 'projects/<omitted>/global/networks/shoot--local--foo' already exists, alreadyExists
β”‚
β”‚   with google_compute_network.network,
β”‚   on main.tf line 34, in resource "google_compute_network" "network":
β”‚   34: resource "google_compute_network" "network" {
β”‚
β•΅

The issue is always reproducible.

Important Factoids

References

hao-nan-li commented 1 year ago

For the network already exists error, it looks like that could be cleaned up on Pantheon.

For the first error, have you tried increasing the timeouts? I've run similar tests in the past and it could take more than 5 minutes to create/update.

ialidzhikov commented 1 year ago

For the first error, have you tried increasing the timeouts? I've run similar tests in the past and it could take more than 5 minutes to create/update.

I don't think it is related to the google_compute_network timeouts at all. As described in the issue, this is about an interrupt (SIGTERM, Ctrl + C) during google_compute_network creation: terraform-provider-gcp cannot handle it and leaks Terraform state (the resource is created in the cloud provider but not present in the Terraform state). According to our testing, when the terraform process is interrupted, it exits right away with:

^C
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

Stopping operation...
google_compute_network.network: Still creating... [10s elapsed]
β•·
β”‚ Error: execution halted
β”‚
β”‚
β•΅
β•·
β”‚ Error: execution halted
β”‚
β”‚
β•΅
β•·
β”‚ Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled
β”‚
β”‚   with google_compute_network.network,
β”‚   on main.tf line 34, in resource "google_compute_network" "network":
β”‚   34: resource "google_compute_network" "network" {
β”‚
β•΅

I don't really see how the google_compute_network timeouts are involved here. google_compute_network already has high timeouts: https://github.com/hashicorp/terraform-provider-google/blob/133a9d72f749a7af65607e6cf7002a086d23d1b6/google/resource_compute_network.go#L46-L50

As I already described, when the terraform process is interrupted (SIGTERM), it exits right away with the above error and leaks the state/resource.

ialidzhikov commented 1 year ago

My assumption for the issue, without checking the code: I know that there is an Operation API in GCP, and terraform-provider-gcp presumably does the following: it first creates the resource and then polls the Operation API to check whether the resource is created. If the resource is added to the state only when the operation is successful, that would explain this issue. As you can see from the error above, it fails to retrieve the operation for the network creation because the context is already cancelled (because of the SIGTERM).

UPDATE: An issue from the past that I opened where failing to get the operation (rate limit exceeded) also leaked the resource: https://github.com/hashicorp/terraform-provider-google/issues/8655.

kon-angelo commented 1 year ago

I think, judging from the code, the culprit is more or less here (L263):

https://github.com/hashicorp/terraform-provider-google/blob/fa925cd5f70e927cdf3333262d54b9f2793f0d8e/google/resource_compute_network.go#L257-L265

The provider basically unsets the ID of the resource if the polling operation fails. In the case of context cancellation the operation is considered failed, which can be seen from the Error waiting for Creating Network log message.

I think always unsetting the ID of the resource is not helpful: aside from context cancellation, network problems, for example, may also cause the operation request to fail. In general, unsetting the ID only makes sense if, in every case where the polling operation fails, you are certain that the resource was not created, and that is not the case here.

This also seems to be a generic problem: in the few resources that I inspected, the same logic is implemented everywhere.


Also, I am not sure what exactly is gained by unsetting the ID. The subsequent terraform run will validate the existence of the resource anyway. Isn't it more beneficial to account for the worst-case scenario, i.e. that the resource was created, and allow subsequent runs to manage it without requiring operator intervention?

apparentlymart commented 1 year ago

Hi all! I normally work on Terraform Core, but I'm here today because of an issue about this bug opened in the main Terraform Core repository.

I just wanted to confirm that the previous comment seems correct to me: if the request is cancelled then the provider must return a representation of state that is accurate enough for a subsequent plan and apply to succeed, so returning a null object (which is what the SDK does if you set the id to an empty string) should be done only if you are totally sure that no object has been created in the remote system.

In this case it seems like a situation where the remote API requires some polling after the object has been created to wait until it's ready to use. If that's true then a strategy I might suggest is to represent the status of the object as a computed attribute and then if you get cancelled during polling return a new state with the status set to something representing an error.

Then on the next plan you can check during refresh if the object became ready in the meantime anyway, and update the status attribute if so. If the object remains not ready during the planning phase then you could report that the status needs to change to "ready" and mark that change as "Force new" (in the SDK's terminology) so that the provider will have an opportunity to destroy and recreate the object during the next apply.

Other designs are possible too, so I'm sharing the above only as one possible approach to avoid this problem. The general form of this idea is that if an object has been created in the remote system then you should always return an object representing it from the "apply" step. You can optionally return an error at the same time if the object is in an unsalvageable state, in which case Terraform will automatically mark it as "tainted" so it gets planned for replacement on the next plan. Or you can handle it more surgically by treating cancellation as a success with the object somehow marked as incomplete, so that the next plan and apply can deal with getting it into a working state.

rileykarson commented 1 year ago

Renamed this to cover resources generally.

rileykarson commented 1 year ago

We're always removing the resource from state in generated resources if we get an error when polling, even if the issue is with the polling itself. We should do a better job of identifying cases where the resource may have been created, or consider tainting more often (although that behaviour change across many resources would be too large for a minor version).

Thinking out loud: maybe we could call Read even when we get an error, rather than returning immediately; that's probably our most reliable way of telling whether the create succeeded. However, post-creates and similar may pose an issue.