hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

"'subnetworks/default' is not ready" error thrown sporadically due to google_container_cluster adding ranges? #10972

Open gygitlab opened 2 years ago

gygitlab commented 2 years ago

We recently noticed some "'subnetworks/default' is not ready" errors in our Terraform runs on new environments:

Error: Error creating instance: googleapi: Error 400: The resource 'projects/<redacted>/regions/us-east1/subnetworks/default' is not ready, resourceNotReady

We haven't seen this before, but I think it's happening because we have both VMs and a GKE cluster being created here at the same time.

When the GKE cluster starts to be created, it adds its additional ranges to the target subnet, which in turn "locks" the subnet for some time, preventing other resources from being created against it.

Rerunning the apply works fine, so it feels like this could be handled more gracefully, either by retrying or by holding the cluster build until other resources are done. We could work around this on our end, but at the same time it doesn't seem unreasonable to deploy both a cluster and other resources to the same subnet, so I thought it was worth raising. One possible workaround is sketched below.
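As a sketch of that workaround (assuming the subnetwork is managed in the same configuration rather than being the auto-created default; the names and CIDRs are illustrative): pre-defining named secondary ranges on the subnet lets the cluster reference them instead of patching the subnetwork at create time, so the subnet is never locked mid-apply.

resource "google_compute_subnetwork" "subnet" {
  name          = "example-subnet" # illustrative
  region        = "us-east1"
  network       = local.vpc_name
  ip_cidr_range = "10.0.0.0/16" # illustrative

  # Pre-allocating these ranges means the GKE cluster does not need to
  # edit (and temporarily lock) the subnetwork during creation.
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14" # illustrative
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20" # illustrative
  }
}

resource "google_container_cluster" "cluster" {
  # ... other arguments as in the configuration below ...

  subnetwork = google_compute_subnetwork.subnet.name

  # Reference the pre-allocated ranges instead of using an empty block.
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }
}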

Terraform Version

1.1.4

Affected Resource(s)

  * google_container_cluster
  * google_compute_instance

Terraform Configuration Files

resource "google_container_cluster" "cluster" {
  count = min(local.total_node_pool_count, 1)
  name  = var.prefix

  remove_default_node_pool = true
  initial_node_count       = 1
  enable_shielded_nodes    = true

  network    = local.vpc_name # Default
  subnetwork = local.subnet_name # Default

  # Require VPC Native cluster
  # https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/using_gke_with_terraform#vpc-native-clusters
  # An empty block enables this and picks the ranges at random
  ip_allocation_policy {}

  release_channel {
    channel = "STABLE"
  }

  node_config {
    shielded_instance_config {
      enable_secure_boot = var.machine_secure_boot
    }
  }
}

resource "google_compute_instance" "node" {
  count        = var.node_count
  name         = "${local.name_prefix}-${count.index + 1}"
  machine_type = var.machine_type

  allow_stopping_for_update = var.allow_stopping_for_update

  shielded_instance_config {
    enable_secure_boot = var.machine_secure_boot
  }

  boot_disk {
    initialize_params {
      image = var.machine_image
      size  = var.disk_size
      type  = var.disk_type
    }
  }

  metadata = {
    enable-oslogin = "TRUE"
  }

  network_interface {
    network    = var.vpc
    subnetwork = var.subnet
  }

  service_account {
    scopes = concat(["storage-rw"], var.scopes)
  }

  lifecycle {
    ignore_changes = [
      min_cpu_platform
    ]
  }
}

Expected Behavior

The provider should gracefully handle any timing clashes caused by the cluster modifying a subnet it shares with other resources.

Actual Behavior

The provider creates the cluster at the same time as the VMs. As a result, the VMs get a 400 error from the API while the cluster is editing the subnet to add more ranges.
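A blunter user-side mitigation (a sketch against the configuration above, not something the provider requires) is to serialize the creates explicitly so the VMs are only created once the cluster has finished editing the subnet:

resource "google_compute_instance" "node" {
  # ... arguments as in the configuration above ...

  # Wait for the cluster to finish adding its secondary ranges before
  # creating instances on the shared subnet. Slower apply, but avoids
  # the 400 resourceNotReady clash.
  depends_on = [google_container_cluster.cluster]
}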

Steps to Reproduce

  1. Configure a VPC Native cluster and several VMs to deploy on the same subnet
  2. Attempt to apply and notice that the VMs sometimes fail to deploy with the above 400 error

References

b/300616739

shuyama1 commented 2 years ago

Hi @grantyoung. We should already retry when APIs return this error. Would you mind sharing your debug log?

gygitlab commented 2 years ago

Hi @shuyama1. The log can be seen here, thanks.

rileykarson commented 1 year ago

We run into this in our nightly tests a lot. It's definitely not service-specific, so I'm going to reclassify it as provider-wide.

SarahFrench commented 1 year ago

Discussion from triage: a possible way to fix this issue is to implement a retry in the provider.

melinath commented 1 year ago

We already have a retry in the provider for exactly this case: https://github.com/hashicorp/terraform-provider-google/blob/3cfedbb9f6e8b020ed3ff94179ce631ee92aefc2/google/transport/error_retry_predicates.go#L120 Based on test logs, it looks like the retry gets called repeatedly throughout a test, presumably until some limit is hit (hopefully not a timeout). I'll look into whether it's possible to add backoff and jitter if those aren't already present, or to increase the number of retries / the timeout.

rileykarson commented 1 year ago

I noticed in TestAccComputeInstance_resourcePolicyUpdate (in this execution) that we're hitting a context deadline really early: 2m15s instead of the expected 20m or so. Maybe we're attaching a short one to the retry transport?
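For comparison, the deadline an ordinary apply gets for a single create is user-configurable through the resource's timeouts block, so a stopgap sketch for anyone hitting this outside the test suite might look like the following (the 30m value is illustrative; the provider documents a 20-minute default create timeout for this resource):

resource "google_compute_instance" "node" {
  # ... arguments as in the configuration above ...

  # Extra headroom for the provider's resourceNotReady retries while the
  # subnetwork is locked. Assumes the retry loop runs until this deadline.
  timeouts {
    create = "30m"
  }
}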

melinath commented 1 week ago

Tentatively assigning test-failure-10