hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

GKE cluster creation fails with obscure error when using empty "fleet" block when relying on provider default project #16680

Open jbrook opened 7 months ago

jbrook commented 7 months ago

GKE cluster creation fails when relying on the provider's default project and setting an empty fleet block.

Terraform Version

Terraform v1.6.5 on linux_amd64

Affected Resource(s)

google_container_cluster

Terraform Configuration Files

resource "google_container_cluster" "cluster" {
  name                = "asm-cluster-2"
  location            = var.region
  resource_labels     = { mesh_id : "proj-${data.google_project.project.number}" }
  deletion_protection = false # Warning: Do not set deletion_protection to false for production clusters

  enable_autopilot = true
  fleet {}
}

data "google_project" "project" {}

variable "region" {
  type        = string
  default     = "us-central1"
  description = "The region to host the cluster in (Autopilot clusters are always regional)"
}

Debug Output

https://gist.github.com/jbrook/40626e57c1062f2d46a4f0a0b26676dd

Panic Output

n/a

Expected Behavior

A GKE Autopilot cluster should be created. It should be a member of a Fleet using the cluster's project.

Actual Behavior

Cluster creation fails with an obscure timeout error:

│ Error: timeout while waiting for state to become 'success' (timeout: 1m0s)
│
│   with google_container_cluster.cluster,
│   on main.tf line 16, in resource "google_container_cluster" "cluster":
│   16: resource "google_container_cluster" "cluster" {

The Terraform debug log shows that it's an API error. No fleet project ID is sent when POSTing to the GKE API to create a cluster (the request body contains an empty "fleet" object):

POST /v1/projects/my-project-name/locations/us-central1/clusters?alt=json&pret
Host: container.googleapis.com
User-Agent: google-api-go-client/0.5 Terraform/1.6.5 (+https://www.terraform.i
Content-Length: 809
Content-Type: application/json
X-Goog-Api-Client: gl-go/1.19.9 gdcl/0.148.0
Accept-Encoding: gzip
{
 "cluster": {
  "autopilot": {
   "enabled": true
  },
  "binaryAuthorization": {
   "enabled": false
  },
  "fleet": {},
  "legacyAbac": {
   "enabled": false
  },
  "maintenancePolicy": {
   "window": {}
  },
  "masterAuthorizedNetworksConfig": {},
  "name": "asm-cluster-2",
  "network": "projects/my-project-name/global/networks/default",
  "networkConfig": {
   "enableIntraNodeVisibility": true
  },
  "networkPolicy": {},
  "nodeConfig": {
   "oauthScopes": [
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/trace.append"
   ]
  },
  "notificationConfig": {
   "pubsub": {}
  },
  "resourceLabels": {
   "mesh_id": "proj-xxxxxxxxxxxxx"
  },
  "shieldedNodes": {
   "enabled": true
  }
 }
}

The API responds with a 500 error triggered by an Anthos entitlement check. Note that the cause is an invalid resource name because the project ID is missing:

{
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:   "error": {
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:     "code": 500,
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:     "message": "Internal error encountered.",
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:     "errors": [
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:       {
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:         "message": "Internal error encountered.",
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:         "domain": "global",
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:         "reason": "backendError",
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:         "debugInfo": "stack_entries:
<snip>
"\ndetail: \"ENTERPRISE_ANTHOS_ENTITLEMENT_ERROR: failed to verify anthos entitlement: INVALID_ARGUMENT: generic::invalid_argument: Invalid project resource name projects/\"\n"
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:       }
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:     ],
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:     "status": "INTERNAL",
<snip>
"detail": "ENTERPRISE_ANTHOS_ENTITLEMENT_ERROR: failed to verify anthos entitlement: INVALID_ARGUMENT: generic::invalid_argument: Invalid project resource name projects/"
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:       }
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:     ]
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5:   }
2023-12-05T16:32:57.966Z [DEBUG] provider.terraform-provider-google_v5.8.0_x5: }

The GKE Operations logs in Cloud Logging show only "Internal Error" and "13", which makes this hard to debug.

It seems that the GKE API does not accept an empty fleet project ID when creating a cluster. The provider documentation for the fleet block says that the project argument is optional.

When the argument is omitted, the provider should set the fleet project to the provider default project when calling the API.

Cluster creation succeeds when the project is set explicitly in the fleet block, here using the google_project data source:

resource "google_container_cluster" "cluster" {
  name                = "asm-cluster-2"
  location            = var.region
  resource_labels     = { mesh_id : "proj-${data.google_project.project.number}" }
  deletion_protection = false # Warning: Do not set deletion_protection to false for production clusters

  enable_autopilot = true
  fleet {
    project = data.google_project.project.name
  }
}
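
A variant of this workaround, as a sketch only: the google_project data source also exports project_id, which holds the project ID, whereas name is the display name and merely happens to match the ID in many projects. Assuming the same data source and region variable as in the configuration above, and assuming the fleet project accepts a project ID, the fleet block could reference the ID directly:

```hcl
resource "google_container_cluster" "cluster" {
  name                = "asm-cluster-2"
  location            = var.region
  resource_labels     = { mesh_id = "proj-${data.google_project.project.number}" }
  deletion_protection = false # Warning: Do not set deletion_protection to false for production clusters

  enable_autopilot = true

  fleet {
    # project_id is the project ID; name is the human-readable display name.
    project = data.google_project.project.project_id
  }
}

# With no arguments, this data source resolves to the provider's default project.
data "google_project" "project" {}
```

This has not been verified against the failing setup in this report; it only swaps the attribute used in the workaround.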

Steps to Reproduce

  1. GOOGLE_CLOUD_PROJECT=my-project-name terraform apply

References

b/315120659

edwardmedia commented 7 months ago

@jbrook I think this is expected. project is optional if it is provided by the provider defaults. One way or the other, you still need to provide these config values. This rule is applied on other fields like region, zone, etc.

Did I misunderstand your issue?

jbrook commented 7 months ago

I think so - I provided the project via the provider defaults (GOOGLE_CLOUD_PROJECT environment variable). The fleet block didn't use it.

edwardmedia commented 7 months ago

Oh, I see what you meant now. Thanks @jbrook

Here is the questionable code.

jiayimeow commented 1 month ago

Thanks James and Edward for looking into this issue!

The project field under the fleet block is conceptually different from the cluster project. It could be the cluster project, or it could be a different project. Therefore, the user needs to provide an explicit value for the fleet project in order to register the cluster to that fleet project. The provider defaults (GOOGLE_CLOUD_PROJECT) will not be used automatically as the fleet project.
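
To illustrate the distinction, here is a minimal sketch of a configuration that registers the cluster to a fleet host project chosen independently of the provider default. The variable name fleet_project is hypothetical and not part of the provider; the cluster name and region are taken from the report above.

```hcl
# Hypothetical variable: the fleet host project. It may be the cluster's own
# project or a different one, so it is passed in explicitly.
variable "fleet_project" {
  type        = string
  description = "Project ID of the fleet host project to register the cluster to"
}

resource "google_container_cluster" "cluster" {
  name                = "asm-cluster-2"
  location            = "us-central1"
  deletion_protection = false # Warning: Do not set deletion_protection to false for production clusters

  enable_autopilot = true

  fleet {
    # Set explicitly; the provider default project is not applied here.
    project = var.fleet_project
  }
}
```

The cluster itself is still created in the provider default project (for example the one given by GOOGLE_CLOUD_PROJECT); only the fleet registration uses the explicit value, which could be supplied with `terraform apply -var="fleet_project=my-project-name"`.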

As you mentioned, the error message (internal error) does seem confusing. We will work on improving the error handling here.