hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

google_container_node_pool create fails with googleapi 404 error, subsequent applies fail with node_pool already exists #20087

Open arueth opened 1 month ago

arueth commented 1 month ago

Terraform Version & Provider Version(s)

Terraform v1.9.7 on amd64/x86_64

Affected Resource(s)

google_container_node_pool

Terraform Configuration

Using the Terraform located here: https://github.com/GoogleCloudPlatform/accelerated-platforms/tree/main/platforms/gke-aiml/playground

Debug Output

No response

Expected Behavior

The node pool is created successfully.

Actual Behavior

│ Error: googleapi: got HTTP response code 404 with body: <!DOCTYPE html>
│ <html lang=en>
│   <meta charset=utf-8>
│   <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
│   <title>Error 404 (Not Found)!!1</title>
│   <style>
│     *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
│   </style>
│   <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
│   <p><b>404.</b> <ins>That’s an error.</ins>
│   <p>The requested URL <code>/v1/?alt=json&amp;prettyPrint=false</code> was not found on this server.  <ins>That’s all we know.</ins>
│
│
│   with google_container_node_pool.gpu_a100x2_a2u2_res,
│   on container_node_pool.tf line 345, in resource "google_container_node_pool" "gpu_a100x2_a2u2_res":
│  345: resource "google_container_node_pool" "gpu_a100x2_a2u2_res" {
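
The issue title notes that later applies fail because the node pool already exists, which suggests the pool was created on the GKE side without being recorded in Terraform state. One possible way to recover, sketched here under that assumption, is to import the existing pool before re-applying. The provider documents the import ID as {{project_id}}/{{location}}/{{cluster_id}}/{{pool_id}}; the helper name and the location, cluster, and pool values below are hypothetical placeholders, and the resource address may need a module prefix in this configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): build a node pool import ID
# in the {project}/{location}/{cluster}/{node_pool} form.
node_pool_import_id() {
  local project="$1" location="$2" cluster="$3" pool="$4"
  printf '%s/%s/%s/%s\n' "${project}" "${location}" "${cluster}" "${pool}"
}

# Example with placeholder values (location, cluster, and pool names are assumptions):
# terraform import google_container_node_pool.gpu_a100x2_a2u2_res \
#   "$(node_pool_import_id "${MLP_PROJECT_ID}" us-central1 my-cluster my-node-pool)"
```

If the pool was left in an error state by the failed create, a later plan may still propose replacing it, so the plan output should be reviewed after the import.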

Steps to reproduce

  1. export MLP_PROJECT_ID=<Project ID>
  2. test/scripts/qwiklabs/playground_byop_oci_apply.sh

I have only seen it happen when provisioning multiple environments for a workshop. Even then, it happens roughly 10% of the time or less. I have not seen the error during single-environment testing.
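
Because the failure is intermittent and no debug output was captured, one option is to enable Terraform debug logging on every run so that a log exists when the error does occur. The sketch below relies on the standard TF_LOG and TF_LOG_PATH environment variables; the function name and log directory are hypothetical, and it assumes the apply script passes its environment through to terraform.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run any command with Terraform debug logging enabled,
# writing to a separate log file per invocation.
run_with_tf_debug() {
  local log_dir="$1"
  shift
  mkdir -p "${log_dir}"
  # Timestamp plus shell PID keeps concurrent runs from sharing a file.
  local log_file="${log_dir}/terraform-$(date +%Y%m%dT%H%M%S)-$$.log"
  TF_LOG=DEBUG TF_LOG_PATH="${log_file}" "$@"
  local rc=$?
  # Report the log location on stderr so the command's stdout is unchanged.
  echo "Terraform debug log: ${log_file}" >&2
  return "${rc}"
}

# Example (log directory is an assumption):
# run_with_tf_debug /tmp/tf-logs test/scripts/qwiklabs/playground_byop_oci_apply.sh
```

The wrapper returns the wrapped command's exit status, so it can be dropped into an existing provisioning loop without changing how failures are detected.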

Important Factoids

This happens somewhat frequently when provisioning labs for Qwiklabs, and it is not always the same node pool.

References

It seems that now that https://github.com/hashicorp/terraform-provider-google/issues/19895 has been fixed, this is the new error.

b/381135470

ggtisc commented 4 weeks ago

Hi @arueth!

Since this is a large project, several factors could be contributing to the error you are seeing. Have you tried creating only the google_container_node_pool resource with its google_container_cluster?

I also see that the file that is supposed to contain the code for the google_container_node_pool resource appears to be empty, containing only a single line. Could you share just the google_container_node_pool code with its google_container_cluster here, so we can test it in an isolated project and find out what could be causing the error?

arueth commented 1 week ago

I'm not sure what you mean by "Have you tried creating only the google_container_node_pool resource with its google_container_cluster?" The cluster and node pool creation is part of the overall terraform apply.

The file containing the google_container_node_pool resources is not empty; it is a symbolic link to another file.

ggtisc commented 1 week ago

I tried to reproduce this issue many times but could not trigger any error. In a chat, @arueth explained that this is a sporadic failure and therefore difficult to reproduce, but that it happens fairly often when spinning up many labs in Qwiklabs. He relates this issue to this one because it looks like it sometimes returns a different message.