hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Panic when updating google_container_node_pool node_config linux_node_config {} #12584

Open billyfoss opened 1 year ago

billyfoss commented 1 year ago

Terraform Version

Terraform v1.2.9 on linux_amd64

Affected Resource(s)

google_container_node_pool

Terraform Configuration Files

I am using the Google Kubernetes Engine CFT modules. I patched the module manually to work around an issue where linux_node_config showed an empty sysctls map being set that Terraform could not remove. The linux_node_config block is being added to some clusters, but not all. When it is added to a cluster, it seems to appear on all node pools in that cluster.

When running with the default module, I see a permanent diff of:

              - linux_node_config {
                  - sysctls = {} -> null
                }

To ensure this parameter was set, I patched the CFT module as follows: https://github.com/billyfoss/terraform-google-kubernetes-engine/commit/baf9dc1a2bc4505324844c7ef44c26e8306b9e99

This gave a plan of


            # (13 unchanged attributes hidden)

          + linux_node_config {}

            # (2 unchanged blocks hidden)

When running with that, I got the panic below.

Debug Output

Panic Output

Note: this is a redacted/limited output of the panic. If I see it again, I can parse out a smaller set of code to recreate it.


Unless you have made equivalent changes to your configuration, or ignored the
relevant attributes using ignore_changes, the following plan may include
actions to undo or respond to these changes.

─────────────────────────────────────────────────────────────────────────────

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place
  - destroy
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # module.gke-common.google_container_node_pool.pools["default"] will be updated in-place
  ~ resource "google_container_node_pool" "pools" {
        id                          = "projects/my-gke-cluster/locations/us-east1/clusters/my-gke-cluster/nodePools/default"
        name                        = "default"
        # (10 unchanged attributes hidden)

      ~ node_config {
            tags              = [
                "gke-my-gke-cluster",
                "gke-my-gke-cluster-default",
            ]
            # (13 unchanged attributes hidden)

          + linux_node_config {}

            # (2 unchanged blocks hidden)
        }

        # (5 unchanged blocks hidden)
    }

Warning: Resource targeting is in effect

You are creating a plan with the -target option, which means that the result
of this plan may not represent all of the changes requested by the current
configuration.

The -target option is not for routine use, and is provided only for
exceptional situations such as recovering from errors or mistakes, or when
Terraform specifically suggests to use it as part of an error message.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.gke-common.google_container_node_pool.pools["default"]: Modifying... [id=projects/my-gke-cluster/locations/us-east1/clusters/my-gke-cluster/nodePools/default]

Warning: Applied changes may be incomplete

The plan was created with the -target option in effect, so some changes
requested in the configuration may have been ignored and the output values
may not be fully updated. Run the following command to verify that no other
changes are pending:
    terraform plan

Note that the -target option is not suitable for routine use, and is provided
only for exceptional situations such as recovering from errors or mistakes,
or when Terraform specifically suggests to use it as part of an error
message.

Error: Plugin did not respond

  with module.gke-common.google_container_node_pool.pools["default"],
  on .terraform/modules/gke-common/modules/beta-private-cluster/cluster.tf line 416, in resource "google_container_node_pool" "pools":
 416: resource "google_container_node_pool" "pools" {

The plugin encountered an error, and failed to respond to the
plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may contain
more details.

Stack trace from the terraform-provider-google-beta_v4.33.0_x5 plugin:

panic: interface conversion: interface {} is nil, not map[string]interface {}

goroutine 78 [running]:
github.com/hashicorp/terraform-provider-google-beta/google-beta.expandLinuxNodeConfig(...)
    github.com/hashicorp/terraform-provider-google-beta/google-beta/node_config.go:621
github.com/hashicorp/terraform-provider-google-beta/google-beta.nodePoolUpdate(0xc000d58300, {0x2df81c0?, 0xc000ba5000}, 0xc0010bf110, {0x0, 0x0}, 0x274a48a7800)
    github.com/hashicorp/terraform-provider-google-beta/google-beta/resource_container_node_pool.go:1094 +0x20f4
github.com/hashicorp/terraform-provider-google-beta/google-beta.resourceContainerNodePoolUpdate(0xc000d58300, {0x2df81c0?, 0xc000ba5000?})
    github.com/hashicorp/terraform-provider-google-beta/google-beta/resource_container_node_pool.go:538 +0x3b4
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).update(0x337a1b0?, {0x337a1b0?, 0xc000a92720?}, 0xd?, {0x2df81c0?, 0xc000ba5000?})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.18.0/helper/schema/resource.go:729 +0x178
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc0009927e0, {0x337a1b0, 0xc000a92720}, 0xc000820820, 0xc000dfb680, {0x2df81c0, 0xc000ba5000})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.18.0/helper/schema/resource.go:847 +0x82c
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*GRPCProviderServer).ApplyResourceChange(0xc00000d830, {0x337a1b0?, 0xc000a92660?}, 0xc000b7ed70)
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.18.0/helper/schema/grpc_provider.go:1021 +0xe3c
github.com/hashicorp/terraform-plugin-go/tfprotov5/tf5server.(*server).ApplyResourceChange(0xc0004f4320, {0x337a1b0?, 0xc000a63ec0?}, 0xc000546930)
    github.com/hashicorp/terraform-plugin-go@v0.10.0/tfprotov5/tf5server/server.go:813 +0x4fc
github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5._Provider_ApplyResourceChange_Handler({0x2d894a0?, 0xc0004f4320}, {0x337a1b0, 0xc000a63ec0}, 0xc000a90060, 0x0)
    github.com/hashicorp/terraform-plugin-go@v0.10.0/tfprotov5/internal/tfplugin5/tfplugin5_grpc.pb.go:385 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00023a700, {0x337f800, 0xc0009f29c0}, 0xc00082c480, 0xc0009cf2f0, 0x4587840, 0x0)
    google.golang.org/grpc@v1.47.0/server.go:1283 +0xcfd
google.golang.org/grpc.(*Server).handleStream(0xc00023a700, {0x337f800, 0xc0009f29c0}, 0xc00082c480, 0x0)
    google.golang.org/grpc@v1.47.0/server.go:1620 +0xa1b
google.golang.org/grpc.(*Server).serveStreams.func1.2()
    google.golang.org/grpc@v1.47.0/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
    google.golang.org/grpc@v1.47.0/server.go:920 +0x28a

Error: The terraform-provider-google-beta_v4.33.0_x5 plugin crashed!

This is always indicative of a bug within the plugin. It would be immensely
helpful if you could report the crash with the plugin's maintainers so that it
can be fixed. The output above should help diagnose the issue.
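
The panic message in the trace above is what Go reports when an unchecked type assertion is applied to a nil interface value. The following is a minimal sketch of that failure mode and a guarded alternative. It assumes (this is not confirmed in the report) that an empty `linux_node_config {}` block reaches the expander as a one-element list whose only element is nil. The function names are hypothetical and this is not the provider's actual code.

```go
package main

import "fmt"

// expandUnchecked mimics an expander that assumes the first list element is
// always a map. Hypothetical illustration, not the provider's code.
func expandUnchecked(v interface{}) map[string]string {
	l := v.([]interface{})
	if len(l) == 0 {
		return nil
	}
	cfg := l[0].(map[string]interface{}) // panics when l[0] is nil
	return toStringMap(cfg["sysctls"])
}

// expandChecked guards against a nil element and uses comma-ok assertions,
// so an empty block yields nil instead of a panic.
func expandChecked(v interface{}) map[string]string {
	l, ok := v.([]interface{})
	if !ok || len(l) == 0 || l[0] == nil {
		return nil
	}
	cfg, ok := l[0].(map[string]interface{})
	if !ok {
		return nil
	}
	return toStringMap(cfg["sysctls"])
}

// toStringMap converts a generic map into map[string]string, or returns nil
// when the value is absent or not a map.
func toStringMap(v interface{}) map[string]string {
	m, ok := v.(map[string]interface{})
	if !ok {
		return nil
	}
	out := make(map[string]string, len(m))
	for k, val := range m {
		out[k] = fmt.Sprint(val)
	}
	return out
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	emptyBlock := []interface{}{nil} // assumed shape of an empty block

	fmt.Println(panics(func() { expandUnchecked(emptyBlock) })) // true
	fmt.Println(expandChecked(emptyBlock) == nil)               // true
}
```

Whether a guarded expander should send nothing or an explicit empty object to the API is a separate design question that this sketch does not settle.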

Expected Behavior

I was hoping to be able to trigger the linux_node_config parameter to become active in the cluster node pool without recreating the existing nodes.

Actual Behavior

Panic

Steps to Reproduce

  1. terraform apply

Important Factoids

Note: I am running 4.33 because of https://github.com/hashicorp/terraform-provider-google/issues/12422

References

b/302797579

billyfoss commented 1 year ago

Note: when trying to apply the plan

          - linux_node_config {
              - sysctls = {} -> null
            }

I get

Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels] must be specified., badRequest

edwardmedia commented 1 year ago

Possible duplicate of https://github.com/hashicorp/terraform-provider-google/issues/12557

arikmaor commented 1 year ago

Had the same problem. Managed to work around it for now by adding:

    linux_node_config {
      sysctls = {}
    }

to the node_config block

alexmeise commented 1 year ago

I had to remove:


    linux_node_config {
      sysctls = {}
    }

in order to be able to deploy a new node pool.

Otherwise, I get:

panic: interface conversion: interface {} is nil, not map[string]interface {}

SarahFrench commented 1 year ago

Hi @billyfoss - I'm having some trouble reproducing this issue. Could you please post an example of how you're calling the beta-private-cluster module?

heaton-dev commented 1 year ago

Getting the same when setting linux_node_config.sysctls in google_container_node_pool.

module.nodepool-1[0].google_container_node_pool.nodepool: Modifying... [id=projects/k1-k8s-1-ggya/locations/europe-west1/clusters/k1-k8s-1-ggya-45a0/nodePools/nodepool-1]
╷
│ Error: Plugin did not respond
│
│   with module.nodepool-1[0].google_container_node_pool.nodepool,
│   on fabric/modules/gke-nodepool/main.tf line 70, in resource "google_container_node_pool" "nodepool":
│   70: resource "google_container_node_pool" "nodepool" {
│
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may contain more details.
╵

Stack trace from the terraform-provider-google-beta_v4.44.1_x5 plugin:

panic: interface conversion: interface {} is nil, not map[string]interface {}

goroutine 62 [running]:
github.com/hashicorp/terraform-provider-google-beta/google-beta.expandLinuxNodeConfig(...)
        github.com/hashicorp/terraform-provider-google-beta/google-beta/node_config.go:749
github.com/hashicorp/terraform-provider-google-beta/google-beta.nodePoolUpdate(0xc000e8c680, {0x2f77760?, 0xc000e80900}, 0xc0010745d0, {0x0, 0x0}, 0x1a3185c5000)
        github.com/hashicorp/terraform-provider-google-beta/google-beta/resource_container_node_pool.go:1357 +0x2c55
github.com/hashicorp/terraform-provider-google-beta/google-beta.resourceContainerNodePoolUpdate(0xc000e8c680, {0x2f77760?, 0xc000e80900?})
        github.com/hashicorp/terraform-provider-google-beta/google-beta/resource_container_node_pool.go:618 +0x3b4
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).update(0x3546620?, {0x3546620?, 0xc000652780?}, 0xd?, {0x2f77760?, 0xc000e80900?})
        github.com/hashicorp/terraform-plugin-sdk/v2@v2.18.0/helper/schema/resource.go:729 +0x178
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc000ce7dc0, {0x3546620, 0xc000652780}, 0xc00063bd40, 0xc000e8c500, {0x2f77760, 0xc000e80900})
        github.com/hashicorp/terraform-plugin-sdk/v2@v2.18.0/helper/schema/resource.go:847 +0x82c
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*GRPCProviderServer).ApplyResourceChange(0xc00000cfc0, {0x3546620?, 0xc0006526c0?}, 0xc0006243c0)
        github.com/hashicorp/terraform-plugin-sdk/v2@v2.18.0/helper/schema/grpc_provider.go:1021 +0xe3c
github.com/hashicorp/terraform-plugin-go/tfprotov5/tf5server.(*server).ApplyResourceChange(0xc0002ec1e0, {0x3546620?, 0xc000d2df50?}, 0xc0013c1650)
        github.com/hashicorp/terraform-plugin-go@v0.10.0/tfprotov5/tf5server/server.go:813 +0x4fc
github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5._Provider_ApplyResourceChange_Handler({0x2f05580?, 0xc0002ec1e0}, {0x3546620, 0xc000d2df50}, 0xc0013c15e0, 0x0)
        github.com/hashicorp/terraform-plugin-go@v0.10.0/tfprotov5/internal/tfplugin5/tfplugin5_grpc.pb.go:385 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0000001e0, {0x354c030, 0xc000712000}, 0xc000639b00, 0xc000d2d5c0, 0x47e0e20, 0x0)
        google.golang.org/grpc@v1.50.1/server.go:1340 +0xd13
google.golang.org/grpc.(*Server).handleStream(0xc0000001e0, {0x354c030, 0xc000712000}, 0xc000639b00, 0x0)
        google.golang.org/grpc@v1.50.1/server.go:1713 +0xa1b
google.golang.org/grpc.(*Server).serveStreams.func1.2()
        google.golang.org/grpc@v1.50.1/server.go:965 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
        google.golang.org/grpc@v1.50.1/server.go:963 +0x28a

Error: The terraform-provider-google-beta_v4.44.1_x5 plugin crashed!

This is always indicative of a bug within the plugin. It would be immensely
helpful if you could report the crash with the plugin's maintainers so that it
can be fixed. The output above should help diagnose the issue.

SarahFrench commented 1 year ago

Hi @joeheaton, thanks for providing information about the bug! Are you also using the beta-private-cluster module?

Could you please post some example Terraform configuration or other description in this issue to help me reproduce the bug you're seeing? For example, if you are using the module above, what value are you passing in as the node_pools_linux_node_configs_sysctls input?

ivocalado commented 1 year ago

I just faced this exact problem when using the google-beta provider. In my case, I could work around it by replacing the google-beta provider with the google provider.

tomzx commented 1 year ago

As of 4.66.0, it seems this problem is present in both google and google-beta providers.

Guent4 commented 1 year ago

Some notes of what I've noticed wrt this issue:

Unfortunately, I don't really have any conclusive findings to offer, but I wanted to share whatever observations I can to help.

Guent4 commented 1 year ago

Figured out why sometimes we see the

        - linux_node_config {
          - sysctls = {} -> null
        }

diff. It depends on whether the Google API returns the block. You can verify by running gcloud container clusters describe .... If the output includes nodePools[*].config.linuxNodeConfig: {}, then you'll get the above diff. Unfortunately, I'm still not sure why some clusters return the field and some don't.

We are bypassing this issue for now by setting

  lifecycle {
    ignore_changes = [
      node_config[0].linux_node_config
    ]
  }

because we don't actually use that block. Not a permanent fix for sure.
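
The check described above (whether the API returns a linuxNodeConfig entry for a node pool) can be sketched in Go against the JSON form of the describe output. This is a rough sketch that assumes the output was produced with `--format=json` and that node pools appear under `nodePools[].config` as in the comment above. The function name and the sample JSON are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cluster models only the fields needed for this check.
type cluster struct {
	NodePools []struct {
		Name   string `json:"name"`
		Config struct {
			// A pointer distinguishes "field absent" (nil) from "field
			// present but empty" (non-nil, contents "{}").
			LinuxNodeConfig *json.RawMessage `json:"linuxNodeConfig"`
		} `json:"config"`
	} `json:"nodePools"`
}

// poolsWithLinuxNodeConfig returns the names of node pools for which the
// API output contains a linuxNodeConfig field, even an empty one.
func poolsWithLinuxNodeConfig(describeJSON []byte) ([]string, error) {
	var c cluster
	if err := json.Unmarshal(describeJSON, &c); err != nil {
		return nil, err
	}
	var names []string
	for _, p := range c.NodePools {
		if p.Config.LinuxNodeConfig != nil {
			names = append(names, p.Name)
		}
	}
	return names, nil
}

func main() {
	// Hypothetical, heavily trimmed describe output.
	sample := []byte(`{
	  "nodePools": [
	    {"name": "default", "config": {"linuxNodeConfig": {}}},
	    {"name": "other",   "config": {"machineType": "e2-medium"}}
	  ]
	}`)

	names, err := poolsWithLinuxNodeConfig(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [default]
}
```

Pools reported by this check would be the ones expected to show the `linux_node_config` removal diff if the block is not set in the Terraform configuration.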

kustodian commented 2 weeks ago

I'm getting the same problem on the latest Google provider 5.39.1 but with the kubelet_config. Here is the diff:

  ~ resource "google_container_node_pool" "pools" {
        id                          = "projects/project1/locations/asia-east2/clusters/cluster-2/nodePools/default-spot"
        name                        = "default-spot"
        # (10 unchanged attributes hidden)

      ~ node_config {

          - kubelet_config {
              - cpu_cfs_quota  = false -> null
              - pod_pids_limit = 0 -> null
            }

            # (3 unchanged blocks hidden)
        }

        # (4 unchanged blocks hidden)
    }

and when I try to apply it I get this output:

│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning', 'max_run_duration'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0x69268b2f63ffd33c"
│   }
│ ]
│ , badRequest
│
│   with module.gke_cluster.google_container_node_pool.pools["default-spot"],
│   on ../../../../modules/gke-cluster/node_pools.tf line 4, in resource "google_container_node_pool" "pools":
│    4: resource "google_container_node_pool" "pools" {

I know exactly what I did to make the GCP API start returning the kubelet_config block. I got an email from GCP asking me to disable the insecure kubelet port. Because that is still not implemented in Terraform (issue), I followed the GCP guide. After that, when I ran Terraform, it reported that it wants to remove the kubelet_config block because I didn't set it in my configuration. For now I work around it with ignore_changes, as @Guent4 suggested, but this is not the right way to go.

This issue should get a higher priority because Google's suggestion is causing this problem in the first place.