hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Terraform crashing when creating a dataproc cluster #19321

Closed tempus-omarsalka closed 1 month ago

tempus-omarsalka commented 2 months ago

Terraform Version & Provider Version(s)

Terraform v1.5.7 on amd64

Affected Resource(s)

google_dataproc_cluster

Terraform Configuration

resource "google_dataproc_cluster" "default" {
  provider                      = google-beta # tried the other provider too
  name                          = var.name
  region                        = coalesce(var.region, var.project_context.region)
  labels                        = var.project_context.labels
  project                       = var.project_context.project_id
  graceful_decommission_timeout = var.graceful_decommission_timeout

  dynamic "cluster_config" {
    for_each = var.cluster_config != null ? [null] : []
    content {
      staging_bucket = var.cluster_config.staging_bucket
      temp_bucket    = var.cluster_config.temp_bucket
      dynamic "master_config" {
        for_each = var.master_config != null ? [var.master_config] : []
        content {
          num_instances    = var.master_config.num_instances
          machine_type     = var.master_config.machine_type
          min_cpu_platform = var.master_config.min_cpu_platform
          disk_config {
            boot_disk_type    = var.master_config.disk_config.boot_disk_type
            boot_disk_size_gb = var.master_config.disk_config.boot_disk_size_gb
          }
          dynamic "accelerators" {
            for_each = var.accelerators
            content {
              accelerator_type  = accelerators.value.accelerator_type
              accelerator_count = accelerators.value.accelerator_count
            }
          }
        }
      }

      dynamic "worker_config" {
        for_each = var.worker_config != null ? [var.worker_config] : []
        content {
          num_instances    = var.worker_config.num_instances
          machine_type     = var.worker_config.machine_type
          min_cpu_platform = var.worker_config.min_cpu_platform
          disk_config {
            boot_disk_type    = var.worker_config.disk_config.boot_disk_type
            boot_disk_size_gb = var.worker_config.disk_config.boot_disk_size_gb
            num_local_ssds    = var.worker_config.disk_config.num_local_ssds
          }
        }
      }

      dynamic "autoscaling_config" {
        for_each = var.autoscaling_config != null ? [var.autoscaling_config] : []
        content {
          policy_uri = var.autoscaling_config.policy_uri
        }
      }

      dynamic "preemptible_worker_config" {
        for_each = var.preemptible_worker_config != null ? [var.preemptible_worker_config] : []
        content {
          num_instances = var.preemptible_worker_config.num_instances
          disk_config {
            boot_disk_type    = var.preemptible_worker_config.disk_config.boot_disk_type
            boot_disk_size_gb = var.preemptible_worker_config.disk_config.boot_disk_size_gb
            num_local_ssds    = var.preemptible_worker_config.disk_config.num_local_ssds
          }
          preemptibility = var.preemptible_worker_config.preemptibility
        }
      }

      dynamic "software_config" {
        for_each = var.software_config != null ? [var.software_config] : []
        content {
          image_version       = var.software_config.image_version
          optional_components = var.software_config.optional_components
          override_properties = var.software_config.override_properties
        }
      }

      dynamic "gce_cluster_config" {
        for_each = var.gce_cluster_config != null ? [var.gce_cluster_config] : []
        content {
          internal_ip_only       = lookup(var.gce_cluster_config, "internal_ip_only", null)
          metadata               = lookup(var.gce_cluster_config, "metadata", null)
          service_account        = lookup(var.gce_cluster_config, "service_account", null)
          service_account_scopes = lookup(var.gce_cluster_config, "service_account_scopes", null)
          subnetwork             = lookup(var.gce_cluster_config, "subnetwork", null)
          tags                   = lookup(var.gce_cluster_config, "tags", null)
          zone                   = lookup(var.gce_cluster_config, "zone", null)
        }
      }

      # You can define multiple initialization_action blocks
      dynamic "initialization_action" {
        for_each = var.initialization_action != null ? [var.initialization_action] : []
        content {
          script      = var.initialization_action.script
          timeout_sec = var.initialization_action.timeout_sec
        }
      }

      # You can define multiple initialization_action blocks
      dynamic "initialization_action" {
        for_each = var.initialization_actions
        content {
          script      = initialization_action.value.script
          timeout_sec = initialization_action.value.timeout_sec
        }
      }

      dynamic "lifecycle_config" {
        for_each = var.lifecycle_config != null ? [var.lifecycle_config] : []
        content {
          idle_delete_ttl = var.lifecycle_config.idle_delete_ttl
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [
      labels,
      cluster_config[0].worker_config[0].num_instances,
      cluster_config[0].preemptible_worker_config[0].num_instances,
    ]
  }
}

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 0" # "hashicorp/google" (latest)
    }
  }
}

We usually pull the latest version. After hitting this issue, we tried pinning several different v5 versions, to no avail.

Debug Output

This is where it gets weird. When I run Terraform in regular mode, Terraform panics and crashes, but the cluster still gets created. In debug mode, however, Terraform runs successfully with no panic message at all. I'm attaching the panic message, which I can only obtain on a regular run (no DEBUG):

google_secret_manager_secret_iam_binding.jsl_secret_version: Destroying... [id=projects/XXXXXXXXXXX/secrets/JSL_SECRET_VERSION/roles/secretmanager.secretAccessor]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Creating...
google_secret_manager_secret_iam_binding.jsl_secret_version: Destruction complete after 3s
google_secret_manager_secret_iam_binding.jsl_secret_version: Creating...
google_secret_manager_secret_iam_binding.jsl_secret_version: Creation complete after 3s [id=projects/XXXXXXXXXXX/secrets/JSL_SECRET_VERSION/roles/secretmanager.secretAccessor]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [10s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [20s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [30s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [40s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [50s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [1m0s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [1m10s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [1m20s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [1m30s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [1m40s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [1m50s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [2m0s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [2m10s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [2m20s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [2m30s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [2m40s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [2m50s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [3m0s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [3m10s elapsed]
module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default: Still creating... [3m20s elapsed]
╷
│ Warning: Deprecated Resource
│ 
│   with module.dataproc.google_notebooks_instance.instance,
│   on .terraform/modules/dataproc/modules/gcp/dataproc/environment/dataproc-hub.tf line 23, in resource "google_notebooks_instance" "instance":
│   23: resource "google_notebooks_instance" "instance" {
│ 
│ `google_notebook_instance` is deprecated and will be removed in a future
│ major release. Use `google_workbench_instance` instead.
│ 
│ (and 2 more similar warnings elsewhere)
╵
╷
│ Error: Plugin did not respond
│ 
│   with module.dataproc.module.dataproc_cluster.google_dataproc_cluster.default,
│   on .terraform/modules/dataproc/modules/gcp/compute/dataproc/main.tf line 1, in resource "google_dataproc_cluster" "default":
│    1: resource "google_dataproc_cluster" "default" {
│ 
│ The plugin encountered an error, and failed to respond to the
│ plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may
│ contain more details.
╵

Stack trace from the terraform-provider-google-beta_v5.42.0_x5 plugin:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x33688f4]

goroutine 73 [running]:
github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc.flattenKerberosConfig(0x49deb00?, 0x0)
    github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc/resource_dataproc_cluster.go:2896 +0x34
github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc.flattenSecurityConfig(...)
    github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc/resource_dataproc_cluster.go:2888
github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc.flattenClusterConfig(0xc001c86e80?, 0xc0014161e0)
    github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc/resource_dataproc_cluster.go:2856 +0x6b5
github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc.resourceDataprocClusterRead(0xc00011a2d0?, {0x50737e0?, 0xc0017b3800})
    github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc/resource_dataproc_cluster.go:2698 +0x51a
github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc.resourceDataprocClusterCreate(0x0?, {0x50737e0?, 0xc0017b3800})
    github.com/hashicorp/terraform-provider-google-beta/google-beta/services/dataproc/resource_dataproc_cluster.go:1788 +0x650
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).create(0x58ecc98?, {0x58ecc98?, 0xc001451c80?}, 0xd?, {0x50737e0?, 0xc0017b3800?})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.33.0/helper/schema/resource.go:766 +0x163
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc0010e9dc0, {0x58ecc98, 0xc001451c80}, 0xc001078b60, 0xc00142ff80, {0x50737e0, 0xc0017b3800})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.33.0/helper/schema/resource.go:909 +0xa89
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*GRPCProviderServer).ApplyResourceChange(0xc00115ef30, {0x58ecc98?, 0xc001451b90?}, 0xc001446c80)
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.33.0/helper/schema/grpc_provider.go:1078 +0xdbc
github.com/hashicorp/terraform-plugin-mux/tf5muxserver.(*muxServer).ApplyResourceChange(0x58eccd0?, {0x58ecc98?, 0xc001451890?}, 0xc001446c80)
    github.com/hashicorp/terraform-plugin-mux@v0.15.0/tf5muxserver/mux_server_ApplyResourceChange.go:36 +0x193
github.com/hashicorp/terraform-plugin-go/tfprotov5/tf5server.(*server).ApplyResourceChange(0xc000336d20, {0x58ecc98?, 0xc001450ea0?}, 0xc0002770a0)
    github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov5/tf5server/server.go:865 +0x3d0
github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5._Provider_ApplyResourceChange_Handler({0x500fd00?, 0xc000336d20}, {0x58ecc98, 0xc001450ea0}, 0xc00142ee80, 0x0)
    github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov5/internal/tfplugin5/tfplugin5_grpc.pb.go:518 +0x169
google.golang.org/grpc.(*Server).processUnaryRPC(0xc001164400, {0x58ecc98, 0xc001450e10}, {0x58f8928, 0xc0004afb00}, 0xc000e3ca20, 0xc00122e870, 0x78372f8, 0x0)
    google.golang.org/grpc@v1.64.1/server.go:1379 +0xe23
google.golang.org/grpc.(*Server).handleStream(0xc001164400, {0x58f8928, 0xc0004afb00}, 0xc000e3ca20)
    google.golang.org/grpc@v1.64.1/server.go:1790 +0x1016
google.golang.org/grpc.(*Server).serveStreams.func2.1()
    google.golang.org/grpc@v1.64.1/server.go:1029 +0x8b
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 51
    google.golang.org/grpc@v1.64.1/server.go:1040 +0x135

Error: The terraform-provider-google-beta_v5.42.0_x5 plugin crashed!

This is always indicative of a bug within the plugin. It would be immensely
helpful if you could report the crash with the plugin's maintainers so that it
can be fixed. The output above should help diagnose the issue.

Expected Behavior

Terraform completes cluster creation and exits successfully without crashing.

Actual Behavior

The dataproc cluster gets created but terraform panics and crashes.

Steps to reproduce

  1. terraform apply

Important Factoids

Terraform was last able to run this without issues on Aug 27th. The error was first observed today.

References

No response

b/363256703

piotrdziuba commented 2 months ago

Hi,

We were debugging exactly the same issue yesterday. Our observation is that this is most likely due to an empty field returned by the describe cluster action at the end of the provider's spawning process:

"securityConfig": {},

which is then being destructured https://github.com/hashicorp/terraform-provider-google/blob/v6.0.1/google/services/dataproc/resource_dataproc_cluster.go#L2884

This might be related to a recent (though unverified) addition of the identityConfig field (Dataproc REST API), which is not currently mentioned anywhere in the provider itself.
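To illustrate why an empty `"securityConfig": {}` is dangerous, here is a minimal Go sketch. The struct types are hypothetical, trimmed-down stand-ins for the Dataproc API response types (they are not the provider's or the API client's actual definitions), and `decodeClusterConfig` is a helper invented for this example. It only shows how Go's `encoding/json` treats an empty object assigned to a pointer field: the outer pointer becomes non-nil while the nested pointer stays nil.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical stand-ins for the API response types (assumption, not the real definitions).
type KerberosConfig struct {
	EnableKerberos bool `json:"enableKerberos,omitempty"`
}

type SecurityConfig struct {
	KerberosConfig *KerberosConfig `json:"kerberosConfig,omitempty"`
}

type ClusterConfig struct {
	SecurityConfig *SecurityConfig `json:"securityConfig,omitempty"`
}

// decodeClusterConfig parses a JSON fragment shaped like the describe-cluster response.
func decodeClusterConfig(raw string) (*ClusterConfig, error) {
	var cfg ClusterConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := decodeClusterConfig(`{"securityConfig": {}}`)
	if err != nil {
		panic(err)
	}
	// An empty JSON object still allocates the outer struct...
	fmt.Println("securityConfig is nil:", cfg.SecurityConfig == nil)
	// ...but the nested pointer is left nil, so reading a field through it would panic.
	fmt.Println("kerberosConfig is nil:", cfg.SecurityConfig.KerberosConfig == nil)
}
```

Under these assumptions, a check that only tests whether `securityConfig` is nil passes, and any code that then reads fields through `kerberosConfig` hits the nil pointer dereference seen in the stack trace.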

axp414 commented 2 months ago

Hi,

Any suggestions on how to resolve this issue or work around it?

roaks3 commented 2 months ago

Thanks, the logging and analysis are extremely helpful. Agreed that the issue appears to be a nil kerberosConfig, which could be caused by an empty securityConfig. This code has been untouched for ~5 years, so I would suspect an API change is the culprit (new identityConfig field like @piotrdziuba mentioned is very plausible, since previously we had kerberosConfig marked as required). We probably need a new nil check to resolve.
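A rough sketch of what such a nil check could look like follows. The function names mirror those in the stack trace, but the bodies, the struct types, and the flattened field names (`enable_kerberos`, `realm`, `kerberos_config`) are assumptions made for illustration. This is not the provider's actual code or the eventual fix.

```go
package main

import "fmt"

// Hypothetical stand-ins for the API types (assumption, not the real definitions).
type KerberosConfig struct {
	EnableKerberos bool
	Realm          string
}

type SecurityConfig struct {
	KerberosConfig *KerberosConfig
}

// flattenKerberosConfig returns nil instead of dereferencing a nil pointer.
func flattenKerberosConfig(kfg *KerberosConfig) []map[string]interface{} {
	if kfg == nil {
		return nil
	}
	return []map[string]interface{}{{
		"enable_kerberos": kfg.EnableKerberos,
		"realm":           kfg.Realm,
	}}
}

// flattenSecurityConfig only sets kerberos_config when the API returned one.
func flattenSecurityConfig(sc *SecurityConfig) []map[string]interface{} {
	if sc == nil {
		return nil
	}
	data := map[string]interface{}{}
	if k := flattenKerberosConfig(sc.KerberosConfig); k != nil {
		data["kerberos_config"] = k
	}
	return []map[string]interface{}{data}
}

func main() {
	// Simulates the API returning "securityConfig": {} with no kerberosConfig.
	fmt.Println(flattenSecurityConfig(&SecurityConfig{}))
	// Simulates a cluster that does use Kerberos.
	fmt.Println(flattenSecurityConfig(&SecurityConfig{
		KerberosConfig: &KerberosConfig{EnableKerberos: true, Realm: "EXAMPLE.COM"},
	}))
}
```

Whether an empty `securityConfig` should flatten to an empty block or be dropped entirely is a separate design choice. This sketch keeps the block and simply omits the missing sub-block.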

For workarounds, it would appear we need the kerberosConfig field to be populated with something to avoid the panic, which I expect is not possible for configs that don't use it or securityConfig at all. I don't think ignore_changes will help either, because of where this call is being made. Rolling back the change at the API level (or fixing forward by not returning {}) could be the only option, so we will check with the service team about options there.

axo103 commented 2 months ago

Hi guys, this seems to be a Google issue.

roaks3 commented 2 months ago

Additional note: we are seeing a large number of dataproc tests failing with this same panic, which started last night. This includes even very simple tests like TestAccDataprocCluster_basic.

roaks3 commented 2 months ago

I haven't identified an obvious change that would have caused this yet, but it looks to me like identityConfig has been around for a while. However, there was another undocumented field that was added to securityConfig and could be responsible.

roaks3 commented 2 months ago

Note: seeing some evidence of these panics as early as Aug 28

roaks3 commented 2 months ago

Update: The service team has been actively working on this from the API side.

estol commented 1 month ago

Hey @roaks3, could you share an estimate of when this might get resolved?

roaks3 commented 1 month ago

The API changes were rolled back on Friday, so this should be resolved now. Moving forward, we will be working with the service team to ensure that the provider is ready to handle the change before it is rolled out again.

@estol could you confirm if you are still seeing any issues?

estol commented 1 month ago

@roaks3 seems to be working, thank you.

tempus-omarsalka commented 1 month ago

Yup working for me too, thanks!

roaks3 commented 1 month ago

For the service team: https://github.com/GoogleCloudPlatform/magic-modules/pull/11592 aims to resolve the actual error on the client side, which, in addition to better testing, should ensure that this doesn't surface in future rollouts.
