hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

GCP Terraform dataproc - ignores internal_ip_only value as False #17436

Open tenstriker opened 6 months ago

tenstriker commented 6 months ago

Terraform Version

Google provider 4.75.0 and 5.17.0; Terraform v1.7.4

Affected Resource(s)

google_dataproc_cluster

Terraform Configuration

This happens with an unchanged plan; I am just re-applying the same configuration. The only change on re-apply is Terraform itself being updated.


resource "google_dataproc_cluster" "my-cluster" {
  name    = local.cluster_name
  project = local.project.gcp_project_id
  region  = "us-central1"

  cluster_config {

    gce_cluster_config {

      service_account_scopes = ["useraccounts-ro", "storage-rw", "logging-write", "cloud-platform"]
    }

    master_config {
      num_instances = local.dataproc.master_config.num_instances
      machine_type  = local.dataproc.master_config.machine_type
    }

    worker_config {
      num_instances = local.dataproc.worker_config.num_instances
      machine_type  = local.dataproc.worker_config.machine_type
    }

    software_config {
      image_version = "2.2.0-RC3-debian11"
      override_properties = {
        "dataproc:dataproc.logging.stackdriver.enable"            = "true"
        "dataproc:jobs.file-backed-output.enable"                 = "true"
        "dataproc:dataproc.logging.stackdriver.job.driver.enable" = "true"
        "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable" = "true"
        "spark:spark.history.fs.update.interval"                  = "900"
        "spark:spark.history.fs.cleaner.enabled"                  = "true"
        "spark:spark.history.fs.cleaner.interval"                 = "1d"
        "spark:spark.history.fs.cleaner.maxAge"                   = "30d"      
      }
      # optional_components = [
      #   "JUPYTER"
      # ]
    }

    endpoint_config {
      enable_http_port_access = "true"
    }
    dynamic "autoscaling_config" {
      for_each = local.workspace == "prod" ? [1] : []
      content {
        policy_uri = google_dataproc_autoscaling_policy.asp.name
      }
    }
  }

}

resource "google_dataproc_autoscaling_policy" "asp" {
  policy_id = "default-policy"
  location  = "us-central1"

  worker_config {
    max_instances = local.dataproc_asp.worker_config.max_instances
  }

  secondary_worker_config {
    max_instances = local.dataproc_asp.secondary_worker_config.max_instances
  }

  basic_algorithm {
    cooldown_period = "120s"
    yarn_config {
      graceful_decommission_timeout = "7200s"
      scale_up_factor               = 0.5
      scale_down_factor             = 0.5
    }
  }
}

Debug Output

Error message: INVALID_ARGUMENT: Subnetwork 'default' does not support Private Google Access which is required for Dataproc clusters when 'internal_ip_only' is set to 'true'. Enable Private Google Access on subnetwork 'default' or set 'internal_ip_only' to 'false'.

I'm only pasting a snippet because the full debug output contains a lot of confidential info. The issue is that the default (or explicitly set) value of internal_ip_only is not respected when it is false (it is false by default); instead, upon terraform apply it is treated as true (based on the error message).

You can see from the debug log that internal_ip_only is completely missing from the request; the provider drops it. I assume the GCP backend marks it as true when it is not part of the request payload and then fails the whole request.

https://gist.github.com/tenstriker/de36db2baf3ae0d309f73485fefb769c
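For reference, this is the explicit setting I would expect to be honored; a minimal sketch of just the gce_cluster_config block, where internal_ip_only is the provider attribute that should map to the API's internalIpOnly field:

gce_cluster_config {
  # explicit value; per the debug log above, it never makes it into the request body
  internal_ip_only       = false
  service_account_scopes = ["useraccounts-ro", "storage-rw", "logging-write", "cloud-platform"]
}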


Expected Behavior

The provider should send internal_ip_only as false by default, or at the very least send it when it is set explicitly.

Actual Behavior

It throws a 400 because the API treats internal_ip_only as true while the network is the default one.

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

No response

FYI, cluster creation works fine with the gcloud CLI using a similar configuration, and an external IP gets assigned as well, since I'm using the default subnet.

Update: It seems Dataproc image version 2.2.* introduces this breakage. The issue does not surface with Dataproc image version 2.1.* (see my last comment). b/327455169

Update 03/01/2024: The gcloud CLI also fails with Dataproc image version 2.2.* after I updated the CLI itself using gcloud components update. The message on update was:

Your current Google Cloud CLI version is: 450.0.0
You will be upgraded to version: 466.0.0

So it was working in gcloud CLI version 450 but breaks in 466, at least.

Also, newly created projects that get the default network now have Private Google Access turned off on all of its subnetworks; that was not the case previously.
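If enabling Private Google Access is acceptable as a workaround, a dedicated subnetwork can carry it; a minimal sketch (the subnetwork name and CIDR range below are hypothetical):

resource "google_compute_subnetwork" "dataproc" {
  name                     = "dataproc-subnet" # hypothetical name
  ip_cidr_range            = "10.10.0.0/20"    # hypothetical range
  region                   = "us-central1"
  network                  = "default"
  private_ip_google_access = true # the setting that new default subnetworks now ship with disabled
}

The cluster would then reference it with subnetwork = google_compute_subnetwork.dataproc.self_link inside gce_cluster_config.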

zli82016 commented 6 months ago

@tenstriker , can you please provide the configuration to reproduce the issue?

After upgrading google provider from 4.75.0 to 5.17.0 and then running the command terraform apply, an error occurs. Is that the issue?

tenstriker commented 6 months ago

@tenstriker , can you please provide the configuration to reproduce the issue?

After upgrading google provider from 4.75.0 to 5.17.0 and then running the command terraform apply, an error occurs. Is that the issue?

No, it actually happens with both. It used to work with 4.75.0 but suddenly started failing, so I tried the latest version and it still fails. I wonder if the GCP backend API changed: Terraform was never sending the value, but the backend used to treat it as false by default and has recently started treating it as true? The debug log was taken with version 5.17, which clearly does not send this parameter value in the request body.

zli82016 commented 6 months ago

Thanks for the information, @tenstriker . Is it possible to provide the configuration?

tenstriker commented 6 months ago

Sure, I added it to the OP. I tried tweaking some parameters of cluster_config, like removing autoscaling and optional_components and changing the image version, to no avail.

zli82016 commented 6 months ago

Forwarding this issue to the service team to check the reason for the error message.

tenstriker commented 6 months ago

Thanks. FYI, I ran both provider versions (4.75.0 and 5.17.0) with Terraform v1.7.4, and neither of them includes internal_ip_only in the request body.

tenstriker commented 6 months ago

I think this might also have to do with the image version I am using, 2.2.0-RC3-debian11.

tenstriker commented 6 months ago

Tested another 2.2 image, 2.2.3-debian12. It works with Dataproc image version 2.1.* without any issue, so it seems the Dataproc 2.2.* image versions may have something to do with this.
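As an interim workaround on my end, pinning the cluster back to a 2.1 image avoids the failure; a sketch of just the software_config change (the exact 2.1 version string is hypothetical, use whatever 2.1.* release is current):

software_config {
  image_version = "2.1-debian11" # 2.1.* images do not hit this issue in my testing
}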

tenstriker commented 6 months ago

I added some more updates to the OP.

cnauroth commented 1 month ago

I don't think this is a bug in Terraform, and perhaps we can close this issue. Instead, I think this was a change introduced in Dataproc for stronger security defaults:

https://cloud.google.com/dataproc/docs/release-notes#February_16_2024

Dataproc on Compute Engine: The internalIpOnly cluster configuration setting now defaults to true for clusters created with 2.2 image versions. Also see Create a Dataproc cluster with internal IP addresses only.

The timing of that release roughly correlates with the date this issue was created.

cnauroth commented 1 month ago

...although #18503 suggests that even if you explicitly set internal_ip_only = false, it's not respected. That part would definitely be a bug.
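To make the two paths concrete, here is a sketch of the relevant gce_cluster_config options (attribute names are from the provider docs; the subnetwork path is hypothetical): either opt out of the new 2.2 default explicitly, which this issue and #18503 report is currently dropped from the request, or keep the default and point the cluster at a subnetwork with Private Google Access enabled.

gce_cluster_config {
  # Option A: opt out of the new 2.2 default (currently reported as not respected)
  internal_ip_only = false

  # Option B: keep internal_ip_only = true (the 2.2 default) and use a subnetwork
  # that has Private Google Access enabled (path below is hypothetical)
  # subnetwork = "projects/my-project/regions/us-central1/subnetworks/pga-enabled-subnet"
}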

zli82016 commented 4 weeks ago

...although #18503 suggests that even if you explicitly set internal_ip_only = false, it's not respected. That part would definitely be a bug.

This is still a bug in this case and needs to be fixed. I will leave this GitHub issue open.