hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Dataproc Job creation fails for Spark SQL job with `query_file_uri` set #13278

Open mustaFAB53 opened 1 year ago

mustaFAB53 commented 1 year ago

Terraform Version

Local Terraform version: 1.3.6; Google provider version: 1.46.0

Affected Resource(s)

google_dataproc_job

Terraform Configuration Files

resource "google_dataproc_job" "spark-sql-tables" {

  region       = var.region
  project      = var.project

  placement {
    cluster_name = module.dataproc-cluster[0].dataproc_cluster_name
  }

  sparksql_config {
    query_file_uri = "gs://${var.artifacts_bucket}/sql/tables.sql"
    jar_file_uris  = local.jar_file_uris_list
  }
}

Debug Output

TF Debug Log: https://gist.github.com/mustaFAB53/455a8955bff34fe70480e5558384c69b

The same job submission works via the gcloud CLI. gcloud CLI debug log: https://gist.github.com/mustaFAB53/b1f841e2b4ed318b3e84166991e3d94a

Expected Behavior

The Dataproc job should be created successfully when `query_file_uri` is set to a GCS URI for a Spark SQL job.

Actual Behavior

Resource creation failed with error:

│ Error: googleapi: Error 400: Invalid value at 'job.spark_sql_job' (oneof), oneof field 'queries' is already set. Cannot set 'queryList'
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.BadRequest",
│     "fieldViolations": [
│       {
│         "description": "Invalid value at 'job.spark_sql_job' (oneof), oneof field 'queries' is already set. Cannot set 'queryList'",
│         "field": "job.spark_sql_job"
│       }
│     ]
│   }
│ ]
│ , invalid

Steps to Reproduce

  1. Create a `google_dataproc_job` resource using the configuration shared above, with `query_file_uri` pointing to a GCS bucket URI
  2. terraform apply

Findings

When I compared the payloads sent by Terraform and by the gcloud CLI, I noticed that Terraform passes `queryList` as an empty map, while the gcloud CLI payload omits it entirely since I am using `queryFileUri`. Both payloads are below, followed by a short sketch of why the empty `queryList` matters.

Terraform

{
 "job": {
  "placement": {
   "clusterName": "dataproc-query-cluster1"
  },
  "reference": {
   "projectId": "REDACTED"
  },
  "sparkSqlJob": {
   "jarFileUris": [
    "gs://REDACTED/executables/packages/spark-sql-kafka-0-10_2.12-3.1.2.jar",
    "gs://REDACTED/executables/packages/delta-core_2.12-1.0.0.jar",
    "gs://REDACTED/executables/packages/config-1.4.2.jar",
    "gs://REDACTED/executables/packages/spark-token-provider-kafka-0-10_2.12-3.1.2.jar",
    "gs://REDACTED/executables/packages/kafka-clients-2.6.0.jar",
    "gs://REDACTED/executables/packages/commons-pool2-2.6.2.jar",
    "gs://REDACTED/executables/packages/unused-1.0.0.jar",
    "gs://REDACTED/executables/packages/zstd-jni-1.4.8-1.jar",
    "gs://REDACTED/executables/packages/lz4-java-1.7.1.jar",
    "gs://REDACTED/executables/packages/snappy-java-1.1.8.2.jar",
    "gs://REDACTED/executables/packages/slf4j-api-1.7.30.jar",
    "gs://REDACTED/executables/packages/antlr4-4.7.jar",
    "gs://REDACTED/executables/packages/antlr4-runtime-4.7.jar",
    "gs://REDACTED/executables/packages/antlr-runtime-3.5.2.jar",
    "gs://REDACTED/executables/packages/ST4-4.0.8.jar",
    "gs://REDACTED/executables/packages/org.abego.treelayout.core-1.0.3.jar",
    "gs://REDACTED/executables/packages/javax.json-1.0.4.jar",
    "gs://REDACTED/executables/packages/icu4j-58.2.jar"
   ],
   "queryFileUri": "gs://REDACTED/sql/tables.sql",
   "queryList": {}
  }
 }
}

Gcloud CLI

{
  "reference": {
    "projectId": "REDACTED",
    "jobId": "b83ef8e86c854894a2791b5280a7b952"
  },
  "placement": {
    "clusterName": "dataproc-query-cluster1",
    "clusterUuid": "898d958b-4087-4e0b-8efb-2133d557e062"
  },
  "status": {
    "state": "PENDING",
    "stateStartTime": "2022-12-17T04:56:02.697442Z"
  },
  "sparkSqlJob": {
    "queryFileUri": "gs://REDACTED/sql/tables.sql",
    "jarFileUris": [
      "gs://REDACTED/executables/packages/spark-sql-kafka-0-10_2.12-3.1.2.jar",
      "gs://REDACTED/executables/packages/delta-core_2.12-1.0.0.jar",
      "gs://REDACTED/executables/packages/config-1.4.2.jar",
      "gs://REDACTED/executables/packages/spark-token-provider-kafka-0-10_2.12-3.1.2.jar",
      "gs://REDACTED/executables/packages/kafka-clients-2.6.0.jar",
      "gs://REDACTED/executables/packages/commons-pool2-2.6.2.jar",
      "gs://REDACTED/executables/packages/unused-1.0.0.jar",
      "gs://REDACTED/executables/packages/zstd-jni-1.4.8-1.jar",
      "gs://REDACTED/executables/packages/lz4-java-1.7.1.jar",
      "gs://REDACTED/executables/packages/snappy-java-1.1.8.2.jar",
      "gs://REDACTED/executables/packages/slf4j-api-1.7.30.jar",
      "gs://REDACTED/executables/packages/antlr4-4.7.jar",
      "gs://REDACTED/executables/packages/antlr4-runtime-4.7.jar",
      "gs://REDACTED/executables/packages/antlr-runtime-3.5.2.jar",
      "gs://REDACTED/executables/packages/ST4-4.0.8.jar",
      "gs://REDACTED/executables/packages/org.abego.treelayout.core-1.0.3.jar",
      "gs://REDACTED/executables/packages/javax.json-1.0.4.jar",
      "gs://REDACTED/executables/packages/icu4j-58.2.jar"
    ]
  },
  "driverControlFilesUri": "gs://REDACTED/google-cloud-dataproc-metainfo/898d958b-4087-4e0b-8efb-2133d557e062/jobs/b83ef8e86c854894a2791b5280a7b952/",
  "driverOutputResourceUri": "gs://REDACTED/google-cloud-dataproc-metainfo/898d958b-4087-4e0b-8efb-2133d557e062/jobs/b83ef8e86c854894a2791b5280a7b952/driveroutput",
  "jobUuid": "92264add-3758-39dc-b207-ace49de9e59e"
}
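
For context on why the empty `queryList` matters: an empty-but-present query list still serializes as `"queryList": {}`, which the API treats as setting the `queries` oneof even though no queries were supplied. Below is a minimal, self-contained sketch of that serialization behavior; the structs are simplified stand-ins that mirror the payload shapes above, not the real client-library types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the Dataproc SparkSqlJob request shape; these are
// not the real client-library structs, just enough to show the serialization.
type QueryList struct {
	Queries []string `json:"queries,omitempty"`
}

type SparkSqlJob struct {
	QueryFileUri string     `json:"queryFileUri,omitempty"`
	QueryList    *QueryList `json:"queryList,omitempty"`
	JarFileUris  []string   `json:"jarFileUris,omitempty"`
}

func main() {
	// What Terraform effectively sends: QueryList is non-nil but empty, so it
	// still serializes as "queryList": {} and trips the API's oneof check.
	broken := SparkSqlJob{
		QueryFileUri: "gs://bucket/sql/tables.sql",
		QueryList:    &QueryList{},
	}

	// What the gcloud CLI sends: QueryList is nil, so it is omitted entirely.
	working := SparkSqlJob{
		QueryFileUri: "gs://bucket/sql/tables.sql",
	}

	b, _ := json.Marshal(broken)
	w, _ := json.Marshal(working)
	fmt.Println(string(b)) // {"queryFileUri":"gs://bucket/sql/tables.sql","queryList":{}}
	fmt.Println(string(w)) // {"queryFileUri":"gs://bucket/sql/tables.sql"}
}
```

Running this prints the two payload shapes seen above: the first includes `"queryList": {}` and is rejected by the API, the second omits it and matches what gcloud sends.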

b/303808492

edwardmedia commented 1 year ago

This looks like it may be a Terraform core issue.

`query_file_uri` and `query_list` are mutually exclusive. `query_list` is defined as a list; when `query_file_uri` is provided, Terraform still sends an empty list for `query_list`, which causes this error. The expected behavior is that `query_list` is excluded from the payload entirely (a sketch of such a guard follows the snippet below).

ExactlyOneOf: []string{"sparksql_config.0.query_file_uri", "sparksql_config.0.query_list"}
  "sparkSqlJob": {
   "jarFileUris": [
    "gs://REDACTED/executables/packages/spark-sql-kafka-0-10_2.12-3.1.2.jar",
    ....
   ],
   "queryFileUri": "gs://REDACTED/sql/tables.sql",
   "queryList": {}
  }
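
A minimal sketch of the kind of guard that would avoid this, assuming an expander along the lines of the provider's SDK-based Dataproc job resource. The function name `expandSparkSqlConfig` and the way the config map is read are hypothetical; `dataproc.SparkSqlJob` and `dataproc.QueryList` are the generated types from `google.golang.org/api/dataproc/v1`.

```go
package google

import (
	dataproc "google.golang.org/api/dataproc/v1"
)

// expandSparkSqlConfig is a hypothetical expander: the point is that QueryList
// is left nil (and therefore absent from the request body) unless query_list
// was actually configured, so only one side of the 'queries' oneof is ever set.
func expandSparkSqlConfig(config map[string]interface{}) *dataproc.SparkSqlJob {
	job := &dataproc.SparkSqlJob{}

	if v, ok := config["query_file_uri"].(string); ok && v != "" {
		job.QueryFileUri = v
	}

	// Only populate QueryList when at least one query string is present.
	if raw, ok := config["query_list"].([]interface{}); ok && len(raw) > 0 {
		queries := make([]string, 0, len(raw))
		for _, q := range raw {
			queries = append(queries, q.(string))
		}
		job.QueryList = &dataproc.QueryList{Queries: queries}
	}

	if raw, ok := config["jar_file_uris"].([]interface{}); ok && len(raw) > 0 {
		uris := make([]string, 0, len(raw))
		for _, u := range raw {
			uris = append(uris, u.(string))
		}
		job.JarFileUris = uris
	}

	return job
}
```

The actual fix would belong in the provider's Dataproc job expanders, but the shape of the guard is the same: never assign an empty `QueryList`.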
mustaFAB53 commented 1 year ago

Yes @edwardmedia, it seems so.

mustaFAB53 commented 1 year ago

@edwardmedia, any updates on this?

mustaFAB53 commented 1 year ago

@edwardmedia, please let me know what the next steps are to resolve this issue. It's a blocker for our infrastructure automation workflow.

mustaFAB53 commented 1 year ago

Hi @edwardmedia, please provide your valuable inputs