hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Google Cloud Dataflow - execution parameters are not configurable in Terraform (diskSizeGb, workerDiskType, workerMachineType) #1504

Open ghost opened 6 years ago

ghost commented 6 years ago

This issue was originally opened by @karthik-papajohns as hashicorp/terraform#18073. It was migrated here as a result of the provider split. The original body of the issue is below.

Affected Resource(s)

google_dataflow_job


Terraform Version

Terraform v0.11.5

Terraform Configuration Files

...

Debug Output

Crash Output

Expected Behavior

Expected Terraform to allow configuring the Google Cloud Dataflow execution parameters (diskSizeGb, workerDiskType, workerMachineType).

https://cloud.google.com/dataflow/pipelines/specifying-exec-params

Actual Behavior

No references to execution parameters for Google Cloud Dataflow are found in the official Terraform documentation.

https://www.terraform.io/docs/providers/google/r/dataflow_job.html

Additional Context

b/351028604

MaxBinnewies commented 5 years ago

I have found a dirty work-around:

  1. I downloaded the Google template I wanted to deploy to Dataflow from Google's template bucket with the gcloud CLI tool.
  2. For a custom pipeline we wrote in Apache Beam (Java), I set the workerMachineType parameter to the value I wanted and then wrote the pipeline out to a template file instead of deploying it to GCP.
  3. Then I looked through the template I had just created and manually copied everything relating to "machineType" over to the Google template I had previously downloaded. There were three places in total:

At the top in "options": "zone" : null, "workerMachineType" : "n1-standard-1", "gcpTempLocation" : "gs://dataflow-staging-us-central1-473832897378/temp/",

Again at the bottom of sdkPipelineOptions: }, { "namespace" : "org.apache.beam.runners.dataflow.options.DataflowPipelineOptions", "key" : "templateLocation", "type" : "STRING", "value" : "gs://dataflow-templates-staging/2018-10-08-00_RC00/PubSub_to_BigQuery" }, { "namespace" : "org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions", "key" : "workerMachineType", "type" : "STRING", "value" : "n1-standard-1" } ] },

And finally in "workerPools": "dataDisks" : [ { } ], "machineType" : "n1-standard-1", "numWorkers" : 0,

  4. I used the Terraform resource "google_storage_bucket_object" to upload this modified template file into a bucket in my GCP project.
  5. Finally, I pointed "template_gcs_path" in "google_dataflow_job" at the newly uploaded object in my bucket instead of the standard Google template (see the sketch after this comment).

I realise it is a bit hacky, but it works. The pipeline is successfully deployed on an n1-standard-1 Compute Engine instance instead of the default n1-standard-4.
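
A minimal Terraform sketch of steps 4 and 5 above, assuming a hand-edited template file on disk; the bucket name, object name, file path, and template parameters are placeholders, not values taken from this thread:

```hcl
# Upload the hand-edited template JSON to a bucket in our own project.
resource "google_storage_bucket_object" "patched_template" {
  name   = "templates/PubSub_to_BigQuery_patched.json"      # hypothetical object name
  bucket = "my-dataflow-templates"                          # hypothetical bucket
  source = "${path.module}/PubSub_to_BigQuery_patched.json" # the locally modified template
}

# Point the Dataflow job at the patched template instead of Google's stock template.
resource "google_dataflow_job" "pubsub_to_bq" {
  name              = "pubsub-to-bq"
  template_gcs_path = "gs://${google_storage_bucket_object.patched_template.bucket}/${google_storage_bucket_object.patched_template.name}"
  temp_gcs_location = "gs://my-dataflow-templates/temp"

  # Parameters expected by the (hypothetical) patched PubSub_to_BigQuery template.
  parameters = {
    inputTopic      = "projects/my-project/topics/my-topic"
    outputTableSpec = "my-project:my_dataset.my_table"
  }
}
```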

tysen commented 5 years ago

machine_type is now configurable. The others aren't yet.
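
For reference, a minimal sketch of setting the worker machine type through the provider's machine_type argument; the job name, template path, and bucket are illustrative placeholders:

```hcl
resource "google_dataflow_job" "example" {
  name              = "example-job"
  template_gcs_path = "gs://dataflow-templates/latest/PubSub_to_BigQuery" # illustrative template
  temp_gcs_location = "gs://my-bucket/temp"                               # hypothetical bucket

  # Worker machine type is exposed by the provider; worker disk size and
  # worker disk type are not configurable here at the time of this comment.
  machine_type = "n1-standard-1"
}
```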

pkatsovich commented 4 years ago

Is there any planned timeline for making diskSizeGb configurable as well? As documented in Google Dataflow's common error guidance, we'd like to be able to manage the workers' disk size when managing Dataflow jobs with Terraform.

eliasscosta commented 3 years ago

Another nice-to-have feature would be the ability to set the number of workers.

tjwebb commented 3 years ago

+1, would like to be able to set disk_size_gb and worker_disk_type

roaks3 commented 3 months ago

This is a slightly odd API: we actually call a Launch endpoint that only seems to support RuntimeEnvironment (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment), as opposed to configuring the WorkerPool directly as described in https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job.WorkerPool.

It looks like the current state of the request is: machine_type is already configurable, while disk_size_gb and worker_disk_type are still outstanding.

I'm forwarding this to the service team to weigh in, but IMO we will likely want to split those last two items into separate tickets, since disk_size_gb should be significantly easier to implement.