gluent / goe

GOE: a simple and flexible way to copy data from an Oracle Database to Google BigQuery.
Apache License 2.0

Add Spark fileoutputcommitter configuration to BigQuery offload template #183

Open nj1973 opened 1 month ago

nj1973 commented 1 month ago

From https://spark.apache.org/docs/latest/cloud-integration.html:

"For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety."

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2

The page also lists Google Cloud Storage (gs) as a safe object store. Therefore, when staging to GCS, we should use the v2 algorithm. We can do that by adding the following to the offload.env.template.bigquery template file:

export OFFLOAD_TRANSPORT_SPARK_PROPERTIES='{"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2}'

We need to verify the information above is still accurate before working on this.
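
As a quick sanity check of the proposed template value, here is a minimal Python sketch that parses the JSON string and renders it as spark-submit style `--conf` arguments. The helper name `spark_conf_args` is hypothetical and is not taken from the GOE code base. How GOE actually consumes OFFLOAD_TRANSPORT_SPARK_PROPERTIES is an assumption here. The sketch only shows that the value is a valid JSON object and what the resulting property would look like.

```python
import json


def spark_conf_args(raw):
    """Hypothetical helper (not part of GOE): turn a JSON object of
    Spark properties into a flat list of spark-submit --conf arguments."""
    props = json.loads(raw) if raw else {}
    if not isinstance(props, dict):
        raise ValueError("expected a JSON object of Spark properties")
    args = []
    for key, value in props.items():
        # JSON booleans become Python bools; Spark expects "true"/"false".
        if isinstance(value, bool):
            value = str(value).lower()
        args.extend(["--conf", f"{key}={value}"])
    return args


# The value proposed for offload.env.template.bigquery in this issue.
raw = '{"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2}'
print(spark_conf_args(raw))
```

The integer 2 in the JSON is rendered as the string "2", which is the form Spark property values take on the command line.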

nj1973 commented 1 month ago

This should also apply to Snowflake when using GCS/Azure transport.