spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

Make use of parquet format instead of avro #61

Closed yu-iskw closed 6 years ago

yu-iskw commented 6 years ago

Hi @nevillelyh

We use the Avro format to save a dataframe to GCS before loading the Avro files into BigQuery. One of the biggest advantages of Avro is that BigQuery can read the schema from the Avro metadata. However, Avro doesn't support a timestamp type, so we need a workaround to store a dataframe that includes timestamp columns.
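To make the workaround concrete, here is a minimal sketch (helper names are mine, not from spark-bigquery): since Avro, as written by spark-avro at the time, has no timestamp type, a timestamp column has to be encoded as a plain long (e.g. epoch microseconds) before writing, and decoded again after loading.

```python
from datetime import datetime, timezone

# Hypothetical helpers illustrating the twist: store timestamps as
# epoch-microsecond longs, which Avro can represent, instead of a
# native timestamp type, which it cannot.

def timestamp_to_micros(ts: datetime) -> int:
    """Encode a timezone-aware datetime as epoch microseconds (a long)."""
    return int(ts.timestamp() * 1_000_000)

def micros_to_timestamp(us: int) -> datetime:
    """Decode epoch microseconds back into a UTC datetime."""
    return datetime.fromtimestamp(us / 1_000_000, tz=timezone.utc)

# Round-trip check: the encoding loses nothing at microsecond precision.
ts = datetime(2018, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
assert micros_to_timestamp(timestamp_to_micros(ts)) == ts
```

In Spark this would amount to casting the timestamp column to a long before `write.avro(...)` and declaring the corresponding BigQuery column as TIMESTAMP separately, which is exactly the extra step the Avro path forces on us.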

I believe Parquet supports the timestamp type. Moreover, if BigQuery can load Parquet files from GCS without an explicit schema, it might be better to use the Parquet format. What do you think?

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
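According to the page above, BigQuery can infer the schema from Parquet files because the format is self-describing, so no explicit schema is needed at load time. A sketch with the `bq` CLI (the bucket, dataset, and table names here are placeholders, not from this project):

```shell
# Load Parquet files from GCS into BigQuery; the schema is read from
# the Parquet file metadata, so no schema argument is required.
bq load \
  --source_format=PARQUET \
  mydataset.mytable \
  'gs://mybucket/data/*.parquet'
```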

nevillelyh commented 6 years ago

You could give it a shot. But I'm not sure the timestamp support alone justifies the cost.

yu-iskw commented 6 years ago

Thank you for the comment. You make good points. We would lose the advantages of Avro just for the sake of the timestamp type. We could also modify spark-avro itself; I think that would be better for our case.

https://github.com/databricks/spark-avro/issues/229