Exporting data to Google Cloud Storage in Parquet format available but undocumented

pegoenrico commented 2 weeks ago

Hello all. I'm trying to export queried data from a BigQuery database table. Since the resulting table can be large (2.5GB or more), I followed the suggestion "Larger datasets" from the bq_table_download() help, and I used bq_table_save() to save the data in multiple files in Google Cloud Storage.

When I tried bq_table_save(), I discovered an undocumented option for the export format: destination_format = "PARQUET" in place of "NEWLINE_DELIMITED_JSON" or "CSV". With this parameter, bq_table_save() correctly saves the data in multiple Parquet files.

Can I use this option without problems? It seems to work very well: it is fast, and using Parquet files saves me a lot of work checking data types.

The following code summarizes what I used to successfully export data to a Google Cloud Storage bucket:

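# Run the query; the result is stored in a temporary BigQuery table
project_id <- "<project identifier>"
sql_dwn <- "SELECT * FROM <table from which to extract data>"
tb <- bq_project_query(project_id, sql_dwn)
# Export the result table to Cloud Storage; the * wildcard lets BigQuery split the output into several files
bq_table_save(tb, destination_uris = "gs://destination_bucket/folder/filename_*.parquet", destination_format = "PARQUET")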

Thank you in advance for your help.

pegoenrico commented 2 weeks ago

Can anyone help me, please?

apalacio9502 commented 6 days ago

Hi @pegoenrico,

The Parquet format is supported, according to the BigQuery documentation (https://cloud.google.com/bigquery/docs/exporting-data), and in this case, the library documentation needs to be updated.

I expect the documentation for the development version to be updated in a few days: https://github.com/r-dbi/bigrquery/pull/618.

If you use the parameter destination_format = "PARQUET", please note that the supported compression formats are "SNAPPY", "GZIP", "ZSTD", or "NONE".
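
As a rough illustration of how the compression setting could be combined with the export above, here is a minimal sketch. It assumes that bq_table_save() forwards extra arguments such as compression to the underlying extract job, as current bigrquery versions appear to do. The helpers parquet_export_uri() and export_query_to_parquet() are hypothetical names introduced only for this example, not part of bigrquery, and the bucket and prefix values are placeholders.

```r
# Hypothetical helper (not part of bigrquery): build a wildcard Cloud Storage
# URI so that BigQuery can split a large export across several files.
parquet_export_uri <- function(bucket, prefix) {
  stopifnot(is.character(bucket), length(bucket) == 1, nzchar(bucket))
  paste0("gs://", bucket, "/", prefix, "_*.parquet")
}

# Hypothetical wrapper: run a query, then export the result table as Parquet.
# Needs the bigrquery package and authenticated access when it is called.
export_query_to_parquet <- function(project_id, sql, bucket, prefix,
                                    compression = c("SNAPPY", "GZIP", "ZSTD", "NONE")) {
  # Reject anything outside the compression values listed above
  compression <- match.arg(compression)
  tb <- bigrquery::bq_project_query(project_id, sql)
  bigrquery::bq_table_save(
    tb,
    destination_uris   = parquet_export_uri(bucket, prefix),
    destination_format = "PARQUET",
    compression        = compression  # assumed to be passed on to the extract job
  )
}
```

A call such as export_query_to_parquet("<project identifier>", sql_dwn, "destination_bucket", "folder/filename", compression = "ZSTD") would then be expected to write ZSTD-compressed Parquet files under that prefix. An unsupported compression value fails early in match.arg(), before any query is sent.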

Regards,

pegoenrico commented 6 days ago

Hi @apalacio9502, thank you very much for your update. Now I'll feel free to use the PARQUET format to export data from BigQuery tables. Best regards! Enrico

r-dbi / bigrquery

Exporting data to Google Cloud Storage in Parquet format available but undocumented #614