googleapis / python-bigquery


Select columns with `extract_table`? #2021

Closed ReubenFrankel closed 1 month ago

ReubenFrankel commented 2 months ago

Is it possible to somehow select specific columns for export with `extract_table`? Or do I need to use `query` for that kind of additional functionality?

peter-s-martin commented 1 month ago

Try this alternative way to export specific columns from the table rather than exporting the entire table:

EXPORT DATA OPTIONS(
    uri='gs://bucket/folder/*.csv',
    format='CSV',
    overwrite=true,
    header=true,
    field_delimiter=';'
) AS
SELECT field1, field2
FROM mydataset.table1
ORDER BY field1
LIMIT 10
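
For reference, a minimal sketch of running that statement through the Python client (the bucket, dataset, and field names are placeholders, as above):

from google.cloud import bigquery

client = bigquery.Client()
export_sql = """
EXPORT DATA OPTIONS(
    uri='gs://bucket/folder/*.csv',
    format='CSV',
    overwrite=true,
    header=true,
    field_delimiter=';'
) AS
SELECT field1, field2
FROM mydataset.table1
ORDER BY field1
LIMIT 10
"""
query_job = client.query(export_sql)  # EXPORT DATA runs as a query job.
query_job.result()  # Waits for the export to complete.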

ReubenFrankel commented 1 month ago

> Try this alternative way to export specific columns from the table rather than exporting the entire table:
>
> EXPORT DATA OPTIONS(
>     uri='gs://bucket/folder/*.csv',
>     format='CSV',
>     overwrite=true,
>     header=true,
>     field_delimiter=';'
> ) AS
> SELECT field1, field2
> FROM mydataset.table1
> ORDER BY field1
> LIMIT 10

Thanks, I was aware that this is possible through `query`, but I want to be able to do this with `extract_table`, where you can already specify export format, compression, and more via the `job_config` kwarg:

# from google.cloud import bigquery
# client = bigquery.Client()
# project = 'my-project'
# dataset_id = 'my_dataset'
# bucket_name = 'my-bucket'

destination_uri = "gs://{}/{}".format(bucket_name, "shakespeare.csv.gz")
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table("shakespeare")
job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location="US",
    job_config=job_config,
)  # API request
extract_job.result()  # Waits for job to complete.

https://cloud.google.com/bigquery/docs/exporting-data#extract_compressed_table

Linchin commented 1 month ago

Thanks @peter-s-martin for answering the question! Indeed we can use EXPORT DATA to extract only specific columns of a table.

@ReubenFrankel You can specify the same configs with EXPORT DATA: see the export_option_list section of the statement's documentation.
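
For example, a minimal sketch that mirrors the ExtractJobConfig settings from the sample above (GZIP-compressed CSV); the Shakespeare public sample table and bucket name are stand-ins:

from google.cloud import bigquery

client = bigquery.Client()
# compression and format mirror ExtractJobConfig.compression / destination_format.
export_sql = """
EXPORT DATA OPTIONS(
    uri='gs://my-bucket/shakespeare-*.csv.gz',
    format='CSV',
    compression='GZIP',
    overwrite=true,
    header=true
) AS
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
"""
client.query(export_sql).result()  # Waits for the export to finish.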

I will close the issue for now, but feel free to leave a comment or open a new issue if you have any further questions.

ReubenFrankel commented 1 month ago

This question was more about whether there is a way to select columns for an extract programmatically with `extract_table`, rather than providing the SQL for an export as a string to `query` (I was already aware of that alternative). In terms of DX, I prefer the interface `extract_table` provides and having an `ExtractJob` instance returned. Since you both directed me to the SQL approach, I'm going to assume it's not currently possible with `extract_table`.

For posterity, here's what I imagined the API might look like:

extract_job = client.extract_table(
    table_ref,
    columns=[
        "event_name",
        "event_timestamp",
    ],
    destination_uri=destination_uri,
    location="US",
    job_config=job_config,
)

or

job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.columns = [
    "event_name",
    "event_timestamp",
]

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    location="US",
    job_config=job_config,
)

Linchin commented 1 month ago

Indeed, I don't think there's an option to specify columns directly in the ExtractJob API. It does feel like a useful feature, though. If you are interested in pursuing it, it would be helpful to file a feature request with customer support here: https://cloud.google.com/bigquery/docs/getting-support.
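
In the meantime, one possible workaround (a sketch only; the project, dataset, table, and bucket names are placeholders) is to materialize the selected columns into a staging table with a query job and then extract that table, which still gives you an ExtractJob:

from google.cloud import bigquery

client = bigquery.Client()

# 1. Materialize only the desired columns into a staging table.
staging_ref = bigquery.DatasetReference("my-project", "my_dataset").table("events_staging")
query_config = bigquery.QueryJobConfig(
    destination=staging_ref,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    "SELECT event_name, event_timestamp FROM `my-project.my_dataset.events`",
    job_config=query_config,
).result()

# 2. Extract the staging table; this returns an ExtractJob as usual.
job_config = bigquery.ExtractJobConfig(compression=bigquery.Compression.GZIP)
extract_job = client.extract_table(
    staging_ref,
    "gs://my-bucket/events-*.csv.gz",
    location="US",
    job_config=job_config,
)
extract_job.result()

# 3. Clean up the staging table.
client.delete_table(staging_ref, not_found_ok=True)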