googleapis / python-bigquery


load_table_from_dataframe support JSON column dtype #1966

Open jlynchMicron opened 4 months ago

jlynchMicron commented 4 months ago

Is your feature request related to a problem? Please describe.
python-bigquery does not currently appear to support uploading DataFrames to tables where one of the destination columns has the JSON type.

Quick partial code example:

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("system_id", "STRING"),
    bigquery.SchemaField("summery_type", "STRING"),
    bigquery.SchemaField("summery_datetime", "DATETIME"),
    bigquery.SchemaField("summery_is_good", "BOOL"),
    bigquery.SchemaField("summery_data", "JSON"),  # Stored in the dataframe as a JSON-formatted string
]
job_config = bigquery.LoadJobConfig(schema=schema)

# `data` (a pandas DataFrame) and `table_obj` (the destination table) are defined elsewhere.
job = client.load_table_from_dataframe(data, table_obj, job_config=job_config)
print(job.result())

Result: google.api_core.exceptions.BadRequest: 400 Unsupported field type: JSON; reason: invalid, message: Unsupported field type: JSON

Describe the solution you'd like
Implement support for loading DataFrames that contain JSON columns into BigQuery.

Additional context

$ conda list | grep bigquery
google-cloud-bigquery-core 3.25.0             pyhd8ed1ab_0  
google-cloud-bigquery-storage 2.25.0             pyhca7485f_0  
google-cloud-bigquery-storage-core 2.25.0             pyhca7485f_0

Related issues:
https://github.com/googleapis/python-bigquery-sqlalchemy/issues/399
https://github.com/googleapis/python-bigquery-pandas/issues/698
https://github.com/googleapis/python-bigquery-dataframes/issues/816

jlynchMicron commented 4 months ago

I have also tried uploading the data as dtype STRING, in hopes that it would be converted to JSON server-side, but that results in the following error (table name hidden):

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/gdw-dev-sse/jobs?uploadType=resumable: Provided Schema does not match Table XXXXXXXXXXXXXXXX. Field summery_data has changed type from JSON to STRING

jlynchMicron commented 4 months ago

I was able to upload my data with the following code, but JSON support should be added for pandas dataframes.

Code:

# Datetime objects are not JSON serializable; convert them to strings first.
data['summery_datetime'] = data['summery_datetime'].astype(str)
records = data.to_dict(orient='records')
job = client.load_table_from_json(records, table_obj, job_config=job_config)
print(job.result())
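
Why this works: load_table_from_json serializes the rows using the newline-delimited JSON source format, which the backend accepts for JSON columns, unlike the Parquet serialization that load_table_from_dataframe uses by default.
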
tswast commented 3 months ago

Thanks @jlynchMicron for providing a workaround. I think there are a few problems we'll need to work through. One is that the BigQuery backend doesn't support JSON in load jobs from Parquet files: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions. Please file a Feature Request at https://cloud.google.com/support/docs/issue-trackers (specifically, "Create new BigQuery issue").

Another possible workaround is to use CSV as the source format. In your example:

schema = [
    bigquery.SchemaField("system_id", "STRING"),
    bigquery.SchemaField("summery_type", "STRING"),
    bigquery.SchemaField("summery_datetime", "DATETIME"),
    bigquery.SchemaField("summery_is_good", "BOOL"),
    bigquery.SchemaField("summery_data", "JSON"),  # Stored in the dataframe as a JSON-formatted string
]
job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format="CSV",
)

job = client.load_table_from_dataframe(data, table_obj, job_config=job_config)
print(job.result())

For Googlers watching this issue: I have proposed a design, go/bf-json, which introduces a JSONDtype in https://github.com/googleapis/python-db-dtypes-pandas and would allow us to autodetect when JSON is used in a DataFrame. Before that lands, though, we could choose the appropriate serialization format based on the provided schema: Parquet must be used for STRUCT/ARRAY columns, while CSV must be used for JSON. A rough sketch of that selection logic is below.
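
A minimal sketch of that schema-driven choice, assuming a hypothetical helper pick_source_format (not part of the library):

from google.cloud import bigquery

def pick_source_format(schema):
    # Hypothetical helper: Parquet load jobs handle STRUCT/ARRAY columns but
    # not JSON, while CSV load jobs handle JSON but not STRUCT/ARRAY.
    field_types = {field.field_type for field in schema}
    has_json = "JSON" in field_types
    has_nested = bool(field_types & {"RECORD", "STRUCT"}) or any(
        field.mode == "REPEATED" for field in schema
    )
    if has_json and has_nested:
        raise ValueError(
            "No single serialization format supports both JSON and STRUCT/ARRAY columns."
        )
    return bigquery.SourceFormat.CSV if has_json else bigquery.SourceFormat.PARQUET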

Linchin commented 3 months ago

https://github.com/googleapis/python-db-dtypes-pandas/pull/284 can potentially fulfill this feature request.
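
If that lands, usage might look roughly like this (a sketch only, assuming the dtype is exposed as db_dtypes.JSONDtype as proposed in that PR; not a released API at the time of this thread):

import db_dtypes  # assumes the proposed JSONDtype from PR 284 is available
import pandas as pd

# A column backed by the proposed JSON extension dtype, which
# load_table_from_dataframe could then detect and map to a JSON column.
df = pd.DataFrame(
    {"summery_data": pd.Series([{"a": 1}, {"b": 2}], dtype=db_dtypes.JSONDtype())}
)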