googleapis / python-bigquery


ValueError encountered when to_dataframe returns empty resultset with JSON field #1580

Open jdub55 opened 1 year ago

jdub55 commented 1 year ago


Environment details

(bigquery_reproduce) jupyter@gke-db-executor:~/bigquery_reproduce$ python --version
Python 3.9.0
(bigquery_reproduce) jupyter@gke-db-executor:~/bigquery_reproduce$ pip --version
pip 23.1.2 from /opt/conda/envs/bigquery_reproduce/lib/python3.9/site-packages/pip (python 3.9)
(bigquery_reproduce) jupyter@gke-db-executor:~/bigquery_reproduce$ pip show google-cloud-bigquery
Name: google-cloud-bigquery
Version: 3.11.0
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author-email: googleapis-packages@google.com
Location: /opt/conda/envs/bigquery_reproduce/lib/python3.9/site-packages
Requires: google-api-core, google-cloud-core, google-resumable-media, grpcio, packaging, proto-plus, protobuf, python-dateutil, requests
(bigquery_reproduce) jupyter@gke-db-executor:~/bigquery_reproduce$ pip show google-cloud-bigquery-storage
Name: google-cloud-bigquery-storage
Version: 2.20.0
Summary: Google Cloud Bigquery Storage API client library
Home-page: https://github.com/googleapis/python-bigquery-storage
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /opt/conda/envs/bigquery_reproduce/lib/python3.9/site-packages
Requires: google-api-core, proto-plus, protobuf

(bigquery_reproduce) jupyter@gke-db-executor:~/bigquery_reproduce$ pip list
Package                       Version
----------------------------- --------
cachetools                    5.3.1
certifi                       2023.5.7
charset-normalizer            3.1.0
db-dtypes                     1.1.1
google-api-core               2.11.0
google-auth                   2.19.1
google-cloud-bigquery         3.11.0
google-cloud-bigquery-storage 2.20.0
google-cloud-core             2.3.2
google-crc32c                 1.5.0
google-resumable-media        2.5.0
googleapis-common-protos      1.59.0
grpcio                        1.55.0
grpcio-status                 1.55.0
idna                          3.4
numpy                         1.24.3
packaging                     23.1
pandas                        2.0.2
pip                           23.1.2
proto-plus                    1.22.2
protobuf                      4.23.2
pyarrow                       12.0.0
pyasn1                        0.5.0
pyasn1-modules                0.3.0
python-dateutil               2.8.2
pytz                          2023.3
requests                      2.31.0
rsa                           4.9
setuptools                    67.7.2
six                           1.16.0
tzdata                        2023.3
urllib3                       1.26.16
wheel                         0.40.0

Steps to reproduce

When BigQuery returns an empty result set, the Python SDK has to build the Arrow schema from the BigQuery schema itself (https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/table.py#L1844-L1853), which means it calls bq_to_arrow_data_type (https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/_pandas_helpers.py#L225-L246).

It looks like there is currently no JSON entry in the type mapping: https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/_pandas_helpers.py#L147-L162
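
For context, a condensed sketch of how that lookup behaves (illustrative, not the library's exact code; the real BQ_TO_ARROW_SCALARS dict covers STRING, INT64, BOOL, TIMESTAMP, and many more, but has no "JSON" key):

import pyarrow

# Condensed sketch of the scalar lookup in
# google/cloud/bigquery/_pandas_helpers.py (illustrative entries only).
BQ_TO_ARROW_SCALARS = {
    "STRING": pyarrow.string,
    "INT64": pyarrow.int64,
    "BOOL": pyarrow.bool_,
    # ...many more entries, but no "JSON" key.
}

def bq_to_arrow_data_type(field):
    constructor = BQ_TO_ARROW_SCALARS.get(field.field_type.upper())
    if constructor is None:
        # An unknown type such as JSON lands here. The caller then warns
        # "Unable to determine type for field ..." and gives up on building
        # an Arrow schema, so with zero rows to_arrow() calls
        # pyarrow.Table.from_batches() with neither batches nor a schema.
        return None
    return constructor()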

Code example

Error:

/opt/conda/envs/smapi/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py:267: UserWarning: Unable to determine type for field 'json'.
  warnings.warn("Unable to determine type for field '{}'.".format(bq_field.name))

Traceback (most recent call last):
  File "/home/jupyter/smapi/test_bq.py", line 12, in <module>
    df = bqexecutor.execute_query(sql)
  File "/home/jupyter/smapi/smapi/core/data_providers/sql/bq_executor.py", line 119, in execute_query
    querydf = rows.to_dataframe()
  File "/opt/conda/envs/smapi/lib/python3.9/site-packages/google/cloud/bigquery/table.py", line 2151, in to_dataframe
    progress_bar_type=progress_bar_type,
  File "/opt/conda/envs/smapi/lib/python3.9/site-packages/google/cloud/bigquery/table.py", line 1840, in to_arrow
    return pyarrow.Table.from_batches(record_batches)
  File "pyarrow/table.pxi", line 3747, in pyarrow.lib.Table.from_batches
ValueError: Must pass schema, or at least one RecordBatch

Versions used for the trace above (the smapi environment in the paths):

Python 3.9.0
google-cloud-bigquery 3.10.0
google-cloud-bigquery-storage 2.19.1
pyarrow 8.0.0
pandas 1.5.3

How to reproduce?

test_bq.py

from google.cloud import bigquery

client = bigquery.Client()

table_id = "gdw-dev-gdml.abehsu.reproduce_issue"

# Define schema
schema = [
    bigquery.SchemaField("field1", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("field2", "JSON", mode="REQUIRED"),
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)

# Perform a query (the new table is empty, so the result set has zero rows).
QUERY = f"""
select * from {table_id}
"""

query_job = client.query(QUERY)  # API request
df = query_job.to_dataframe()
print(df)

python test_bq.py

Created table gdw-dev-gdml.abehsu.reproduce_issue

Stack trace


/opt/conda/envs/smapi/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py:267: UserWarning: Unable to determine type for field 'field2'.
  warnings.warn("Unable to determine type for field '{}'.".format(bq_field.name))

Traceback (most recent call last):
  File "/home/jupyter/smapi/test_bq.py", line 41, in <module>
    df = query_job.to_dataframe()
  File "/opt/conda/envs/smapi/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1800, in to_dataframe
    return query_result.to_dataframe(
  File "/opt/conda/envs/smapi/lib/python3.9/site-packages/google/cloud/bigquery/table.py", line 2150, in to_dataframe
    record_batch = self.to_arrow(
  File "/opt/conda/envs/smapi/lib/python3.9/site-packages/google/cloud/bigquery/table.py", line 1848, in to_arrow
    return pyarrow.Table.from_batches(record_batches, schema=arrow_schema)
  File "pyarrow/table.pxi", line 3747, in pyarrow.lib.Table.from_batches
ValueError: Must pass schema, or at least one RecordBatch
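
Until a JSON mapping lands in the library, one workaround that appears to avoid the crash (a suggestion, not an official fix) is to serialize the JSON column to STRING in the query itself, so that every field in the result schema has an Arrow mapping. Reusing client and table_id from the script above:

# Workaround sketch: TO_JSON_STRING returns the JSON value as a STRING,
# a type bq_to_arrow_data_type knows how to map, so the empty-result path
# can still build an Arrow schema.
QUERY = f"select field1, TO_JSON_STRING(field2) as field2 from {table_id}"
df = client.query(QUERY).to_dataframe()
print(df)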


Thanks!

chalmerlowe commented 1 year ago

The OP points out the mapping of BigQuery to Arrow data types, and it definitely appears that JSON is not included there or in any of the other translation functions adjacent to that code.
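
If that is the root cause, a minimal fix might be a single extra entry in the scalar mapping. This is a sketch under the assumption that JSON values can be carried as Arrow strings (which is how they arrive over the wire), not a reviewed patch:

import pyarrow

# Hypothetical one-line addition to BQ_TO_ARROW_SCALARS in
# google/cloud/bigquery/_pandas_helpers.py; JSON values are transmitted as
# strings, so an Arrow string type is a plausible target.
BQ_TO_ARROW_SCALARS["JSON"] = pyarrow.string

Downstream consumers, for example db-dtypes when converting the Arrow table to pandas, would still need to agree on that representation.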