googleapis / python-bigquery

Apache License 2.0

BigQuery Python client fails to query JSON field with heterogeneous data types #1980

Open Hugo-Polloli opened 4 months ago

Hugo-Polloli commented 4 months ago

Environment details

Steps to reproduce

  1. Create a table like so:

     ```sql
     CREATE OR REPLACE TABLE `my_table`
     AS
     SELECT JSON'[{"foo":"A"},{"foo":5}]' AS DATA
     ```

  2. Query it from Python:

     ```python
     from google.cloud.bigquery import Client

     client = Client()

     df = client.query("SELECT * FROM my_table").to_dataframe()

     print(df["DATA"][0])
     ```


#### Expected behavior
This should print:

```
[{'foo':'A'},{'foo':5}]
```
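As a workaround until inference handles this, the JSON column can be serialized to a string in SQL (e.g. with `TO_JSON_STRING(DATA)`) and parsed client-side; Python's `json` module handles heterogeneous value types for the same key without issue. A minimal sketch (the SQL cast and the literal below are illustrative):

```python
import json

# What the client would receive if the JSON column were serialized in SQL,
# e.g. SELECT TO_JSON_STRING(DATA) AS DATA FROM my_table.
raw = '[{"foo":"A"},{"foo":5}]'

# json.loads happily mixes value types for the same key.
data = json.loads(raw)
print(data)  # [{'foo': 'A'}, {'foo': 5}]
print(type(data[0]["foo"]).__name__, type(data[1]["foo"]).__name__)  # str int
```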


#### Stack trace

```
Traceback (most recent call last):
  File "/home/hugo/test/test.py", line 9, in
    ).to_dataframe()
    ^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 2053, in to_dataframe
    return query_result.to_dataframe(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/table.py", line 2379, in to_dataframe
    record_batch = self.to_arrow(
                   ^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/table.py", line 1946, in to_arrow
    for record_batch in self.to_arrow_iterable(
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/table.py", line 1809, in _to_page_iterable
    yield from result_pages
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 731, in download_arrow_row_iterator
    yield _row_iterator_page_to_arrow(page, column_names, arrow_types)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hugo/test/venv/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 704, in _row_iterator_page_to_arrow
    arrays.append(pyarrow.array(page._columns[column_index], type=arrow_type))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 368, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'A' with type str: tried to convert to int64
```



The error suggests that PyArrow is attempting to convert all values for the `"foo"` key to `int64`, based on the presence of the integer value `5`, and then fails when it encounters the string value `"A"` for the same key.

The solution would be to make the type inference during JSON data ingestion handle heterogeneous data types for the same key.

One thing of note is that the JSON type is not yet explicitly handled in `_pandas_helpers.py`: printing `arrow_types` inside `download_arrow_row_iterator` during the execution of the above code yields `[None]`, as no mapping exists from the BigQuery JSON type to any pyarrow type.