googleapis / python-bigquery-storage


feat!: convert arrow data types to Python types in `rows()` #296

Open tswast opened 3 years ago

tswast commented 3 years ago

I notice we just loop through all the rows in a record batch here:

https://github.com/googleapis/python-bigquery-storage/blob/f7fc4ec45e62739139bf865497ce2e5fbd948d31/google/cloud/bigquery_storage_v1/reader.py#L695

This might result in some odd types, such as `Int64Scalar` and `TimestampScalar`, sneaking in.

I think we probably want to call `to_pydict()` before looping over rows/columns in the BigQuery Storage API client.
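
A minimal standalone sketch (not the client code) of the difference: indexing an Arrow array yields pyarrow scalar wrappers, while `to_pydict()` converts everything to native Python types up front.

import pyarrow as pa

# Tiny record batch, similar in shape to what the stream parser yields.
batch = pa.record_batch(
    [pa.array([480, 300], type=pa.int64()), pa.array([0.0, 2.5])],
    names=["trip_seconds", "tips"],
)

print(type(batch.column(0)[0]))                    # pyarrow.lib.Int64Scalar
print(type(batch.to_pydict()["trip_seconds"][0]))  # int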

I'm filing this issue in python-bigquery first, because I think it's worth investigating whether this is actually happening.

plamut commented 3 years ago

I did a quick test and queried some public data, fetching the results as a dataframe using the BQ Storage API under the hood.

The execution flow reached `_ArrowStreamParser.to_arrow()`, but did not hit the `to_rows()` method linked in the issue description. It seems this is not an issue when using the BigQuery client.

The script used in the test (stepped through with a debugger):

from google.cloud import bigquery

PROJECT_ID = "bigquery-public-data"
DATASET_ID = "chicago_taxi_trips"
TABLE_ID = "taxi_trips"

table_name_full = f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}"

client = bigquery.Client()

sql = f"""
SELECT taxi_id, trip_start_timestamp, trip_seconds, tips
FROM `{table_name_full}`
LIMIT 5;
"""

query_job = client.query(sql)
result = query_job.result()

# Uses the BQ Storage API under the hood when google-cloud-bigquery-storage is installed.
df = result.to_dataframe()

print(df)

FWIW, calling the `to_rows()` method by hand indeed results in pyarrow values in the row dict, for example:

{
    'taxi_id': <pyarrow.StringScalar: 'bc709a696db40a46144faa530198a65442402f42513ee44c63cc9bd1e83cadb1ef6d04b8d09bdbef85cf4b72236deff5644913be7e313a694821ce08545564bb'>,
    'trip_start_timestamp': <pyarrow.TimestampScalar: datetime.datetime(2013, 5, 24, 14, 15, tzinfo=<UTC>)>,
    'trip_seconds': <pyarrow.Int64Scalar: 480>,
    'tips': <pyarrow.DoubleScalar: 0.0>
}

(and yes, calling `record_batch.to_pydict()` in `to_rows()` would help, if necessary)
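
For illustration, a hypothetical `to_pydict()`-based row iteration could look roughly like this (the function name is made up; this is not the actual reader.py code):

def rows_from_record_batch(record_batch):
    # Convert the whole batch once into {column name: list of Python values},
    # then zip the columns back into per-row dicts.
    columns = record_batch.to_pydict()
    for i in range(record_batch.num_rows):
        yield {name: values[i] for name, values in columns.items()}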


I also checked the DB API, and there we do hit the `to_rows()` method, but we convert pyarrow values to Python values in a helper.

It would be interesting to benchmark this and see whether the Arrow-to-Python conversion is more efficient when done with a single `to_pydict()` call instead of individual `as_py()` calls (see the sketch after the DB API snippet below). Something to consider if/when we change the BQ Storage client.

from google.cloud.bigquery import dbapi

...  # continues the script above; reuses `client` and `sql`
print("Fetching data using DBAPI")
cursor = dbapi.connect(client).cursor()
cursor.execute(sql)

rows = cursor.fetchall()

print(rows)
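
As for the benchmark idea above, a rough sketch of the comparison on synthetic data (the column name is made up; actual numbers will depend on schema, batch size, and pyarrow version):

import timeit

import pyarrow as pa

batch = pa.record_batch(
    [pa.array(range(100_000), type=pa.int64())], names=["trip_seconds"]
)

def via_to_pydict():
    # One bulk conversion of the whole column.
    return batch.to_pydict()["trip_seconds"]

def via_as_py():
    # Per-value conversion, as a helper calling as_py() on each value would do.
    return [value.as_py() for value in batch.column(0)]

print("to_pydict():", timeit.timeit(via_to_pydict, number=20))
print("as_py():    ", timeit.timeit(via_as_py, number=20))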

P.S.: Changing the BQ Storage client will likely break the BigQuery DB API, so we need to coordinate the releases and implement conditional code based on the detected BQ Storage version, i.e. whether or not to call the `as_py()` method on values.
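
To make that coordination concrete, a hedged sketch of such a version gate (the cut-over version "3.0.0" is only a placeholder, and it assumes the installed package exposes `__version__` as recent releases do):

import packaging.version
from google.cloud import bigquery_storage

_BQ_STORAGE_VERSION = packaging.version.parse(bigquery_storage.__version__)
# Placeholder for the (not yet decided) release in which rows() returns Python values.
_ROWS_ARE_PYTHON_VALUES = _BQ_STORAGE_VERSION >= packaging.version.parse("3.0.0")

def to_python_value(value):
    # Unwrap pyarrow scalars only for older clients that still return them.
    return value if _ROWS_ARE_PYTHON_VALUES else value.as_py()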

tswast commented 3 years ago

Sweet. I had forgotten about this helper:

https://github.com/googleapis/python-bigquery/blob/c9068e4191dbe3632fe399a0b777e8bc54a183a6/google/cloud/bigquery/dbapi/_helpers.py#L468-L470

Thanks for investigating!

It does seem like we should change this in the BQ Storage client, but it's good to keep in mind that we'd need to coordinate the releases.

I'll move this to that repo as a feature request.