googleapis / python-bigquery

Apache License 2.0
746 stars 307 forks source link

The DB API Binary function should accept bytes. #628

Closed jimfulton closed 3 years ago

jimfulton commented 3 years ago
(3.8) jim@ds9:~/p/g/python-bigquery-sqlalchemy$ python
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import google.cloud.bigquery.dbapi
>>> google.cloud.bigquery.dbapi.Binary(b'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jim/p/g/python-bigquery/google/cloud/bigquery/dbapi/types.py", line 42, in Binary
    return string.encode("utf-8")
AttributeError: 'bytes' object has no attribute 'encode'

Bytes are the most common way to represent binary data. Accepting strings, as it does now seems at best to be a convenience and at worst a bug magnet.

In SQLAlchemy, if you defined a model that has a binary attribute, you'd store bytes data in it, but that would break for bigquery, di to this issue.

Sqlite's Binary function requires bytes data.

I propose to change the function to accept bytes. For the sake of backward compatibility, I propose to continue to accept strings.

tswast commented 3 years ago

This makes sense to me. (Both continue to accept strings, and also support bytes)

I'm having trouble finding in the REST docs how we expect BYTES data type to be encoded, but presumably we have other support for this outside of the DB-API which we can reference.

jimfulton commented 3 years ago

At a higher level:

(unit-3-9) jim@ds9:~/p/g/python-bigquery$ python
Python 3.9.4 (default, Apr  8 2021, 08:01:43) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import google.cloud.bigquery.dbapi
/home/jim/p/g/python-bigquery/.nox/unit-3-9/lib/python3.9/site-packages/pandas/compat/__init__.py:97: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
>>> conn = google.cloud.bigquery.dbapi.connect()
>>> cursor = conn.cursor()
>>> cursor.execute("select %(:BYTES)s", [b'xxx'])
>>> list(cursor)
[Row((b'xxx',), {'f0_': 0})]
>>> cursor.execute("select %(:BYTES)s", ['yyy'])
>>> list(cursor)
[Row((b'\xcb,',), {'f0_': 0})]

Note that bytes input is handled correctly, but string input isn't.

The Binary function doesn't appear to be used by the Python client itself. When running unit tests, it's only called by a test that tests that it encodes strings.

jimfulton commented 3 years ago

I'm having trouble finding in the REST docs how we expect BYTES data type to be encoded, but presumably we have other support for this outside of the DB-API which we can reference.

From: https://cloud.google.com/bigquery/docs/reference/rest/v2/StandardSqlDataType

"Encoded as a base64 string per RFC 4648, section 4."

It appears that we encode this properly only for bytes data: https://github.com/googleapis/python-bigquery/blob/a3224337dac217ec07df83bf0ad570b7aa6d2ec9/google/cloud/bigquery/_helpers.py#L295-L299