pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.19k stars 17.77k forks source link

BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #59242

Open dbrownems opened 1 month ago

dbrownems commented 1 month ago

Pandas version checks

Reproducible Example

import sqlite3
import pandas as pd

db = sqlite3.connect("file::memory:?cache=shared")

query = f"""

select cast(x'0123456789abcdef0123456789abcdef' as blob) a

"""

df = pd.read_sql(query, db, dtype_backend='pyarrow')

display(df)

Issue Description

This fails with

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 4: invalid start byte

The repro is for sqllite, but the issue is the same with sqlalchemy and pyodbc.

Also read_sql_table fails with the same error.

Expected Behavior

Should succeed and return a dataframe with a binary column. It works with the default backend.

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.10.12.final.0 python-bits : 64 OS : Linux OS-release : 5.15.158.2-1.cm2 Version : #1 SMP Sun Jun 9 18:33:38 UTC 2024 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.2 numpy : 1.24.3 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.1.2 Cython : 3.0.4 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.3 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.14.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2023.10.0 gcsfs : None matplotlib : 3.7.2 numba : 0.57.1 numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 12.0.1 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.10.1 sqlalchemy : 2.0.22 tables : None tabulate : 0.9.0 xarray : None xlrd : None zstandard : 0.21.0 tzdata : 2023.3 qtpy : None pyqt5 : None
asishm commented 1 month ago

Thanks for the report. This was likely introduced with #50048 when the nullable keyword (now renamed to dtype_backend) was added to sql functions.

There is a comment https://github.com/pandas-dev/pandas/pull/50048#discussion_r1184376583 which is where it assumes it's string data. cc @phofl

Full Traceback

Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0x89 in position 4: invalid start byte
  File "/home/asishm/pandas-asishm/pandas/core/arrays/string_.py", line 412, in _from_sequence
    result = lib.ensure_string_array(scalars, na_value=libmissing.NA, copy=copy)
  File "/home/asishm/pandas-asishm/pandas/core/internals/construction.py", line 972, in convert
    arr = arr_cls._from_sequence(arr, dtype=new_dtype)
  File "/home/asishm/pandas-asishm/pandas/core/internals/construction.py", line 993, in <listcomp>
    arrays = [convert(arr) for arr in content]
  File "/home/asishm/pandas-asishm/pandas/core/internals/construction.py", line 993, in convert_object_array
    arrays = [convert(arr) for arr in content]
  File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 161, in _convert_arrays_to_dataframe
    arrays = convert_object_array(
  File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 198, in _wrap_result
    frame = _convert_arrays_to_dataframe(data, columns, coerce_float, dtype_backend)
  File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 2738, in read_query
    frame = _wrap_result(
  File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 691, in read_sql
    return pandas_sql.read_query(
  File "/home/asishm/pd-issues/59242.py", line 12, in <module>
    df = pd.read_sql(query, db, dtype_backend='pyarrow')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 4: invalid start byte
sukriti1 commented 1 month ago

take