pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.63k stars 17.92k forks

Cannot hash table with index containing mixed types including non utf-8 bytes strings #27215

Open stestagg opened 5 years ago

stestagg commented 5 years ago

Example

import pandas
from pandas.util import hash_pandas_object
hash_pandas_object(pandas.DataFrame({'a': [1,2]}, index=[1, b'\xff1']), encoding='latin1')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Problem description

This is pretty niche :)

If a table passed to hash_pandas_object has an index with mixed types, then this branch is followed: https://github.com/pandas-dev/pandas/blob/4e185fcaedfe75050a3aa4e9fa175f9579825388/pandas/core/util/hashing.py#L297-L298

which calls vals.astype(str) (I'm assuming so that the values can be converted to useful Python objects), where vals is a NumPy array.

As shown here, this does not work if the array contains byte strings that are not valid ASCII:

>>> import numpy as np
>>> np.array([b'\xff']).astype(str)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
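
The codec dependence can be shown in plain Python, without NumPy. This is only a sketch, assuming (as the error message suggests) that the conversion behaves like bytes.decode('ascii'); the variable names are illustrative.

```python
raw = b"\xff1"  # same bytes as the index label in the example

# Decoding as ASCII fails, because 0xff is outside the 0-127 range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# latin1 maps every byte value to a code point, so decoding cannot fail.
as_latin1 = raw.decode("latin1")

print(ascii_ok)         # False
print(repr(as_latin1))  # 'ÿ1'
```

So the bytes themselves are fine; the failure depends only on which codec is used to decode them.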

There is an encoding argument that can be passed to hash_pandas_object, but it is not used when converting the values to str.
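
One possible direction is to decode bytes values with the caller's encoding before stringifying everything else. The sketch below is a rough illustration only, not how pandas handles this: to_text is a hypothetical helper, not a pandas or NumPy function, and it works on a plain list rather than on vals directly.

```python
def to_text(values, encoding="utf8"):
    """Return str versions of values, decoding bytes with `encoding`.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = []
    for value in values:
        if isinstance(value, (bytes, bytearray)):
            # Honor the caller's encoding instead of the implicit ASCII.
            out.append(bytes(value).decode(encoding))
        else:
            out.append(str(value))
    return out


mixed_index = [1, b"\xff1"]  # the index from the example above
normalized = to_text(mixed_index, encoding="latin1")
print(normalized)  # ['1', 'ÿ1']
```

As with astype(str), this makes values such as 1, '1' and b'1' indistinguishable, so they would produce the same hash. Whether that is acceptable is a design question for the real fix.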

Expected Output

1        XXXXXXXXXXXXXXXXXXXXXX
b'11'    XXXXXXXXXXXXXXXXXXXXXX
dtype: uint64

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 5.0.0
pip: 19.1.1
setuptools: 40.8.0
Cython: 0.29.11
numpy: 1.18.0.dev0+13de4d8
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 7.5.0
sphinx: 2.1.2
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2019.1
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.5
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

TomAugspurger commented 5 years ago

@stestagg do you have a proposed fix?