pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: uint32 is not being preserved while round-tripping through parquet file #37327

Open galipremsagar opened 4 years ago

galipremsagar commented 4 years ago


Code Sample, a copy-pastable example

In[42]: df = pd.DataFrame({'a':pd.Series([1, 2, 3], dtype="uint64")})
In[43]: df.to_parquet('a')
In[44]: pd.read_parquet('a').dtypes
Out[44]: 
a    uint64
dtype: object
In[45]: df = pd.DataFrame({'a':pd.Series([1, 2, 3], dtype="uint32")})
In[46]: df.to_parquet('a')
In[47]: pd.read_parquet('a').dtypes
Out[47]: 
a    int64
dtype: object

Problem description

It appears that the uint32 dtype is not preserved when round-tripping through a parquet file, while uint64 is.

Expected Output

Preserve the uint32 dtype.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : db08276bc116c438d3fdee492026f8223584c477
python           : 3.7.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-52-generic
Version          : #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.1.3
numpy            : 1.19.2
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 49.6.0.post20201009
Cython           : 0.29.21
pytest           : 6.1.1
hypothesis       : 5.37.3
sphinx           : 3.2.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.18.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.8.4
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 1.0.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.51.2

Crosslinking to cudf fuzz-testing for tracking purpose: https://github.com/rapidsai/cudf/issues/6001

allenmac347 commented 3 years ago

take

Hello, I am a first-time contributor and I'm willing to examine this further!

allenmac347 commented 3 years ago

@galipremsagar Hi, after further testing it seems that uint32 is preserved when using 'fastparquet' as the engine for to_parquet and read_parquet. However, the closed issue #31896 seems to acknowledge this behavior, and a fix was merged into the main branch that makes pandas interpret the written uint32 data as int64. Do you think this could be expected behavior, or would it still be considered an ongoing issue?

galipremsagar commented 3 years ago

@allenmac347 I think it'd still be considered an issue; we'd probably need a fix similar to https://github.com/pandas-dev/pandas/pull/31918

allenmac347 commented 3 years ago

@jorisvandenbossche Hey, I noticed that in #31896, which you fixed, you said that parquet does not seem to be able to store uint32. Would you happen to know more about this, and whether it is an issue with pyarrow or with pandas? Thanks!

allenmac347 commented 3 years ago

@phofl Hi, I'm currently trying to debug this issue, but it seems like it might be an external problem with pyarrow. Here's some interesting output I get:

# The dtype uint32 is preserved here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='fastparquet')

# The dtype uint32 is preserved here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='pyarrow')

# The dtype uint32 is read back as int64 here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='pyarrow')
dataframe = pd.read_parquet('a', engine='fastparquet')

This suggests there's something wrong with how pyarrow writes uint32 to a file. Do you have any suggestions? I've tried using the pandas metadata of the parquet file to convert the dataframe back to uint32 after reading it in, but that made a lot of test cases fail.

phofl commented 3 years ago

Unfortunately I am not that familiar here. Do you know or could find out who implemented the pyarrow engine?

jorisvandenbossche commented 3 years ago

Sorry for the slow reply here. This is not directly related to #31896 (that was a bug in the conversion on our side, specifically for nullable dtypes), but is actually a limitation of pyarrow.

You can specify version="2.0", and then pyarrow will use additional type annotations in the parquet file, in which case it can actually preserve uint32. But by default it indeed does not.

So there is nothing to do on the pandas side about it (apart from maybe better documenting this). A similar issue about this on the pyarrow side is https://issues.apache.org/jira/browse/ARROW-9215

miccoli commented 2 years ago

While googling for a solution to this very same problem, I found this very useful thread.

Let me just add that version="2.0" is deprecated now, use instead "2.4" or "2.6".

I understand that it may sound obvious, but adding an explicit link in the pandas docs on where to look for the additional **kwargs passed to the underlying engine could be useful. It took me a while to figure out that the relevant docs for pyarrow are in pyarrow.parquet.ParquetWriter.