galipremsagar opened this issue 4 years ago
take

Hello, I am a first-time contributor and I'm willing to examine this further!
@galipremsagar Hi, after further testing it seems like uint32 is preserved when using 'fastparquet' as the engine for to_parquet and read_parquet. However, the closed issue #31896 seems to acknowledge this behavior, and a fix was introduced and merged into the main branch to make pandas interpret the written uint32 data as int64 data. I was wondering if you think this could be expected behavior, or would this still be considered an ongoing issue?
@allenmac347 I think it'd still be considered an issue; we'd probably need a fix similar to https://github.com/pandas-dev/pandas/pull/31918
@jorisvandenbossche Hey, I noticed that in #31896, which you fixed, you said that parquet does not seem to be able to store uint32. Would you happen to know more about this, and whether it is an issue with pyarrow or with pandas? Thanks!
@phofl Hi, I'm currently trying to debug this issue, but it seems like this might be an external problem with pyarrow. Here's some interesting output I get:
```python
# The uint32 dtype is preserved here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='fastparquet')
```

```python
# The uint32 dtype is preserved here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='pyarrow')
```

```python
# The uint32 dtype is read back as an int64 here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='pyarrow')
dataframe = pd.read_parquet('a', engine='fastparquet')
```
I feel like this means there's something wrong with how pyarrow writes uint32 to a file. Do you have any suggestions? I've tried using the pandas metadata of the parquet file to convert the dataframe back to uint32 after reading it in, but that made a lot of test cases fail.
Unfortunately I am not that familiar here. Do you know or could find out who implemented the pyarrow engine?
Sorry for the slow reply here. This is not directly related to #31896 (that was a bug in the conversion on our side, specifically for nullable dtypes); it is actually a limitation of pyarrow.

You can specify `version="2.0"`, and then pyarrow will use additional type annotations in the parquet file, in which case it can actually preserve uint32. But by default it indeed does not.

So there is nothing to do on the pandas side about it (apart from maybe documenting this better). A similar issue on the pyarrow side is https://issues.apache.org/jira/browse/ARROW-9215
While googling to solve this very same problem, I found this very useful thread.

Let me just add that `version="2.0"` is deprecated now; use `"2.4"` or `"2.6"` instead.

I understand that it may sound obvious, but adding an explicit link in the pandas docs on where to look for the additional `**kwargs` to be passed to the underlying engine could be useful. In fact it took me a while to figure out that the relevant docs for pyarrow are in pyarrow.parquet.ParquetWriter.
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description

It appears that the `uint32` dtype is not preserved while round-tripping through a parquet file, whereas `uint64` is preserved.

Expected Output

Preserve the `uint32` dtype.

Output of `pd.show_versions()`
Crosslinking to cudf fuzz-testing for tracking purposes: https://github.com/rapidsai/cudf/issues/6001