Status: Open. raginjason opened this issue 3 years ago.
For other people running into a similar issue, I was able to work around this by mapping the object-type columns in question to StringDtype(). Columns of this type appear to map to BINARY L:STRING in Parquet regardless of their contents.
This still seems like a bug to me, though. I can understand there may be a need for default types, but I don't see how INT32 is a reasonable default for the catch-all pandas type of object.
This is happening in our processes as well. We have some Decimal values that we store in object columns, but when we try to read those on Spark 3.0 it breaks our pipelines.
Hey @raginjason, I'm facing the same issue. Using StringDtype it writes correctly as BINARY; however, my NULLs are being written as 'None' (i.e. the string representation), not a real NULL value. Did you face this as well? Wondering how you dealt with that. Thanks a lot!
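One plausible cause of the "'None' instead of NULL" symptom described above (an assumption, since the commenter's code is not shown) is converting with astype(str), which stringifies missing values, instead of astype("string"), which keeps them as real missing values:

```python
import pandas as pd

s = pd.Series(["a", None], dtype="object")

# astype(str) stringifies everything: None becomes the literal string "None"
stringified = s.astype(str)

# astype("string") uses pandas' StringDtype and keeps a real missing value (pd.NA),
# which Parquet writers can then store as a true NULL
preserved = s.astype("string")

print(stringified.tolist())  # ['a', 'None']
print(preserved.isna().tolist())  # [False, True]
```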
I can confirm this is still happening using pandas 1.4.2
Confirming this is still happening in Pandas 2.2.2
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example

Once that runs, parquet-tools illustrates the issue. I would expect a datatype such as OPTIONAL BINARY L:STRING, as in:

But instead got a datatype of OPTIONAL INT32, as in:

Problem description
Writing out a column with pandas type object with no values appears to create a Parquet type of INT32, when I would expect it to be BINARY L:STRING or similar. I have a daily process that outputs a set of records to Parquet, and on days where there are no values in an object column the Parquet datatype changes to INT32, thus breaking my process because the schema has changed relative to previous days.

Expected Output
Output of pd.show_versions()