Closed galipremsagar closed 3 years ago
Here are my observations:
The only difference between the extra metadata of the files written by cudf and pandas is that the one written by cudf has
"numpy_type": "float64"
while the one written by pandas has
"numpy_type": "Float64"
When I added a correction to utils.pyx:generate_pandas_metadata:
if col_meta["numpy_type"] in ("float64",):
    col_meta["numpy_type"] = "Float64"
it fixed this issue.
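A minimal sketch of this kind of correction, applied to a plain JSON string shaped like the pandas metadata that pyarrow attaches to a schema. The helper name `patch_numpy_type`, the mapping table, and the sample metadata are hypothetical and only for illustration; this is not the actual code in utils.pyx.

```python
import json

# Hypothetical mapping, for illustration only: NumPy dtype name -> pandas
# nullable dtype name.
NUMPY_TO_PANDAS_NAME = {"float64": "Float64"}


def patch_numpy_type(metadata_json: str) -> str:
    """Rewrite "numpy_type" for columns whose dtype name has a nullable alias."""
    metadata = json.loads(metadata_json)
    for col_meta in metadata.get("columns", []):
        numpy_type = col_meta.get("numpy_type")
        # Exact lookup. Note that `x in ("float64")` would be a substring
        # test, because ("float64") is a str; a one-element tuple needs a
        # trailing comma: ("float64",).
        if numpy_type in NUMPY_TO_PANDAS_NAME:
            col_meta["numpy_type"] = NUMPY_TO_PANDAS_NAME[numpy_type]
    return json.dumps(metadata)


# Sample metadata (assumed shape, trimmed to the fields relevant here).
sample = json.dumps({
    "index_columns": [],
    "columns": [
        {"name": "a", "field_name": "a", "pandas_type": "float64",
         "numpy_type": "float64", "metadata": None},
        {"name": "b", "field_name": "b", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
})

patched = json.loads(patch_numpy_type(sample))
print([c["numpy_type"] for c in patched["columns"]])  # ['Float64', 'int64']
```

Only the float64 column is renamed; other columns pass through unchanged.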
The culprit is pa.pandas_compat.construct_metadata
Another area in cudf that uses this pyarrow API also shows the same behaviour:
In [37]: gdf.to_arrow().to_pandas()
Out[37]:
a
0 1.0
1 NaN
2 NaN
The difference is in our dtypes. Pandas uses its own Float64Dtype for its nullable numerical columns
In [5]: pdf.a.dtype
Out[5]: Float64Dtype()
In [6]: str(pdf.a.dtype)
Out[6]: 'Float64'
In [12]: type(pdf.a.dtype)
Out[12]: pandas.core.arrays.floating.Float64Dtype
that wraps an np dtype
@register_extension_dtype
class Float64Dtype(FloatingDtype):
    type = np.float64
    name = "Float64"
    __doc__ = _dtype_docstring.format(dtype="float64")
We directly use the np dtype for our numerical columns
In [9]: gdf.a.dtype
Out[9]: dtype('float64')
In [10]: str(gdf.a.dtype)
Out[10]: 'float64'
In [13]: type(gdf.a.dtype)
Out[13]: numpy.dtype
When generating pandas metadata, pyarrow uses str(column.dtype) to generate the aforementioned field.
Pandas also used NumPy dtypes for its columns until v0.24, when it added null support. The pandas docs explain that the new type used for nullable columns is an "extension type". Notably, they point out the difference between this and the underlying NumPy type:
Or the string alias "Int64" (note the capital "I", to differentiate from NumPy’s 'int64' dtype)
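A short sketch of the alias distinction quoted above, assuming pandas 0.24 or newer. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Lowercase alias: plain NumPy-backed column, cannot hold missing values.
numpy_ints = pd.Series([1, 2, 3], dtype="int64")

# Capitalized alias: pandas extension dtype with a validity mask.
nullable_ints = pd.Series([1, None, 3], dtype="Int64")

print(str(numpy_ints.dtype), isinstance(numpy_ints.dtype, np.dtype))
print(str(nullable_ints.dtype), isinstance(nullable_ints.dtype, pd.Int64Dtype))

# The extension dtype still reports the NumPy scalar type it wraps.
print(nullable_ints.dtype.type is np.int64)

# The missing entry is tracked as <NA> rather than forcing a cast to float.
print(nullable_ints.isna().tolist())
```

The two dtype strings differ only in the capital letter, which is why a string comparison on str(column.dtype) is enough to tell the two column kinds apart.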
This looks like a reasonable fix to me, I don't see any downsides to doing this. Are there any that I'm missing?
It fixes the symptom but not the issue. I filed #8707 to explain why we should use a better dtype than np.float64 for a nullable float column.
Describe the bug
This looks like a parquet writer bug. When a float column contains a mix of np.nan and <NA> values and is written to a parquet file, we are able to retrieve it correctly from cudf but not from pandas. pandas, however, is able to write this column data correctly to a parquet file, and that file can be read correctly from both cudf and pandas.
Expected behavior
I'd expect the cudf-written parquet file (i.e., cudf.parquet) to behave like the pandas.parquet file when read by both the cudf and pandas backends.