rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] `cudf.testing.assert_*_equal` raises AssertionError for equivalent `DecimalDtype`d objects #16635

Open mroeschke opened 3 weeks ago

mroeschke commented 3 weeks ago

Describe the bug

In [1]: import cudf

In [2]: ser = cudf.Series([1], dtype=cudf.Decimal128Dtype(1))

In [3]: cudf.testing.assert_series_equal(ser, ser)

AssertionError: ColumnBase are different

values are different (100.0 %)
[left]:  {"[Decimal('1')]"}
[right]: {"[Decimal('1')]"}

Expected behavior

I would expect no AssertionError.

It appears there's a testing function, dtype_can_compare_equal_to_other, used in column comparisons that over-zealously assumes two objects with DecimalDtypes shouldn't be compared to each other.


AntiKnot commented 3 weeks ago

hi @mroeschke

Based on the change history, the changes introduced type checks on DecimalDtype that were not necessary to fix the bug. I think those checks are overzealous.

AntiKnot commented 3 weeks ago

Hypothesis

cupy.asarray does not fully match numpy.asarray; at the very least, it does not support Decimal128Dtype as a dtype.

Reproduce

I tried removing cudf.core.dtypes.DecimalDtype from the check in the function dtype_can_compare_equal_to_other, so that Decimal128Dtype is treated as a numeric dtype that can compare equal to other types. The comparison then reaches this code:

def assert_column_equal(
...
                    left.apply_boolean_mask(
                        left.isnull().unary_operator("not")
                    ).values,
...

cudf/cudf/core/column/column.py

    @property
    def values(self) -> cupy.ndarray:
        """
        Return a CuPy representation of the Column.
        """
        if len(self) == 0:
            return cupy.array([], dtype=self.dtype)

        if self.has_nulls():
            raise ValueError("Column must have no nulls.")

        return cupy.asarray(self.data_array_view(mode="write"))

This will raise:

TypeError: Cannot interpret 'Decimal128Dtype(precision=1, scale=0)' as a data type

Code to reproduce:

import cudf
ser = cudf.Series([1], dtype=cudf.Decimal128Dtype(1))
left = ser._column
left.apply_boolean_mask(left.isnull().unary_operator("not")).values

With numpy:

import numpy
obj = left.apply_boolean_mask(left.isnull().unary_operator("not"))
numpy.asarray(obj)
Out[11]:
array(<cudf.core.column.decimal.Decimal128Column object at 0x726ea7de4f70>
[
  1
]
dtype: decimal128, dtype=object)
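
Note that the numpy result above is a zero-dimensional object array that merely wraps the column object; it is not a numeric conversion. A minimal sketch of that numpy behavior, using a hypothetical stand-in class (OpaqueColumn is an assumption for illustration, not a cudf type) so it runs without a GPU:

```python
import numpy as np


class OpaqueColumn:
    """Hypothetical stand-in for an object numpy cannot read as array data."""

    def __init__(self, values):
        self._values = list(values)


col = OpaqueColumn([1])

# numpy does not know how to unpack the object, so it stores a reference
# to it inside a 0-d array with dtype=object instead of converting values.
arr = np.asarray(col)

print(arr.dtype)           # object
print(arr.shape)           # ()
print(arr.item() is col)   # True: the array only wraps the original object
```

So numpy "succeeding" here does not mean the decimal values were converted; the column was simply boxed as a Python object.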

With cupy:

import cupy
cupy.asarray(obj)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 1
----> 1 cupy.asarray(obj)

File ~/Code/cudf/.venv/lib/python3.10/site-packages/cupy/_creation/from_data.py:88, in asarray(a, dtype, order, blocking)
     56 def asarray(a, dtype=None, order=None, *, blocking=False):
     57     """Converts an object to array.
     58
     59     This is equivalent to ``array(a, dtype, copy=False, order=order)``.
   (...)
     86
     87     """
---> 88     return _core.array(a, dtype, False, order, blocking=blocking)

File cupy/_core/core.pyx:2408, in cupy._core.core.array()

File cupy/_core/core.pyx:2435, in cupy._core.core.array()

File cupy/_core/core.pyx:2574, in cupy._core.core._array_default()

ValueError: Unsupported dtype object

mroeschke commented 3 weeks ago

We'll first need to assert that the dtypes are equivalent, then probably use pandas assertion functions instead of cupy/numpy for comparing decimal values.
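
A rough sketch of that two-step idea, not the actual cudf fix. It assumes the decimal values have already been copied to the host as pandas Series of decimal.Decimal objects. DecimalSpec and assert_decimal_series_equal are hypothetical names introduced only for this illustration; DecimalSpec stands in for a decimal dtype's precision and scale.

```python
from dataclasses import dataclass
from decimal import Decimal

import pandas as pd


@dataclass(frozen=True)
class DecimalSpec:
    """Hypothetical stand-in for a decimal dtype's (precision, scale)."""

    precision: int
    scale: int


def assert_decimal_series_equal(left, left_spec, right, right_spec):
    """Hypothetical helper: compare two host-side Series of Decimal values."""
    # Step 1: the dtypes must match. Decimal("1") == Decimal("1.0") in Python,
    # so comparing values alone would not notice a scale mismatch.
    if left_spec != right_spec:
        raise AssertionError(
            f"dtypes are different: {left_spec} != {right_spec}"
        )
    # Step 2: compare the values on the host with pandas instead of
    # going through a cupy/numpy array conversion.
    pd.testing.assert_series_equal(left, right)


spec = DecimalSpec(precision=1, scale=0)
ser = pd.Series([Decimal("1")], dtype=object)

# Mirrors the report: comparing a decimal Series with itself should pass.
assert_decimal_series_equal(ser, spec, ser, spec)
```

Checking the dtype first keeps the value comparison simple, since the pandas assertion then only has to compare Decimal objects element by element.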