[Data] Dataset.unique() raises error in case of any null values

bdewilde commented 9 months ago

What happened + What you expected to happen

I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling Dataset.unique(colname) on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a pandas.Series works just fine, as does getting unique values via Python built-ins.

Here are the two versions of the TypeError I got, seemingly raised from the same line of code:

File ~/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

TypeError: '<' not supported between instances of 'NoneType' and 'int'

and

File ~/.pyenv/versions/3.9.18/envs/test-env/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

File missing.pyx:419, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous
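
Both tracebacks end in np.lexsort, which has to order the sampled values. As a rough illustration of the first failure (not Ray's actual code; the built-in sorted() is only a stand-in for the sort step), the sketch below shows that hash-based deduplication tolerates None while anything that needs an ordering does not:

```python
items = [1, 2, 3, 2, 3, None]

# Hash-based deduplication never compares elements with "<", so None is fine.
unique_by_hash = set(items)

# Sort-based processing must order the elements, which fails for None vs. int.
try:
    sorted(items)
    sort_error = None
except TypeError as exc:
    sort_error = exc

print(unique_by_hash)
print(type(sort_error).__name__, sort_error)
```

The second error (pandas NA) is presumably the same problem in a different guise: ordering values requires evaluating a comparison as a boolean, and NA refuses to be coerced to one.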

Versions / Dependencies

macOS 14.1
Python 3.9
ray == 2.9.0
pandas == 2.1.0

Reproduction script

import pandas as pd
import ray.data

items = [1, 2, 3, 2, 3, None]
# set(items) works fine, as expected
ds1 = ray.data.from_items(items)
ds1.unique("item")
# raises TypeError: '<' not supported between instances of 'NoneType' and 'int'

df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
# df["col"].unique() works fine, as expected
ds2 = ray.data.from_pandas(df)
ds2.unique("col")
# raises TypeError: boolean value of NA is ambiguous

Issue Severity

Medium: It is a significant difficulty but I can work around it.
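
One possible shape for such a workaround, sketched in plain Python: set the nulls aside, deduplicate the remaining values (which can now be sorted safely), and re-attach a single None at the end. The helper name unique_allowing_nulls is hypothetical and is not part of Ray or pandas; this is an illustration of the idea, not necessarily the workaround used in practice.

```python
from typing import Any, Iterable, List


def unique_allowing_nulls(values: Iterable[Any]) -> List[Any]:
    """Return sorted unique non-null values, followed by one None if any null was seen.

    Hypothetical helper for illustration; not a Ray or pandas API.
    """
    non_null = []
    saw_null = False
    for value in values:
        if value is None:
            saw_null = True
        else:
            non_null.append(value)
    # Sorting is safe here because None has already been removed.
    result = sorted(set(non_null))
    if saw_null:
        result.append(None)
    return result


print(unique_allowing_nulls([1, 2, 3, 2, 3, None]))  # [1, 2, 3, None]
```

With Ray itself, the analogous approach would presumably be to filter out the null rows before calling Dataset.unique(); that variant is untested here.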

Akshi22 commented 7 months ago

Hello burton, I'd like to work on this issue! TIA.

bdewilde commented 7 months ago

hi @Akshi22 , don't let me get in your way! though it looks like @ujjawal-khare-27 has already submitted a pr to fix this issue. maybe you can help there?

bdewilde commented 7 months ago

For what it's worth, I just ran into this issue again, only this time in the context of Dataset.groupby(col). It's the same error message, and presumably the same code under the hood. Just a bummer.

csking101 commented 2 months ago

Hi, is this issue still open? If so, I'd like to get started contributing to Ray.io!
