[data] Default value of `key` in sort(key=None) raises IndexError

ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

https://ray.io

Apache License 2.0

34.21k stars 5.81k forks source link

[data] Default value of `key` in sort(key=None) raises IndexError #48926

Open wingkitlee0 opened 3 days ago

wingkitlee0 commented 3 days ago

What happened + What you expected to happen

Dataset.sort() has a default key value of None, which should mean it sorts all columns. But it raises IndexError

IndexError: list index out of range

Also, the doc does not say what None does.

Versions / Dependencies

In [29]: ray.__version__
Out[29]: '2.39.0'

In [31]: np.__version__
Out[31]: '1.26.2'

Reproduction script

In [1]: import ray

In [2]: import pandas as pd

In [3]: df = pd.DataFrame([[1, 2], [4, 5], [6,7]], columns=["x", "y"])

In [4]: ds = ray.data.from_pandas(df)
2024-11-27 15:42:44,110 INFO worker.py:1807 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

In [5]: ds.sort().take_all()

It also does not work for sort(None)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Superskyyy commented 1 day ago

Can reproduce this behavior and I think current API doc didn't specify the exact behavior under None. I think its more reasonable to follow Pandas way of sort_values(by=None) which will simply raise an error, its more explicit to the pythonic way. @wingkitlee0 sorted_df = df.sort_values(by=None)

wingkitlee0 commented 1 day ago

Thanks for the quick PR, but I haven't thought through raising error for None yet.

I think the option was meant to sort all columns.

At least one other place uses this "conceptually" is groupby(None), which allow grouping all columns.

I am also curious whether it was working before or not.

Superskyyy commented 1 day ago

Thanks for the quick PR, but I haven't thought through raising error for None yet.

I think the option was meant to sort all columns.

At least one other place uses this "conceptually" is groupby(None), which allow grouping all columns.

I am also curious whether it was working before or not.

The Ray Data groupby docs seems to be broken in formatting https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.groupby.html, it doesn't clarify how it works.

But I assume that grouping all columns meaning grouping everything together right, essentially === doing nothing ? If we apply the same idea to sort, then it also means we do nothing when it is set to None, wdyt? Both are explicit instead of implicitly grouping/sorting based on some hidden factors. It might be counterintuitive for Pandas users if we do the opposite behavior against Pandas.

wingkitlee0 commented 1 day ago

After reading a little bit in python/ray/data/grouped_data.py (and dataset.py), I think key=None in groupby was meant to share the aggregate functions internally. Also, no sort(key=None) is called. Maybe None is unreasonable in sort() as you said.

Let's ping the ray team after the thanksgiving week..

richardliaw commented 1 day ago

Hey @wingkitlee0, I think at this point we don't support sort(None), and probably don't need to support it (for example, Pyarrow forces an explicit parameter, Pandas as well)

wingkitlee0 commented 1 day ago

Okay, it sounds like it was never working.

we should remove the default value of None then? (and raise error, update doc, etc...)