Open wingkitlee0 opened 3 days ago
I can reproduce this behavior, and I think the current API doc doesn't specify the exact behavior under None. I think it's more reasonable to follow the Pandas way, where sort_values(by=None) simply raises an error; that is more explicit and more Pythonic. @wingkitlee0
sorted_df = df.sort_values(by=None)
Thanks for the quick PR, but I haven't thought through raising an error for None yet.
I think the option was meant to sort all columns.
At least one other place uses this "conceptually": groupby(None), which allows grouping all columns.
I am also curious whether it was working before or not.
The Ray Data groupby docs seem to have broken formatting (https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.groupby.html), and they don't clarify how it works.
But I assume that grouping all columns means grouping everything together, which is essentially doing nothing, right? If we apply the same idea to sort, then it also means we do nothing when it is set to None, wdyt? Both are explicit, instead of implicitly grouping/sorting based on some hidden factors. It might be counterintuitive for Pandas users if we do the opposite of what Pandas does.
After reading a little bit in python/ray/data/grouped_data.py (and dataset.py), I think key=None in groupby was meant to share the aggregate functions internally. Also, no sort(key=None) is called. Maybe None is unreasonable in sort() as you said.
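To illustrate the idea of sharing aggregate code through key=None, here is a minimal pure-Python sketch. It is not Ray's implementation: it only assumes that a None key places every row in one group, so the same aggregation path covers both the grouped and the global case. The names group_rows and aggregate_sum are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, Iterable, List, Optional

Row = Dict[str, Any]


def group_rows(rows: Iterable[Row], key: Optional[str]) -> Dict[Hashable, List[Row]]:
    """Hypothetical helper: bucket rows by `key`.

    Assumption: key=None means "one group containing every row".
    """
    groups: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        # With no key, every row falls into the same bucket (labelled None).
        group_id = None if key is None else row[key]
        groups[group_id].append(row)
    return dict(groups)


def aggregate_sum(rows: Iterable[Row], key: Optional[str], column: str) -> Dict[Hashable, Any]:
    """Sum `column` per group; the same code handles grouped and global sums."""
    return {
        group_id: sum(member[column] for member in members)
        for group_id, members in group_rows(rows, key).items()
    }


rows = [{"a": 1, "b": 10}, {"a": 1, "b": 20}, {"a": 2, "b": 30}]
per_key = aggregate_sum(rows, "a", "b")        # one sum per distinct value of "a"
global_total = aggregate_sum(rows, None, "b")  # a single sum over all rows
```

Under this reading, None has a natural meaning for aggregation but no equally obvious one for sorting, which is consistent with the point above that sort(key=None) is never called.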
Let's ping the Ray team after Thanksgiving week.
Hey @wingkitlee0, I think at this point we don't support sort(None), and probably don't need to support it (for example, PyArrow forces an explicit parameter, as does Pandas).
Okay, it sounds like it was never working.
Should we remove the default value of None then (and raise an error, update the doc, etc.)?
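A rough sketch of what "no default, raise on None" could look like, written against plain Python lists rather than Ray Data. The function sort_rows, its signature, and the error messages are assumptions for illustration only, not the actual Dataset.sort() code.

```python
from typing import Any, Dict, List, Sequence, Union

Row = Dict[str, Any]


def sort_rows(
    rows: List[Row],
    key: Union[str, Sequence[str]],  # no default: callers must name the column(s)
    descending: bool = False,
) -> List[Row]:
    """Hypothetical stand-in for a sort API that requires an explicit key."""
    if key is None:
        # Fail early with a clear message instead of a confusing IndexError later.
        raise ValueError("sort() requires an explicit key; got None")
    keys = [key] if isinstance(key, str) else list(key)
    if not keys:
        raise ValueError("sort() requires at least one key column")
    # Sort by the tuple of key-column values; reverse for descending order.
    return sorted(rows, key=lambda row: tuple(row[k] for k in keys), reverse=descending)


rows = [{"a": 2, "b": "x"}, {"a": 1, "b": "z"}, {"a": 1, "b": "y"}]
by_a_then_b = sort_rows(rows, ["a", "b"])
by_a_desc = sort_rows(rows, "a", descending=True)
```

With this shape, calling sort_rows(rows) fails with a TypeError for the missing argument, and sort_rows(rows, None) fails with the explicit ValueError, so neither case reaches the sorting logic.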
What happened + What you expected to happen
Dataset.sort() has a default key value of None, which should mean it sorts all columns. But it raises IndexError.
Also, the doc does not say what None does.
Versions / Dependencies
Reproduction script
It also does not work for sort(None).
Issue Severity
Medium: It is a significant difficulty but I can work around it.