Closed patricktokeeffe closed 9 years ago
make inplace=False default (changes Series.sort)
-1. Although I'm not fan of this behaviour, IMO we're stuck with it*, this would break lots of code and there's no way to make a clean depreciation/migration path.
+1 to cleaning up sort API, the other changes seem reasonable.
*Perhaps we should name everything order ?? :S
Regarding inplace=False
as new default: could a scary warning be added to 0.15 without changing the current behavior and then in 0.16 or 0.17 make the switch?
I tried to come up with a way to detect when users might expect the original behavior... couldn't find anything clean. The best one was probably:
def sort( ..., inplace=None):
...
inplace = True if inplace is None else False
...
Unfortunately it's still a pretty bad idea. Backward-compatible, sure, but it's really just kicking the can down the road.
I just don't see the value proposition in this particular breakage, it will affect a lot of users, and you're not even "fixing" anything (i.e. fixing their buggy code) - you'd just be changing syntax. To quote @y-p:
if a 1000 people need to modify working code to accomodate this change, is it still worth doing?
I'd say I was on the liberal side of API breakages, but I don't see how this one can fly! :rage1:
The more unified the sort API becomes, the more glaring the inplace inconsistency will be. That said, I think the argument is stronger to have consistent behavior vs. consistent signatures. Such a change should wait for a major version bump. (wait, who said 1.0?)
So keeping inplace=false for Series.sort means:
DataFrame.sort
and Series.sort
have (slightly) different signatures -- no biggie: docstringsSeries.sort
would not become equivalent to Series.order
-- but we still deprecate order? (I lean yes but maybe it's quite popular)inplace
keyword to Series.sort_index
be False (to match df sorts) or True (to match s.sort)? I think False is correct so s.sort is the isolated caseif it can be agreed to change s.sort inplace in 1.0
big if! Definitely such a change should be discussed in the ML, but I think it's a tough sell.
I agree the inconsistency sucks, but practicality beats purity.... and this will (fairly) annoy a lot of people. I think if you're changing the API there needs to be some carrot cake rather than just stick (with this change I just see stick).
I was being serious about using/preferring .order
, that route has the benefit of not being an API change and not differing in behaviour from both numpy and python itself (lists). Better long-term?
Edit: To me "sort" sounds inplace, whereas "order" is temporary arrangement.
OK, that edit-note makes sense to me. I'll have a look at order
and come back with some revisions.
@patricktokeeffe see also #9816
don't allow tuples to access levels of multi-index (
columns
arg ofDataFrame.sort
allows tuples); use newlevel
argument instead
I don't think this is equivalent. columns
with a tuple is to be able to specify to sort on a specific column when having mult-indexed columns, while level
is to specify a certain level of the multi-index index to sort by (when axis=0).
There really isn't a good compromise here; at least, I didn't find one. But #9816 has a good solution, see #10726.
originally #5190 xref #9816 xref #3942
This issue is for creating a unified API to Series & DataFrame sorting methods. Panels are not addressed (yet) but a unified API should be easy to extend to them. Related are #2094, #5190, #6847, #7121, #2615. As discussion proceeds, this post will be edited.
For reference, the 0.14.1 signatures are:
Proposed unified signature for
Series.sort
andDataFrame.sort
(except Series version retains current inplace=True):The
sort_index
signatures change too andsort_columns
is created:Proposed changes:
makemaybe, possibly in 1.0inplace=False
default (changesSeries.sort
)by
argument to accept column-name/list-of-column-names in first positioncolumns
keyword ofDataFrame.sort
, replaced withby
(df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)columns
arg ofDataFrame.sort
allows tuples); use newlevel
argument insteadby
/axis
inDataFrame.sort_index
(see change 7)axis
is too so for the sake of working with dataframes, it gets first positionlevel
argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)columns
arg ofDataFrame.sort
level
argument tosort_index
in first position so level(s) of multilevel index can be specified; this makessort_index
==sortlevel
(see change 8)sort_remaining
arg to handle multi-level indexesDataFrame.sort_columns
==sort(axis=1)
(see syntax below)Series.order
since change 1 makesSeries.sort
equivalent (?)inplace
,kind
, andna_position
arguments toSeries.sort_index
(to matchDataFrame.sort_index
);by
andaxis
args are not added since they don't make sense for seriesby
argument fromDataFrame.sort_index
since it makessort_index
equivalent tosort
sortlevel
since change 3b makessort_index
equivalentNotes:
sort
is still object-dependent: for series, sorts by values and for data frames, sorts by indexlevel
arg makessort_index
andsortlevel
equivalent. if sortlevel is retained:sortlevel
tosort_level
for naming conventionsSeries.sortlevel
should haveinplace
argument addedlevel
andsort_remaining
args tosort_index
so it's not equivalent tosort_level
(intentionally limiting sort_index seems like a bad idea though)level=None
forsort_columns
. probably not since level=None falls back to level=0 anywayby
andaxis
arguments should be ignored bySeries.sort
Syntax:
sort()
==sort(level=0)
==sort_index()
==sortlevel()
sort(['A','B'])
sort(level='spam')
==sort_index('spam')
==sortlevel('spam')
sort(['A','B'], level='spam')
level
controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'sort(axis=1)
==sort(axis=1, level=0)
==sort_columns()
sort(['A','B'], axis=1)
==sort_columns(['A','B'])
sort(['A','B'], axis=1, level='spam')
==sort_columns(['A','B'], level='spam')
axis
controlslevel
so sort will be on columns named 'A' and 'B' in column index named 'spam'sort()
==order()
-- sorts on valueslevel
specified, sorts on index/named index/level of multi-index:sort(level=0)
==sort_index()
==sortlevel()
sort(level='spam')
==sort_index('spam')
==sortlevel('spam')
Comments welcome.