API: unified sorting - Githubissues

patricktokeeffe commented 10 years ago

originally #5190 xref #9816 xref #3942

This issue is for creating a unified API to Series & DataFrame sorting methods. Panels are not addressed (yet) but a unified API should be easy to extend to them. Related are #2094, #5190, #6847, #7121, #2615. As discussion proceeds, this post will be edited.

For reference, the 0.14.1 signatures are:

Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)

DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', 
                     na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)

Proposed unified signature for Series.sort and DataFrame.sort (except Series version retains current inplace=True):

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False, 
         kind='quicksort', na_position='last', sort_remaining=True):
         """Sort by labels (along either axis), by the values in column(s) or both.

         If both, labels take precedence over columns. If neither is specified,
         behavior is object-dependent: series = on values, dataframe = on index.

         Parameters
         ----------
         by : column name or list of column names
             if not None, sort on values in specified column name; perform nested
             sort if list of column names specified. this argument ignored by series
         axis : {0, 1}
             sort index/rows (0) or columns (1); for Series, only 0 can be specified
         level : int or level name or list of ints or list of column names
             if not None, sort on values in specified index level(s)
         ascending : bool or list of bool
             Sort ascending vs. descending. Specify list for multiple sort orders.
         inplace : bool
             if True, perform operation in-place (without creating new instance)
         kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}
             Choice of sorting algorithm. See np.sort for more information. 
             ‘mergesort’ is the only stable algorithm. For data frames, this option is 
             only applied when sorting on a single column or label.
         na_position : {'first', 'last'}
             ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end
         sort_remaining : bool
             if true and sorting by level and index is multilevel, sort by other levels
             too (in order) after sorting by specified level
         """

The sort_index signatures change too and sort_columns is created:

Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', 
                  na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, 
                     kind='quicksort', na_position='last', sort_remaining=True) 
                     # by is DEPRECATED, see change 7

DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, 
                       kind='quicksort', na_position='last', sort_remaining=True)
                       # or maybe level=None

Proposed changes:

~~make inplace=False default (changes Series.sort)~~ maybe, possibly in 1.0
new by argument to accept column-name/list-of-column-names in first position
- deprecate columns keyword of DataFrame.sort, replaced with by (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
- don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead
- don't swap order of by/axis in DataFrame.sort_index (see change 7)
- this argument is ignored by series but axis is too so for the sake of working with dataframes, it gets first position
new level argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
- replaces tuple behavior of columns arg of DataFrame.sort
- add level argument to sort_index in first position so level(s) of multilevel index can be specified; this makes sort_index==sortlevel (see change 8)
- also adds sort_remaining arg to handle multi-level indexes
new method DataFrame.sort_columns==sort(axis=1) (see syntax below)
deprecate Series.order since change 1 makes Series.sort equivalent (?)
add inplace, kind, and na_position arguments to Series.sort_index (to match DataFrame.sort_index); by and axis args are not added since they don't make sense for series
deprecate and eventually remove by argument from DataFrame.sort_index since it makes sort_index equivalent to sort
deprecate sortlevel since change 3b makes sort_index equivalent

Notes:

default behavior of sort is still object-dependent: for series, sorts by values and for data frames, sorts by index
new level arg makes sort_index and sortlevel equivalent. if sortlevel is retained:
- should rename sortlevel to sort_level for naming conventions
- Series.sortlevel should have inplace argument added
- maybe don't add level and sort_remaining args to sort_index so it's not equivalent to sort_level (intentionally limiting sort_index seems like a bad idea though)
it's unclear if default should be level=None for sort_columns. probably not since level=None falls back to level=0 anyway
both by and axis arguments should be ignored by Series.sort

Syntax:

dataframes
- sort() == sort(level=0) == sort_index() == sortlevel()
- without columns or level specified, defaults to current behavior of sort on index
- sort(['A','B'])
- since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
- sort(level='spam') == sort_index('spam') == sortlevel('spam')
- sort occurs on row index named 'spam' or level of multi-index named 'spam'
- sort(['A','B'], level='spam')
- level controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
- sort(axis=1) == sort(axis=1, level=0) == sort_columns()
- since data frames default to sort on index, leaving level=None is the same as level=0
- sort(['A','B'], axis=1) == sort_columns(['A','B'])
- as with preceding example, level=None becomes level=0 in sort_columns
- sort(['A','B'], axis=1, level='spam') == sort_columns(['A','B'], level='spam')
- axis controls level so sort will be on columns named 'A' and 'B' in column index named 'spam'
series:
- sort() == order() -- sorts on values
- with level specified, sorts on index/named index/level of multi-index:
- sort(level=0) == sort_index() == sortlevel()
- sort(level='spam') == sort_index('spam') == sortlevel('spam')

Comments welcome.

hayd commented 10 years ago

make inplace=False default (changes Series.sort)

-1. Although I'm not fan of this behaviour, IMO we're stuck with it*, this would break lots of code and there's no way to make a clean depreciation/migration path.

+1 to cleaning up sort API, the other changes seem reasonable.

*Perhaps we should name everything order ?? :S

patricktokeeffe commented 10 years ago

Regarding inplace=False as new default: could a scary warning be added to 0.15 without changing the current behavior and then in 0.16 or 0.17 make the switch?

I tried to come up with a way to detect when users might expect the original behavior... couldn't find anything clean. The best one was probably:

def sort( ..., inplace=None):
    ...
    inplace = True if inplace is None else False 
    ...

Unfortunately it's still a pretty bad idea. Backward-compatible, sure, but it's really just kicking the can down the road.

hayd commented 10 years ago

I just don't see the value proposition in this particular breakage, it will affect a lot of users, and you're not even "fixing" anything (i.e. fixing their buggy code) - you'd just be changing syntax. To quote @y-p:

if a 1000 people need to modify working code to accomodate this change, is it still worth doing?

I'd say I was on the liberal side of API breakages, but I don't see how this one can fly! :rage1:

patricktokeeffe commented 10 years ago

The more unified the sort API becomes, the more glaring the inplace inconsistency will be. That said, I think the argument is stronger to have consistent behavior vs. consistent signatures. Such a change should wait for a major version bump. (wait, who said 1.0?)

So keeping inplace=false for Series.sort means:

DataFrame.sort and Series.sort have (slightly) different signatures -- no biggie: docstrings
Series.sort would not become equivalent to Series.order -- but we still deprecate order? (I lean yes but maybe it's quite popular)
should new inplace keyword to Series.sort_index be False (to match df sorts) or True (to match s.sort)? I think False is correct so s.sort is the isolated case
if it can be agreed to change s.sort inplace in 1.0, should a warning be given?

hayd commented 10 years ago

if it can be agreed to change s.sort inplace in 1.0

big if! Definitely such a change should be discussed in the ML, but I think it's a tough sell.

I agree the inconsistency sucks, but practicality beats purity.... and this will (fairly) annoy a lot of people. I think if you're changing the API there needs to be some carrot cake rather than just stick (with this change I just see stick).

I was being serious about using/preferring .order, that route has the benefit of not being an API change and not differing in behaviour from both numpy and python itself (lists). Better long-term?

Edit: To me "sort" sounds inplace, whereas "order" is temporary arrangement.

patricktokeeffe commented 10 years ago

OK, that edit-note makes sense to me. I'll have a look at order and come back with some revisions.

jorisvandenbossche commented 9 years ago

@patricktokeeffe see also #9816

jorisvandenbossche commented 9 years ago

don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead

I don't think this is equivalent. columns with a tuple is to be able to specify to sort on a specific column when having mult-indexed columns, while level is to specify a certain level of the multi-index index to sort by (when axis=0).

patricktokeeffe commented 9 years ago

There really isn't a good compromise here; at least, I didn't find one. But #9816 has a good solution, see #10726.

pandas-dev / pandas

API: unified sorting #8239