pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: allow sort_values to use the natural sort order #36286

Closed AlexeyGy closed 4 years ago

AlexeyGy commented 4 years ago

Is your feature request related to a problem?

The natural sort order is a common use case when working with real-world data. For example, consider the following DataFrame of clinical data where the body temperature of patients was measured:

import pandas as pd

data = {
    'Patient_ID': {0: 'ID-1', 1: 'ID-11', 2: 'ID-2'},
    'temperature': {0: 37.2, 1: 37.5, 2: 37.2},
}
df = pd.DataFrame(data).sort_values(by=['Patient_ID'])
df.head(5)

will yield:

  Patient_ID  temperature
0       ID-1         37.2
1      ID-11         37.5
2       ID-2         37.2

whereas we would want

  Patient_ID  temperature
0       ID-1         37.2
2       ID-2         37.2
1      ID-11         37.5
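
The difference comes from how strings are compared: character by character, so 'ID-11' sorts before 'ID-2' because '1' < '2' at the fourth position. A minimal plain-Python sketch of the same effect, using only the three IDs from the example above (no pandas involved):

```python
# Plain string comparison is lexicographic: it compares one character at a time.
ids = ["ID-1", "ID-11", "ID-2"]

# 'ID-11' and 'ID-2' first differ at index 3, where '1' < '2'.
eleven_before_two = "ID-11" < "ID-2"
print(eleven_before_two)

# The built-in sort therefore keeps 'ID-11' ahead of 'ID-2'.
lexicographic = sorted(ids)
print(lexicographic)
```

A natural sort would instead compare the digit runs as numbers, so that 2 comes before 11.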

Describe the solution you'd like

Since we are only adding a parameter, this would not break any existing API.

Describe alternatives you've considered

Currently, one could use the natsort package. However, this seems cumbersome for such a common operation and requires reindexing the DataFrame (see the Stack Overflow example).

jreback commented 4 years ago

you can probably use the key= option here (happy to take a PR to show this in the docs)

-1 on adding any api

erfannariman commented 4 years ago

Is it okay to give an example in the docs using a third-party package like natsort? @jreback

satrio-hw commented 4 years ago

Hi, I solved the problem using key, as @jreback said:

df = pd.DataFrame(data).sort_values(by=['Patient_ID'], key=lambda col: col.astype(str).str[3:].astype(int))

The key takes each Patient_ID as a string, slices it with [3:] (so it ignores 'ID-'), and converts the rest to int for sorting.
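
The slice-based key relies on every value having a three-character prefix followed only by digits. A dependency-free sketch that drops this assumption is to zero-pad every run of digits, so that ordinary string comparison orders the numbers by value. The helper name `natural_key` and the `width` parameter are hypothetical, introduced only for illustration, and the sketch assumes no digit run is longer than `width` characters:

```python
import re


def natural_key(text, width=10):
    """Zero-pad each run of digits so string comparison follows numeric order.

    Hypothetical helper; assumes no digit run is longer than `width`.
    """
    return re.sub(r"\d+", lambda match: match.group().zfill(width), text)


ids = ["ID-1", "ID-11", "ID-2"]

# Sort the raw strings by their padded form.
ordered = sorted(ids, key=natural_key)
print(ordered)  # ['ID-1', 'ID-2', 'ID-11']
```

With pandas, the same helper could presumably be applied column-wise through key, for example df.sort_values(by='Patient_ID', key=lambda col: col.map(natural_key)), since key expects a function that takes a Series and returns a Series of the same length.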

erfannariman commented 4 years ago

Here's a more general solution, which should work in all cases:

import numpy as np
from natsort import index_natsorted

df.sort_values(
    by="col_name",
    # x is the column being sorted; argsort turns the natural-order index into ranks
    key=lambda x: np.argsort(index_natsorted(x))
)

Or without the lambda:

def natural_sort(column):
    idx = index_natsorted(column)
    return np.argsort(idx)

df.sort_values(by="col_name", key=natural_sort)

This requires installing natsort first: pip install natsort

jreback commented 4 years ago

@erfannariman so happy to take a PR to add that as an example; it would be ok i think to add to environment.yml with an appropriate comment (used in doc-strings) as this environment builds the docs.

erfannariman commented 4 years ago

@jreback PR is ready for review, weird stata test failing, will check later.

jaredight commented 1 year ago

@erfannariman Could you update the docs to reflect your answer here?

from natsort import natsort_keygen

df.sort_values(
    by="time",
    key=natsort_keygen()
)