tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Modify how arrange sorts strings #7044

Closed prubin73 closed 5 months ago

prubin73 commented 5 months ago

When sorting a data frame/tibble based on a character column, arrange uses a different sort order than what is used by sort and by most (all?) spreadsheet programs. This creates issues when working on data coming from/going to a spreadsheet. Interestingly, use of the desc function within arrange switches the sort order to conform to sort and the spreadsheets.

# Demonstrate sorting discrepancy between `arrange` and `sort`.

# Create sample data. The second column is just to ensure that sorting does not
# convert a data frame into a vector.
df <- data.frame(Label = c("bama", "mama", "1000x", "BAnn", "10:00x"), Index = 1:5)

# Sort the rows into ascending label order using `dplyr::arrange`.
df |> dplyr::arrange(Label) |> print()
#>    Label Index
#> 1  1000x     3
#> 2 10:00x     5
#> 3   BAnn     4
#> 4   bama     1
#> 5   mama     2

# Sort the rows into ascending label order using `sort`.
df[sort(df$Label, index.return = TRUE)$ix, ] |> print()
#>    Label Index
#> 5 10:00x     5
#> 3  1000x     3
#> 1   bama     1
#> 4   BAnn     4
#> 2   mama     2

# Sort with `arrange` in "not descending" order.
df |> dplyr::arrange(-dplyr::desc(Label)) |> print()
#>    Label Index
#> 1 10:00x     5
#> 2  1000x     3
#> 3   bama     1
#> 4   BAnn     4
#> 5   mama     2

DavisVaughan commented 5 months ago

This is intended; arrange uses the C locale by default. See the .locale argument: https://dplyr.tidyverse.org/reference/arrange.html

You probably want to specify .locale = "en"
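For illustration, here is a minimal sketch of that suggestion applied to the sample data from the original post. It assumes dplyr 1.1.0 or later (where the .locale argument is available) and that the stringi package is installed, which dplyr needs for any locale other than "C". The names c_order and en_order are made up for this example.

```r
# Sketch only: assumes dplyr >= 1.1.0 and the stringi package are installed.
library(dplyr)

# Same sample data as in the original post.
df <- data.frame(
  Label = c("bama", "mama", "1000x", "BAnn", "10:00x"),
  Index = 1:5
)

# Default behaviour: C-locale ordering, which compares by character code,
# so digits sort before uppercase letters and uppercase before lowercase.
c_order <- df |> arrange(Label)

# Explicit English collation via the .locale argument.
en_order <- df |> arrange(Label, .locale = "en")

print(c_order)
print(en_order)
```

With .locale = "en" the case-insensitive-looking order from sort should be approximately reproduced, although the exact result depends on the collation rules shipped with stringi rather than on the operating system's locale.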

prubin73 commented 5 months ago

Thanks. I wondered if locale was an issue, but failed to read the fine print. (I assumed it would use the operating system's default locale.) It's interesting that arrange defaults to the C locale but desc apparently does not.
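A possible explanation, offered as an assumption rather than something confirmed in this thread: desc on a character vector returns the negated result of base R's xtfrm, that is, numeric ranks computed under the session's collation locale. Wrapping it as -desc(Label) therefore hands arrange a numeric key, and the .locale default for strings would not come into play. The sketch below only inspects that key; the names labels and key are made up for this example.

```r
# Sketch only: inspects the sort key that -desc() produces for a character vector.
library(dplyr)

labels <- c("bama", "mama", "1000x", "BAnn", "10:00x")

# desc() negates xtfrm(), so negating again recovers the ranks.
key <- -desc(labels)

# The key is numeric, not character, so arrange() would sort numbers here.
print(is.numeric(key))

# The ranks match base R's xtfrm(), which follows the session locale.
print(isTRUE(all.equal(key, xtfrm(labels))))

# Ordering by the key reproduces the locale-dependent order of the labels.
print(labels[order(key)])
```

Because the ranks depend on the session's collation settings, the last line can print a different order on different machines.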

DavisVaughan commented 5 months ago

Yea that's a good point, I'll open another issue about that in particular