Closed Bisaloo closed 4 years ago
Hi, thanks for the suggestion. I'm not very familiar with locales; can you give an example of how the sorting differs?
I'd rather not change locales because that has funky side effects on some systems (though C is usually safe).
Here is a reprex:
test <- c(
letters,
LETTERS,
"Hugo's words",
"change"
)
library(withr)
with_locale(c("LC_COLLATE" = "fr_FR.UTF-8"), sort(test))
#> [1] "a" "A" "b" "B" "c"
#> [6] "C" "change" "d" "D" "e"
#> [11] "E" "f" "F" "g" "G"
#> [16] "h" "H" "Hugo's words" "i" "I"
#> [21] "j" "J" "k" "K" "l"
#> [26] "L" "m" "M" "n" "N"
#> [31] "o" "O" "p" "P" "q"
#> [36] "Q" "r" "R" "s" "S"
#> [41] "t" "T" "u" "U" "v"
#> [46] "V" "w" "W" "x" "X"
#> [51] "y" "Y" "z" "Z"
with_locale(c("LC_COLLATE" = "C"), sort(test))
#> [1] "A" "B" "C" "D" "E"
#> [6] "F" "G" "H" "Hugo's words" "I"
#> [11] "J" "K" "L" "M" "N"
#> [16] "O" "P" "Q" "R" "S"
#> [21] "T" "U" "V" "W" "X"
#> [26] "Y" "Z" "a" "b" "c"
#> [31] "change" "d" "e" "f" "g"
#> [36] "h" "i" "j" "k" "l"
#> [41] "m" "n" "o" "p" "q"
#> [46] "r" "s" "t" "u" "v"
#> [51] "w" "x" "y" "z"
with_locale(c("LC_COLLATE" = "sk_SK.UTF-8"), sort(test))
#> [1] "a" "A" "b" "B" "c"
#> [6] "C" "d" "D" "e" "E"
#> [11] "f" "F" "g" "G" "h"
#> [16] "H" "Hugo's words" "change" "i" "I"
#> [21] "j" "J" "k" "K" "l"
#> [26] "L" "m" "M" "n" "N"
#> [31] "o" "O" "p" "P" "q"
#> [36] "Q" "r" "R" "s" "S"
#> [41] "t" "T" "u" "U" "v"
#> [46] "V" "w" "W" "x" "X"
#> [51] "y" "Y" "z" "Z"
Created on 2020-04-03 by the reprex package (v0.3.0)
There are also plenty of other examples where it can go wrong because of diacritics in names, but I can't find a good reprex for this right now.
For a simple real-life example, see https://github.com/ropensci/lightr/commit/515d193373b35d1faddc117c6606a97ca7a32c74#diff-89da0e7dae7c72fd9541f184b5112343L13-L15 where OceanOptics and O'Hanlon swapped positions. This is for a simple package, but for larger ones with long vignettes it is more annoying.
EDIT: from ?Comparison:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
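To make the quoted passage concrete, here is a minimal base R sketch that compares two strings under the C collation. It assumes only that the "C" locale is available. The result under any other locale is deliberately not shown, because it depends on the system.

```r
# Remember the current collation so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch to the C locale, where strings are compared byte by byte.
invisible(Sys.setlocale("LC_COLLATE", "C"))

# "B" is byte 66 and "a" is byte 97, so "B" comes first in the C locale.
b_before_a <- "B" < "a"
print(b_before_a)

# Restore the original collation.
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

In most natural-language locales the same comparison would be expected to go the other way, which is exactly the kind of difference visible in the reprex above.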
And if you think changing locales is a bad idea, method = "radix" is not as bad as I thought it would be. It is pretty good, even. I was expecting a somewhat random order.
test <- c(
letters,
LETTERS,
"Hugo's words",
"change"
)
library(withr)
with_locale(c("LC_COLLATE" = "fr_FR.UTF-8"), sort(test, method = "radix"))
#> [1] "A" "B" "C" "D" "E"
#> [6] "F" "G" "H" "Hugo's words" "I"
#> [11] "J" "K" "L" "M" "N"
#> [16] "O" "P" "Q" "R" "S"
#> [21] "T" "U" "V" "W" "X"
#> [26] "Y" "Z" "a" "b" "c"
#> [31] "change" "d" "e" "f" "g"
#> [36] "h" "i" "j" "k" "l"
#> [41] "m" "n" "o" "p" "q"
#> [46] "r" "s" "t" "u" "v"
#> [51] "w" "x" "y" "z"
with_locale(c("LC_COLLATE" = "C"), sort(test, method = "radix"))
#> [1] "A" "B" "C" "D" "E"
#> [6] "F" "G" "H" "Hugo's words" "I"
#> [11] "J" "K" "L" "M" "N"
#> [16] "O" "P" "Q" "R" "S"
#> [21] "T" "U" "V" "W" "X"
#> [26] "Y" "Z" "a" "b" "c"
#> [31] "change" "d" "e" "f" "g"
#> [36] "h" "i" "j" "k" "l"
#> [41] "m" "n" "o" "p" "q"
#> [46] "r" "s" "t" "u" "v"
#> [51] "w" "x" "y" "z"
with_locale(c("LC_COLLATE" = "sk_SK.UTF-8"), sort(test, method = "radix"))
#> [1] "A" "B" "C" "D" "E"
#> [6] "F" "G" "H" "Hugo's words" "I"
#> [11] "J" "K" "L" "M" "N"
#> [16] "O" "P" "Q" "R" "S"
#> [21] "T" "U" "V" "W" "X"
#> [26] "Y" "Z" "a" "b" "c"
#> [31] "change" "d" "e" "f" "g"
#> [36] "h" "i" "j" "k" "l"
#> [41] "m" "n" "o" "p" "q"
#> [46] "r" "s" "t" "u" "v"
#> [51] "w" "x" "y" "z"
Created on 2020-04-03 by the reprex package (v0.3.0)
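Since diacritics were mentioned earlier without a reprex, here is a small sketch of how method = "radix" treats them. The three-word vector is made up for illustration. Radix sorting uses C-locale (byte) order, so the accented word should land after all ASCII words whatever the session locale is.

```r
# Hypothetical word list; "\u00e9" is "é", written as an escape so the
# script does not depend on the encoding of the source file.
words <- c("\u00e9cole", "zebra", "ecole")

# Radix sorting ignores LC_COLLATE and orders strings by their bytes.
sorted <- sort(words, method = "radix")
print(sorted)
```

The resulting order is not linguistically natural, but it is the same on every machine, which is what matters for avoiding spurious diffs.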
OK, sounds good. Can you send a PR?
Currently, the word order in WORDLIST is locale-dependent, which can create large spurious diffs when multiple people contribute to the package but use different locales.

I see two solutions:

- Use method = "radix" in sort(). It is, to my knowledge, the only locale-independent sorting method.
- Set the locale explicitly before sorting.

The nice thing about the second option is that you can set the locale to the one specified in DESCRIPTION.

Please let me know if you'd like me to submit a PR for this.
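The two approaches discussed in this thread could be sketched roughly as follows. The helper names sort_words_radix() and sort_words_collate() are hypothetical and not part of any package. The sketch uses only base R and assumes the requested collation locale is installed. Mapping the language given in DESCRIPTION to a system locale name is left out, since locale names are platform-dependent.

```r
# Hypothetical helpers for illustration only.

# Option 1: byte-order sort that does not depend on the session locale.
sort_words_radix <- function(words) {
  sort(unique(words), method = "radix")
}

# Option 2: sort under an explicit collation locale and restore the
# previous one when the function exits.
sort_words_collate <- function(words, collate = "C") {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", collate)
  sort(unique(words))
}

words <- c("change", "Hugo's words", "b", "A", "a", "B")
radix_order <- sort_words_radix(words)
c_order <- sort_words_collate(words, "C")
print(radix_order)
print(c_order)
```

For ASCII-only input, both helpers give the same order when the second one is called with "C". They differ once a language-specific locale is passed to the second helper.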