ropensci / spelling

Tools for Spell Checking in R
https://docs.ropensci.org/spelling

Sort WORDLIST in a locale-independent way #48

Closed Bisaloo closed 4 years ago

Bisaloo commented 4 years ago

Currently, the word order in WORDLIST is locale-dependent, which can create large spurious diffs when multiple people contribute to the package but use different locales.

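I see two solutions:

- sort with method = "radix", which does not depend on the locale
- set the collation locale to a fixed value before sorting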

The nice thing about the second option is that you can set the locale to the one specified in DESCRIPTION.
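
A minimal sketch of what reading the locale from DESCRIPTION could look like. The helper name language_to_collate is hypothetical and not part of the spelling package, and the mapping from a Language value such as en-US to a locale name such as en_US.UTF-8 is an assumption: locale names vary between platforms and the locale has to be installed.

```r
# Hypothetical helper (not part of the spelling package): guess a collation
# locale from the Language field of a DESCRIPTION file.
language_to_collate <- function(desc_path = "DESCRIPTION") {
  # read.dcf() returns a character matrix; a missing field comes back as NA
  lang <- read.dcf(desc_path, fields = "Language")[[1, "Language"]]
  if (is.na(lang)) {
    return("C")  # assumption: fall back to the C locale when no language is set
  }
  # Assumption: "en-US" maps to "en_US.UTF-8"; real locale names are platform-specific
  paste0(sub("-", "_", lang, fixed = TRUE), ".UTF-8")
}

# Demo with a throwaway DESCRIPTION file
desc <- tempfile()
writeLines(c("Package: demo", "Language: en-US"), desc)
collate <- language_to_collate(desc)
```

The resulting name could then be passed to something like withr::with_collate() around the call to sort(), provided that locale exists on the machine.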

Please let me know if you'd like me to submit a PR for this.

jeroen commented 4 years ago

Hi, thanks for the suggestion. I'm not super familiar with locales. Can you give an example of how the sorting is different?

I'd rather not change locales because that has funky side effects on some systems (though C is usually safe).

Bisaloo commented 4 years ago

Here is a reprex:

test <- c(
  letters,
  LETTERS,
  "Hugo's words",
  "change"
)

library(withr)

with_locale(c("LC_COLLATE" = "fr_FR.UTF-8"), sort(test))
#>  [1] "a"            "A"            "b"            "B"            "c"           
#>  [6] "C"            "change"       "d"            "D"            "e"           
#> [11] "E"            "f"            "F"            "g"            "G"           
#> [16] "h"            "H"            "Hugo's words" "i"            "I"           
#> [21] "j"            "J"            "k"            "K"            "l"           
#> [26] "L"            "m"            "M"            "n"            "N"           
#> [31] "o"            "O"            "p"            "P"            "q"           
#> [36] "Q"            "r"            "R"            "s"            "S"           
#> [41] "t"            "T"            "u"            "U"            "v"           
#> [46] "V"            "w"            "W"            "x"            "X"           
#> [51] "y"            "Y"            "z"            "Z"

with_locale(c("LC_COLLATE" = "C"), sort(test))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

with_locale(c("LC_COLLATE" = "sk_SK.UTF-8"), sort(test))
#>  [1] "a"            "A"            "b"            "B"            "c"           
#>  [6] "C"            "d"            "D"            "e"            "E"           
#> [11] "f"            "F"            "g"            "G"            "h"           
#> [16] "H"            "Hugo's words" "change"       "i"            "I"           
#> [21] "j"            "J"            "k"            "K"            "l"           
#> [26] "L"            "m"            "M"            "n"            "N"           
#> [31] "o"            "O"            "p"            "P"            "q"           
#> [36] "Q"            "r"            "R"            "s"            "S"           
#> [41] "t"            "T"            "u"            "U"            "v"           
#> [46] "V"            "w"            "W"            "x"            "X"           
#> [51] "y"            "Y"            "z"            "Z"

Created on 2020-04-03 by the reprex package (v0.3.0)

There are also plenty of other examples where it can go wrong because of diacritics in names, but I can't find a good reprex for this right now.

For a simple real-life example, see https://github.com/ropensci/lightr/commit/515d193373b35d1faddc117c6606a97ca7a32c74#diff-89da0e7dae7c72fd9541f184b5112343L13-L15 where OceanOptics and O'Hanlon swapped positions. This is for a simple package but for larger ones, with long vignettes, it is more annoying.

EDIT: from ?Comparison:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.

Bisaloo commented 4 years ago

And if you think changing locales is a bad idea, method = "radix" is not as bad as I thought it would be. It's pretty good, even. I was expecting a somewhat random order, but it gives the same result as the C locale regardless of LC_COLLATE.

test <- c(
  letters,
  LETTERS,
  "Hugo's words",
  "change"
)

library(withr)

with_locale(c("LC_COLLATE" = "fr_FR.UTF-8"), sort(test, method = "radix"))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

with_locale(c("LC_COLLATE" = "C"), sort(test, method = "radix"))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

with_locale(c("LC_COLLATE" = "sk_SK.UTF-8"), sort(test, method = "radix"))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

Created on 2020-04-03 by the reprex package (v0.3.0)
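
As a minimal sketch of how this could be applied to a word list: the helper name sort_wordlist is hypothetical and this is not the spelling package's actual implementation. It drops empty entries and duplicates, then sorts with method = "radix" so the order does not depend on LC_COLLATE.

```r
# Hypothetical helper, not the spelling package's implementation:
# normalise a word list so that its order does not depend on LC_COLLATE.
sort_wordlist <- function(words) {
  # Drop empty strings and duplicates before sorting
  words <- unique(words[nzchar(words)])
  # method = "radix" sorts character vectors in C-locale order
  sort(words, method = "radix")
}

words <- c("change", "Hugo's words", "a", "A", "change")
sorted <- sort_wordlist(words)
sorted
#> [1] "A"            "Hugo's words" "a"            "change"
```

The trade-off is that uppercase entries sort before all lowercase ones, which looks less natural than a locale-aware order but is the same for every contributor.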

jeroen commented 4 years ago

OK, sounds good. Can you send a PR?