tidyverse / tidyups

21 stars 5 forks source link

Revise 003 to reflect that `dplyr_locale()` will default to the C locale #26

Closed DavisVaughan closed 2 years ago

DavisVaughan commented 2 years ago

This revises the 003 tidyup to reflect that we now think arrange() should default to the C locale, rather than "en" if stringi is installed with a noisy fallback to C if not.

This mainly comes from a realization that the current behavior would be very frustrating for package developers that rely on dplyr. If they call arrange() at all in their package, then that essentially makes stringi an Imports dependency in their package, even if they aren't sorting character vectors, since it would warn unconditionally about falling back to the C locale if stringi wasn't there. This could be extremely annoying when developing, since you might have stringi locally on your computer but it may not be installed on CI.

Defaulting to C works everywhere all the time, and if a developer explicitly sets .locale = "en" then that is a clear signal that they are opting into optional behavior that requires stringi, so they should Import it.

The main sections with changes are:

If you want to reread the updated tidyup without the diffs, you can go here https://github.com/tidyverse/tidyups/blob/22ad630c7576ae0b5b7627416240c66ff385e395/003-dplyr-radix-ordering.md

Corresponding dplyr PR https://github.com/tidyverse/dplyr/pull/6263

DavisVaughan commented 2 years ago

@markfairbanks @yutannihilation @mgirlich we'd appreciate if you could comment on whether or not this tidyup proposal still sounds good to you with this change. We are finally implementing these updates!

@markfairbanks and @mgirlich I have a feeling both of you will probably be happy with this, because defaulting to the C locale means that you will get much faster character sorting "out of the box" with dplyr. And since group_by() will also be changed to unconditionally use the C locale for sorting internally, this change will play nicely with that. Expert users can arrange() once up front by the grouping variables, which places groups close together in memory, and then any group_by() calls after that will be extremely fast (because the groups are already close together).

markfairbanks commented 2 years ago

The consideration for package development makes perfect sense, so this seems like a good idea to me 👍

And, as you mentioned, the performance is a big benefit as well.

yutannihilation commented 2 years ago

Sounds great to me!

mgirlich commented 2 years ago

This sounds great to me 👍