opensafely-core / r-docker

Docker image for running R scripts in OpenSAFELY
1 stars 3 forks source link

locale is set to nothing leading to different than expected sorting of vectors of character strings in R #99

Closed remlapmot closed 1 year ago

remlapmot commented 2 years ago

(sorry for the long report below and that it took me so long to spot this)

I'm sure by accident there the locale is set to nothing (is this referred to as unset? sorry I am not really familiar with locale grammar) in the r-docker container.

Unfortunately the sorting of character strings in I think the base R sort() and order() functions and hence dplyr::arrange() are all affected by locale.

Examples

docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "Sys.getenv('LANG')"
## [1] ""
% docker run --platform linux/amd64 --entrypoint /bin/bash -it ghcr.io/opensafely-core/r:latest

root@d446d1ea5760:/workspace# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

root@d446d1ea5760:/workspace# locale -a
C
C.UTF-8
POSIX
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "sort(c(head(letters), head(LETTERS)))"
##  [1] "A" "B" "C" "D" "E" "F" "a" "b" "c" "d" "e" "f"
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
    Rscript -e "Sys.getenv('LANG'); \
      sort(c(head(letters), head(LETTERS)))"
## [1] "en_US.UTF-8"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
    Rscript -e "Sys.setenv(LANG='en_GB.UTF-8'); \
      Sys.getenv('LANG'); \
      sort(c(head(letters), head(LETTERS)))"
## [1] "en_GB.UTF-8"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"

Fixes

docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "stringr::str_sort(c(head(letters), head(LETTERS)))"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"

Once the locales have been generated the other fixes are

docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "Sys.setenv(LANG='en_GB.UTF-8'); sort(c(head(letters), head(LETTERS)))"
ENV LANG="en_GB.UTF-8"

In R see the locales and sort helpfiles for more info

?locales
?sort
inglesp commented 2 years ago

Hi @remlapmot -- to help us prioritise fixing this, can you let me know what impact this has on users and their research?

remlapmot commented 2 years ago

This doesn't have much impact @inglesp.

The only effect of this I have seen in OpenSAFELY repos is that graphs can be produced with categories in different orders compared to when the locale is set. So nothing incorrect is produced, but the plot has the x- or y-axis categories in a different order compared to what the user was expecting. This gives the user a few seconds of confusion, but they either ignore it or regenerate the plot locally to obtain the preferred/expected sort ordering.

It affects the sort ordering of vectors of character strings with certain functions. Unfortunately those are the functions users are most likely to use, i.e., sort() and order() from base R and dplyr::arrange() but it can be overcome, without setting the locale, by using stringr::str_sort()/stringr::str_order() instead - although this requires the user to remember this. You can see the description of the locale argument to those stringr functions here.

In the Rocker containers they set the locale to en_US.UTF-8, e.g. in r-ver here and in r-base e.g. here. You can see their original Rocker issue about this here.

I went into a few other OpenSAFELY and datalab containers interactively and I couldn't really find one where the locale has been set. The locales were either empty or C.UTF-8.

inglesp commented 2 years ago

Thanks Tom. I expect we'll come to this when we revisit how the R docker image is built.

remlapmot commented 2 years ago

That's great.

Since I made the reverse-engineer fork of r-docker I have been rebasing it everytime the container has been updated (only 3 times since I made it) - so it's still up-to-date.

https://github.com/opensafely/reverse-engineer-r-docker/branches

I have made 4 active branches with slightly different approaches. Two of these are more stable, these are the 2 with names ending -fixed-urls because they use the locked package data URLs from RSPM.

https://packagemanager.rstudio.com/client/#/repos/1/overview

rspm-2

I had also been thinking about making another branch just using MRAN dated snapshot repos - but then all the packages would have to be built from source and so the Dockerfile would probably take several hours to build. The existing 4 branches take approx 30 mins to build because they download prebuilt Linux binary packages from RSPM.

remlapmot commented 1 year ago

Closing as fixed by #123 .