randy3k / radian

A 21 century R console
MIT License

Serialisation is different in radian to R #430

Open kendonB opened 10 months ago

kendonB commented 10 months ago

See here for background.

When calling digest::digest() with serializeVersion = 3L, radian gives a different result from regular R.

r$> callr::r(function() digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L))
[1] "051aee0c8529378c027b69f4bfcfa88a"

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"
randy3k commented 10 months ago

What version of R do you have? I cannot reproduce it on my computer.

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
Platform: x86_64-apple-darwin20 (64-bit)

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "051aee0c8529378c027b69f4bfcfa88a"

I have digest version 0.6.31.

kendonB commented 10 months ago
R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"
Platform: x86_64-w64-mingw32 (64-bit)

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"

r$> packageVersion("digest")
[1] '0.6.33'
r$> Sys.info()
       sysname        release        version       machine          
     "Windows"       "10 x64"  "build 22621"      "x86-64"    

I can't reproduce it on my Linux or WSL systems.

randy3k commented 10 months ago

Would you try this?

b <- serialize(mtcars, connection = NULL, version = 3L)
digest::digest(b, serialize = FALSE, skip = 14)

It should give the same result as

digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)

If they give different results, could you share the length of b and its first few bytes?
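
For context on the `skip = 14` above: in R's binary (XDR) serialization format, the first 14 bytes are a fixed prefix holding the format marker, the serialization version, the R version that wrote the data, and the minimum R version able to read it. A minimal sketch for decoding that prefix, where `decode_fixed_header` is a hypothetical helper name and not part of digest or radian:

```r
# Hypothetical helper: decode the fixed 14-byte prefix of a binary
# (XDR) serialize() result. Not part of digest or radian.
decode_fixed_header <- function(b) {
  stopifnot(is.raw(b), length(b) >= 14L)
  # Bytes 3-14 are three big-endian 32-bit integers.
  ints <- readBin(b[3:14], what = "integer", n = 3L, size = 4L, endian = "big")
  # R packs a version as major * 65536 + minor * 256 + patch.
  unpack <- function(v) {
    paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep = ".")
  }
  list(
    format = rawToChar(b[1:2]),            # format marker
    serialize_version = ints[1],
    writer_r_version = unpack(ints[2]),
    min_reader_r_version = unpack(ints[3])
  )
}

b <- serialize(mtcars, connection = NULL, version = 3L)
str(decode_fixed_header(b))
```

Anything after byte 14 is still hashed when `skip = 14`, so a difference later in the stream changes the digest.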

kendonB commented 10 months ago

Same results

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"

r$>

r$> b <- serialize(mtcars, connection = NULL, version = 3L)
    digest::digest(b, serialize = FALSE, skip = 14)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"
kendonB commented 10 months ago

Is this a clue? radian:

r$> length(serialize(mtcars, connection = NULL, version = 3L))
[1] 3808

Rterm:

> length(serialize(mtcars, connection = NULL, version = 3L))
[1] 3807
randy3k commented 10 months ago
b <- serialize(mtcars, connection = NULL, version = 3L)
digest::digest(b, serialize = FALSE, skip = 14)

gives the same results in both Rterm and radian? That is a bit odd, since the lengths of b are different.

Could you also report the following on both Rterm and radian?

Sys.getenv("LANG")
Sys.getlocale()
l10n_info()
randy3k commented 10 months ago

For some reason, I cannot reproduce it on Windows 11 running on a virtual machine. Did you try a newer version of radian?

kendonB commented 10 months ago

gives the same results in both Rterm and radian? It is a bit odd since the length of b are different.

No, those differ. The outputs only match after skipping the first 23 bytes for Rterm / 24 bytes for radian:

Rterm:

> skip_base <- 21
> digest::digest(serialize(mtcars, connection = NULL, version = 3L), serialize = FALSE, skip = skip_base)
[1] "c85aa57f14d5e067930bf841688b5477"
> skip_base <- 22
> digest::digest(serialize(mtcars, connection = NULL, version = 3L), serialize = FALSE, skip = skip_base)
[1] "d3bcef08916358ff8885d327b564425b"

> serialize(mtcars, connection = NULL, version = 3L)[1:30]
 [1] 58 0a 00 00 00 03 00 04 03 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 03 13 00 00 00

radian:

r$> skip_base <- 21
    digest::digest(serialize(mtcars, connection = NULL, version = 3L), serialize = FALSE, skip = skip_base + 1)
[1] "1ff1d90cff7b8842217d4b0dd62d785a"

r$> skip_base <- 22
    digest::digest(serialize(mtcars, connection = NULL, version = 3L), serialize = FALSE, skip = skip_base + 1)
[1] "d3a7b100f59f32e8719a9706ac8154f3"

r$> serialize(mtcars, connection = NULL, version = 3L)[1:30]
 [1] 58 0a 00 00 00 03 00 04 03 01 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 03 13 00 00

Rterm:

> Sys.getenv("LANG")
[1] "en_US.UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=English_New Zealand.utf8;LC_CTYPE=English_New Zealand.utf8;LC_MONETARY=English_New Zealand.utf8;LC_NUMERIC=C;LC_TIME=English_New Zealand.utf8"
> l10n_info()
$MBCS
[1] TRUE

$`UTF-8`
[1] TRUE

$`Latin-1`
[1] FALSE

$codepage
[1] 65001

$system.codepage
[1] 65001

radian:

r$> Sys.getenv("LANG")
    Sys.getlocale()
    l10n_info()
[1] "en_US.UTF-8"
[1] "LC_COLLATE=English_New Zealand.1252;LC_CTYPE=English_New Zealand.1252;LC_MONETARY=English_New Zealand.1252;LC_NUMERIC=C;LC_TIME=English_New Zealand.1252"
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252

$system.codepage
[1] 1252
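
The two byte dumps above differ only in the native-encoding field that serialization version 3 writes after the 14-byte prefix: a 4-byte big-endian length followed by the encoding name. A minimal sketch that reads that field, using the first 24 bytes of each dump as input; `native_encoding_field` and `to_raw` are hypothetical helper names:

```r
# Hypothetical helper: read the native-encoding field from a version 3
# binary serialize() result. Bytes 15-18 hold the name length and the
# name itself follows.
native_encoding_field <- function(b) {
  stopifnot(is.raw(b), length(b) >= 18L)
  n <- readBin(b[15:18], what = "integer", n = 1L, size = 4L, endian = "big")
  stopifnot(length(b) >= 18L + n)
  list(
    encoding = rawToChar(b[18L + seq_len(n)]),
    header_bytes = 18L + n  # total header length, including the name
  )
}

# Convert a space-separated hex dump into a raw vector.
to_raw <- function(s) as.raw(strtoi(strsplit(s, " ")[[1]], base = 16L))

# First 24 bytes of the Rterm and radian dumps from this thread.
rterm  <- to_raw("58 0a 00 00 00 03 00 04 03 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00")
radian <- to_raw("58 0a 00 00 00 03 00 04 03 01 00 03 05 00 00 00 00 06 43 50 31 32 35 32")

str(native_encoding_field(rterm))   # encoding "UTF-8", header_bytes 23
str(native_encoding_field(radian))  # encoding "CP1252", header_bytes 24
```

The one-byte difference in the encoding name accounts for the 3807 versus 3808 lengths reported earlier.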
kendonB commented 10 months ago

I have:

PS C:\Users\KennyBell> radian --version
radian version: 0.6.5
r executable: C:\PROGRA~1\R\R-43~1.1\bin\R
r version: 4.3.1
python executable: C:\Users\KennyBell\anaconda3\python.exe
python version: 3.10.9
kendonB commented 10 months ago

Still the same on the latest version:

PS C:\Users\KennyBell> radian --version
radian version: 0.6.6
r executable: C:\PROGRA~1\R\R-43~1.1\bin\R
r version: 4.3.1
python executable: C:\Users\KennyBell\anaconda3\python.exe
python version: 3.10.9
PS C:\Users\KennyBell> radian
R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"
Platform: x86_64-w64-mingw32 (64-bit)

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"

I wonder if it's Anaconda messing with something.

randy3k commented 10 months ago

I think I have figured it out. It is a locale thing. It is very tricky to get the locale set right since Python doesn't support the native UTF-8 code page; see https://github.com/randy3k/radian/issues/269. We will need to "force" Python to use the UTF-8 code page.

r$> Sys.setlocale(locale = "English_New Zealand.1252")
[1] "LC_COLLATE=English_New Zealand.1252;LC_CTYPE=English_New Zealand.1252;LC_M"

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"

r$> Sys.setlocale(locale = "English_New Zealand.utf8")
[1] "LC_COLLATE=English_New Zealand.utf8;LC_CTYPE=English_New Zealand.utf8;LC_M"
Warning message:
In Sys.setlocale(locale = "English_New Zealand.utf8") :
  using locale code page other than 1252 may cause problems

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "051aee0c8529378c027b69f4bfcfa88a"
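
In the byte dumps earlier in this thread, the locale changes only the native-encoding field of the header. One possible workaround for a hash that does not depend on the session locale is therefore to drop the whole version 3 header before hashing. This is a rough sketch, assuming the object holds only ASCII strings (as mtcars does); `strip_serialize_header` is a hypothetical helper, and the resulting hash will not match the default output of digest::digest():

```r
# Hypothetical helper: remove the full version 3 header (14-byte prefix,
# 4-byte name length, encoding name) from a binary serialize() result.
strip_serialize_header <- function(b) {
  stopifnot(is.raw(b), length(b) >= 18L)
  n <- readBin(b[15:18], what = "integer", n = 1L, size = 4L, endian = "big")
  b[-seq_len(18L + n)]
}

payload <- strip_serialize_header(
  serialize(mtcars, connection = NULL, version = 3L)
)

# Hash the payload only if the digest package is installed.
if (requireNamespace("digest", quietly = TRUE)) {
  print(digest::digest(payload, serialize = FALSE, skip = 0))
}
```

Strings stored in the native encoding may still serialize differently across locales, so this only addresses the header difference.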
kendonB commented 10 months ago

I can reproduce this using a standard Python install on Windows 11:

(base) PS C:\Python311\Scripts> .\radian.exe
R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"
Platform: x86_64-w64-mingw32 (64-bit)

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"

r$> exit()
(base) PS C:\Python311\Scripts> .\radian.exe --version
radian version: 0.6.6
r executable: C:\PROGRA~1\R\R-43~1.1\bin\R
r version: 4.3.1
python executable: C:\Python311\python.exe
python version: 3.11.4
randy3k commented 10 months ago

I figured out why my radian doesn't produce the error by default. I was using "Git for bash". For some reason, it has "correctly" forced Python to use the UTF-8 code page.

Git for bash (note the warning message)

$ radian
During startup - Warning message:
Using locale code page other than 1252 may cause problems.
R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"
Platform: x86_64-w64-mingw32 (64-bit)

r$> l10n_info()$codepage
[1] 65001

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "051aee0c8529378c027b69f4bfcfa88a"

Windows Terminal

PS C:\Users\Randy\Desktop> radian
R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"
Platform: x86_64-w64-mingw32 (64-bit)

r$>  l10n_info()$codepage
[1] 1252

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "504a0ceaac24e5bd4f54c1b2ebd32e7a"

r$> Sys.setlocale(locale = "English_New Zealand.utf8")
[1] "LC_COLLATE=English_New Zealand.utf8;LC_CTYPE=English_New Zealand.utf8;LC_MONETARY=English_New Zealand.ut"
Warning message:
In Sys.setlocale(locale = "English_New Zealand.utf8") :
  using locale code page other than 1252 may cause problems

r$> digest::digest(mtcars, serialize = TRUE, serializeVersion = 3L)
[1] "051aee0c8529378c027b69f4bfcfa88a"

Edit: It seems that this is because Git for bash has set the environment variable LC_CTYPE = 'en_US.UTF-8'.

randy3k commented 10 months ago

I think a solution is to always force LC_CTYPE to en_US.UTF-8, see 60ccb0a.

randy3k commented 10 months ago

radian 0.6.7 is out.

randy3k commented 8 months ago

Unfortunately, changing the locale breaks the plot() function. We might have to revert the change here.