tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

Talk about string encoding #1449

Open hadley opened 1 year ago

hadley commented 1 year ago

Formerly in R4DS


String encoding

When working with non-English text another common challenge is file encodings. To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():

charToRaw("Hadley")

Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it's the American Standard Code for Information Interchange.
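As a small illustration of the byte-to-character mapping described above, the following base R sketch round-trips the same string through its raw bytes. The variable names are only illustrative.

```r
# Inspect the bytes behind an ASCII string and reverse the mapping.
bytes <- charToRaw("Hadley")
print(bytes)                     # [1] 48 61 64 6c 65 79

# The same bytes as decimal integers; every ASCII byte is below 128.
ints <- as.integer(bytes)
all_ascii <- all(ints < 128L)

# rawToChar() goes the other way: bytes back to a string.
roundtrip <- rawToChar(bytes)
print(roundtrip)                 # [1] "Hadley"
```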

Things aren't so easy for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte b1 is "±", but in Latin2, it's "ą"! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emoji.
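To make the Latin1 versus Latin2 difference concrete, here is a minimal base R sketch that decodes the single byte b1 under both encodings with iconv(). It assumes the platform's iconv recognises the names "ISO-8859-1" and "ISO-8859-2", which is common but not guaranteed; iconvlist() shows what is available.

```r
# One byte, two interpretations. The byte is built from its hex value so that
# the example does not depend on how this script itself is encoded.
b1 <- rawToChar(as.raw(0xb1))

# Assumption: both encoding names are supported by the local iconv.
as_latin1 <- iconv(b1, from = "ISO-8859-1", to = "UTF-8")
as_latin2 <- iconv(b1, from = "ISO-8859-2", to = "UTF-8")

# Unicode code points of the decoded characters.
cp_latin1 <- utf8ToInt(as_latin1)   # 177, i.e. U+00B1 (plus-minus sign)
cp_latin2 <- utf8ToInt(as_latin2)   # 261, i.e. U+0105 (a with ogonek)

# In UTF-8, both characters need two bytes instead of one.
utf8_latin1 <- charToRaw(as_latin1) # c2 b1
utf8_latin2 <- charToRaw(as_latin2) # c4 85
```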

readr uses UTF-8 everywhere. This is a good default, but it will fail for data produced by older systems that don't use UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you'll get complete gibberish. For example:

#| message: false
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)

x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)

To fix the problem you need to specify the encoding via the locale argument:

#| message: false
read_csv(x1, locale = locale(encoding = "Latin1"))

read_csv(x2, locale = locale(encoding = "Shift-JIS"))
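Outside of readr, the same kind of problem can be detected and repaired in base R. The sketch below is not how readr handles encodings internally; it only illustrates the idea with validUTF8() and iconv(), using the Latin1 bytes of "El Niño" as an assumed example input.

```r
# "El Ni\xf1o" as Latin1 bytes, built explicitly from hex values.
latin1_bytes <- as.raw(c(0x45, 0x6c, 0x20, 0x4e, 0x69, 0xf1, 0x6f))
broken <- rawToChar(latin1_bytes)

# The byte f1 on its own is not valid UTF-8, so the check fails.
is_valid_before <- validUTF8(broken)    # FALSE

# Re-encode, telling iconv() what the bytes really are.
fixed <- iconv(broken, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(fixed)      # TRUE

# The repaired string now matches the intended text.
matches <- identical(fixed, "El Ni\u00f1o")
```

Note that validUTF8() can only say that a string is not UTF-8; it cannot say which encoding it actually is.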

How do you find the correct encoding? If you're lucky, it'll be included somewhere in the data documentation. Unfortunately, that's rarely the case, so readr provides guess_encoding() to help you figure it out. It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start. Expect to try a few different encodings before you find the right one.

guess_encoding(x1)
guess_encoding(x2)

Encodings are a rich and complex topic, and we've only scratched the surface here. If you'd like to learn more we recommend reading the detailed explanation at http://kunststube.net/encoding/.

hadley commented 1 year ago

Scratch that; found a new home.

davidrsch commented 1 year ago

Hey, I am trying to render the work-in-progress version of the second edition of R4DS, and it breaks on these lines in the strings.qmd file:

guess_encoding(x1)
guess_encoding(x2)

I thought about opening an issue in that repository, but I tried it outside the project and it failed anyway, so I think it might be an issue with the guess_encoding() function or some configuration issue on my side (my OS is Windows 10, I am using RStudio version 2022.12.0-353 and R version 4.2.2 (2022-10-31 ucrt) -- "Innocent and Trusting", and I updated to the development version of readr).

guess_encoding(x1)
guess_encoding(x2)
guess_encoding("text\nEl Ni\xf1")
guess_encoding("text\n\x82")

All of the previous lines return the error:

Error in file.exists(x) : file name conversion problem -- name too long?

The traceback is:

5: file.exists(x)
4: empty_file(file)
3: read_lines_raw(file, n_max = n_max)
2: unlist(read_lines_raw(file, n_max = n_max))
1: guess_encoding("text\n\x82")

Why did I try it with the lines guess_encoding("text\nEl Ni\xf1") and guess_encoding("text\n\x82")? Because I noticed that it fails as soon as it detects an encoding other than ASCII, UTF-8 or Unicode.

x3 <- "text\nR is a great place to start your data science journey!"
x4 <- "a\n\x52\x20\x69\x73\x20\x61\x20\x67\x72\x65\x61\x74\x20\x70\x6c\x61\x63\x65\x20\x74\x6f\x20\x73\x74\x61\x72\x74\x20\x79\x6f\x75\x72\x20\x64\x61\x74\x61\x20\x73\x63\x69\x65\x6e\x63\x65\x20\x6a\x6f\x75\x72\x6e\x65\x79\x21"
guess_encoding(x3)
guess_encoding(x4)

It may fail and detect both as ASCII (I guess that's because of the "a\n" I had to put at the beginning of x4 so that it wouldn't be read as a path), but it doesn't return an error. Any advice about why I am getting this behaviour would be great.
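One possible reading of the ASCII result, sketched in base R: hex escapes below \x80 are just another way of typing ASCII characters, so the escaped string holds the same bytes as the plain one. The shortened strings here are illustrative stand-ins for x3 and x4.

```r
# A plain string and the same text written with hex escapes.
plain   <- "R is a great place"
escaped <- "\x52\x20\x69\x73\x20\x61\x20\x67\x72\x65\x61\x74\x20\x70\x6c\x61\x63\x65"

# The two are the same string, byte for byte.
same_string <- identical(plain, escaped)
same_bytes  <- identical(charToRaw(plain), charToRaw(escaped))

# Every byte is below 128, so there is nothing non-ASCII to detect.
only_ascii <- all(as.integer(charToRaw(escaped)) < 128L)
```

If that reading is right, an ASCII guess for both inputs is expected rather than a misdetection.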

hadley commented 1 year ago

Oh yeah, I've recently discovered that guess_encoding() is a bit flakey; I still don't know of any better tool, but we should tweak the coverage here.

davidrsch commented 1 year ago

Hey, I think I might have found the reason. As seen in the traceback, the error happens in file.exists(x):

5: file.exists(x)
4: empty_file(file)
3: read_lines_raw(file, n_max = n_max)
2: unlist(read_lines_raw(file, n_max = n_max))
1: guess_encoding("text\n\x82")

This is a base R function, so the issue might be that R (maybe only on Windows) cannot handle a path whose encoding is something other than ASCII, UTF-8 or Unicode. I tested this by writing x1 and x2 to files, after which guess_encoding() worked.

x1 <- "El Ni\xf1o was particularly bad this year"
write.csv(x1, "E:/x1.csv")
guess_encoding("E:/x1.csv")
## # A tibble: 1 × 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1       0.48
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
write.csv(x2, "E:/x2.csv")
guess_encoding("E:/x2.csv")
## # A tibble: 1 × 2
##   encoding confidence
##   <chr>         <dbl>
## 1 KOI8-R          0.3

So a possible solution could be to add the file-writing lines to the code in the book, and to remove the claim that guess_encoding() can read character strings from its description until the issue with file.exists() is solved.
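A portable sketch of that workaround, assuming readr and stringi are installed: it writes the bytes to a temporary file rather than a fixed drive path, and also tries passing a raw vector, which the guess_encoding() documentation lists as an accepted input. Whether this avoids the error on every platform is not something this thread confirms, and the guesses themselves depend on the detector, so no output is shown.

```r
library(readr)

# Build the Latin1 bytes explicitly instead of typing a non-UTF-8 string literal.
latin1_bytes <- c(
  charToRaw("text\nEl Ni"),
  as.raw(0xf1),
  charToRaw("o was particularly bad this year\n")
)

# Option 1: write the bytes to a temporary file and pass its path.
path <- tempfile(fileext = ".csv")
writeBin(latin1_bytes, path)
guess_from_file <- guess_encoding(path)

# Option 2: pass the raw vector itself, so nothing is treated as a file name.
guess_from_raw <- guess_encoding(latin1_bytes)

# Clean up the temporary file.
unlink(path)
```

Building the input from raw bytes keeps the script itself pure ASCII, so it behaves the same regardless of the encoding the file is saved in.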

I think a trick that could make guess_encoding() work as intended might be to add an argument chat_strin = F, so that it reads x as a path by default and as a character string if the user changes it to TRUE. I haven't checked the code of guess_encoding() (so I don't know what changes are needed to make the suggested argument work), and I don't know if I explained the idea properly. I hope this helps.