statsmaths closed this issue 6 years ago
Nothing you can do, since the Encoding()
bit cannot distinguish ASCII from UTF-8, and the bit will not stick even if you set it:
> txt <- "€ euro"
> Encoding(txt)
[1] "UTF-8"
> txt2 <- "euro"
> Encoding(txt2)
[1] "unknown"
> Encoding(txt2) <- "UTF-8"
> Encoding(txt2)
[1] "unknown"
I think there are two variables here: the sequence of bytes and the declared encoding.
Does that make sense?
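To make the distinction concrete, here is a small illustration (not from the original thread) showing both variables for the same string, using a Unicode escape for the euro sign:

```r
txt <- "\u20ac euro"   # "€ euro"

# Variable 1: the raw byte sequence (UTF-8 encodes U+20AC as e2 82 ac)
charToRaw(txt)

# Variable 2: the declared encoding mark
Encoding(txt)
#> [1] "UTF-8"
```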
Reading the stringi manual a bit more:
"native (a.k.a. unknown in Encoding; quite a misleading name: no explicit encoding mark) -- for strings that are assumed to be in your platform's native (default) encoding. This can represent UTF-8 if you are an OS X user, or some 8-bit Windows code page, for example. The native encoding used by R may be determined by examining the LC_CTYPE category, see Sys.getlocale."
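So what an "unknown" string is assumed to contain depends on the session locale, which can be inspected directly (a small illustration, not part of the original discussion):

```r
# The platform's native encoding, which "unknown" strings are
# assumed to use:
Sys.getlocale("LC_CTYPE")   # e.g. "en_US.UTF-8"

# l10n_info() reports whether the current locale is UTF-8:
l10n_info()
```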
I guess that means that a string identified as unknown might actually even be UTF-8! Has anyone ever seen that happen? If so, we need to check not only the declared encoding and the bytes, but also the native encoding in use...
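For what it's worth, since the stringi manual is quoted above: stringi's stri_enc_mark() gives a finer-grained answer than Encoding(), reporting ASCII strings as "ASCII" rather than lumping them in with "native"/unknown (a sketch; assumes the stringi package is installed):

```r
library(stringi)

x <- c("euro", "\u017c")
Encoding(x)        # base R: "unknown" "UTF-8"
stri_enc_mark(x)   # stringi: "ASCII"  "UTF-8"
```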
After a long discussion, we decided not to check string encodings, given the various difficulties described above.
On my machine, the following correctly reports that the encoding is UTF-8:
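For example (a reconstruction, since the original snippet did not survive; it assumes a literal containing the non-ASCII "ż" mentioned below):

```r
txt <- "\u017c"   # "ż"
Encoding(txt)
#> [1] "UTF-8"
```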
However, it seems that if a string is ASCII, R will not tag it as such. For the following I get:
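For example (again a reconstruction of the missing snippet, assuming a pure-ASCII literal):

```r
txt2 <- "euro"
Encoding(txt2)
#> [1] "unknown"

# Appending a non-ASCII character makes the UTF-8 mark appear:
Encoding(paste0(txt2, " \u017c"))
#> [1] "UTF-8"
```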
In other words, unless you include the non-ASCII "ż", R will not preserve the UTF-8 marking. I cannot find a way to force R to mark strings representable in ASCII as UTF-8. Am I missing something here?
How should we handle this when checking the validity of the corpus and tokens objects, which we said had to be in UTF-8? If R marked these strictly as ASCII, that would be fine, since ASCII is a subset of UTF-8, but "unknown" could really be anything, depending on a user's locale setup.
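One possible compromise for such a validity check (a hypothetical sketch, not the approach the project settled on): accept any string whose bytes are valid UTF-8, using base R's validUTF8() (available since R 3.3.0). This covers both UTF-8-marked strings and pure-ASCII "unknown" strings, while still rejecting stray non-UTF-8 bytes.

```r
# Hypothetical helper -- an illustration, not the project's actual code
is_valid_utf8 <- function(x) validUTF8(x)

is_valid_utf8(c("euro", "\u017c"))   # ASCII and UTF-8: both accepted
#> [1] TRUE TRUE
is_valid_utf8("\xfa")                # a lone latin1 byte: rejected
#> [1] FALSE
```

The remaining blind spot is "unknown" strings that are genuinely in some non-UTF-8 native encoding whose bytes happen to form valid UTF-8, which ties back to the locale issue above.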