Encoding is "unknown" when ASCII

ropenscilabs / tif

Text Interchange Formats

https://docs.ropensci.org/tif


Encoding is "unknown" when ASCII #2

Closed statsmaths closed 6 years ago

statsmaths commented 7 years ago

On my machine, the following correctly yields that the encoding is UTF-8:

library(stringi)
x <- stri_encode("Zażółć gęślą jaźń", "", "UTF-8")
Encoding(x) # "UTF-8"

However, it seems that if a string is ASCII, R will not tag it as such. For the following I get:

Encoding(substr(x, 1, 1)) # "unknown"
Encoding(substr(x, 1, 2)) # "unknown"
Encoding(substr(x, 1, 3)) # "UTF-8"

In other words, the UTF-8 mark is preserved only once the substring includes the non-ASCII "ż". I cannot find a way to force R to mark strings that are representable in ASCII as UTF-8. Am I missing something here?

How should we handle this when checking the validity of the corpus and tokens objects, which we said had to be in UTF-8? If R were marking these strictly as ASCII, that would be okay, since ASCII is a subset of UTF-8, but "unknown" could really be anything, depending on the user's local setup.

kbenoit commented 7 years ago

Nothing you can do, since the Encoding() bit cannot distinguish ASCII from UTF-8, and the bit will not stick even if you set it.

> txt <- "€ euro"
> Encoding(txt)
[1] "UTF-8"
> txt2 <- "euro"
> Encoding(txt2)
[1] "unknown"
> Encoding(txt2) <- "UTF-8"
> Encoding(txt2)
[1] "unknown"
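
As far as I understand the R documentation for Encoding(), ASCII strings are never marked because their bytes are the same in every encoding R supports. A small base-R sketch of that point (the variable names are only illustrative):

```r
txt2 <- "euro"

# Every byte is below 128, so the string is plain 7-bit ASCII.
bytes <- as.integer(charToRaw(txt2))
all(bytes < 128L)

# Converting to UTF-8 changes neither the bytes nor the declared encoding.
identical(charToRaw(txt2), charToRaw(enc2utf8(txt2)))
Encoding(enc2utf8(txt2))

# The bytes are nevertheless valid UTF-8, which is what matters for tif.
validUTF8(txt2)
```

So an "unknown" mark on a pure-ASCII string is harmless; the ambiguity only arises for unmarked strings that contain bytes of 128 or above.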

adamobeng commented 7 years ago

I think there are two variables here: the sequence of bytes and the declared encoding.

If the declared encoding is 'UTF-8':
- and the sequence of bytes is valid UTF-8 (see stri_enc_isutf8), then it's valid
- and the sequence of bytes is not valid UTF-8, then it's invalid
If the declared encoding is 'unknown':
- and all bytes in the sequence are less than 128, i.e. 7-bit ASCII (see stri_enc_mark), then we can assume it's ASCII and thus embeddable in UTF-8 and valid
- and not all bytes in the sequence are less than 128, then it could be either UTF-8 or some non-UTF-8 extended ASCII, so it's invalid
If the declared encoding is anything else, then it's invalid
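
The rule above could be sketched roughly as follows. This is only an illustration: check_encoding is a hypothetical name, not a function in tif, and it uses base R (validUTF8, charToRaw) in place of the stringi functions mentioned above. NA values are not treated specially.

```r
# Hypothetical helper, not part of tif: returns TRUE for each element of x
# that the rule above would accept as valid UTF-8.
check_encoding <- function(x) {
  stopifnot(is.character(x))
  declared <- Encoding(x)
  # TRUE when every byte of the element is below 128 (7-bit ASCII).
  is_ascii <- vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
  ifelse(
    declared == "UTF-8",
    validUTF8(x),                                  # marked UTF-8: bytes must be valid
    ifelse(declared == "unknown", is_ascii, FALSE) # unmarked: ASCII only; other marks: invalid
  )
}

ok_ascii <- "euro"                 # unmarked, pure ASCII
ok_utf8  <- "\u20ac euro"          # marked UTF-8, valid bytes
bad_raw  <- "caf\xe9"              # unmarked, contains a byte >= 128
bad_l1   <- "caf\xe9"
Encoding(bad_l1) <- "latin1"       # declared as something other than UTF-8
bad_utf8 <- "caf\xe9"
Encoding(bad_utf8) <- "UTF-8"      # marked UTF-8, but the bytes are not valid UTF-8

check_encoding(c(ok_ascii, ok_utf8, bad_raw, bad_l1, bad_utf8))
```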

Does that make sense?

adamobeng commented 7 years ago

Reading the stringi manual a bit more:

native (a.k.a. unknown in Encoding; quite a misleading name: no explicit encoding mark) -- for strings that are assumed to be in your platform's native (default) encoding. This can represent UTF-8 if you are an OS X user, or some 8-bit Windows code page, for example. The native encoding used by R may be determined by examining the LC_CTYPE category, see Sys.getlocale.

I guess that means that a string identified as unknown might actually even be UTF-8! Has anyone ever seen that happen? If so, we not only need to check the declared encoding and the bytes, but also the native encoding in use...
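
For what it's worth, base R can report whether the native encoding is UTF-8 via l10n_info(). A rough sketch of how that could feed into the check (the helper names are made up for illustration, and the result for non-ASCII input depends on the locale of the session):

```r
# Hypothetical helper: TRUE when R's native encoding is UTF-8.
native_is_utf8 <- function() isTRUE(l10n_info()[["UTF-8"]])

# Hypothetical helper: TRUE for unmarked ("unknown") elements that can
# safely be read as UTF-8. Pure ASCII always qualifies; non-ASCII input
# qualifies only if the native encoding is UTF-8 and the bytes are valid.
unmarked_is_utf8 <- function(x) {
  stopifnot(is.character(x))
  is_ascii <- vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
  Encoding(x) == "unknown" & validUTF8(x) & (is_ascii | native_is_utf8())
}

Sys.getlocale("LC_CTYPE")      # shows the locale that defines the native encoding
native_is_utf8()
unmarked_is_utf8("euro")       # ASCII, so accepted in any locale
unmarked_is_utf8("\u20ac")     # already marked UTF-8, so not an "unknown" string
```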

statsmaths commented 6 years ago

After a long discussion, we decided not to check string encodings, given the various difficulties described above.