tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Full support for foreign UTF-8 characters on non-UTF-8 locales #2469

Closed · krlmlr closed 7 years ago

krlmlr commented 7 years ago

With the move to rlang, symbols serve as containers for column names, at least temporarily. Unfortunately, the roundtrip character -> symbol -> character is lossy for foreign UTF-8 characters that are not representable in the current locale (#1950, https://github.com/hadley/rlang/pull/43). The following approach will provide almost full support for such column names with only limited effort:

  1. Column names are internally stored as strings with encoding information (already implemented in #2388).
  2. For all column names, the native representation is used whenever performing lookup. This should be done anyway, because all names in environments are in the native encoding, as @lionel- discovered recently. (We use environments in the hybrid evaluator.)
  3. When creating column names for output, the original UTF-8 representation is used.

This only requires that the conversion character -> symbol is lossless and injective for the data frames involved. We should therefore check for duplicate column names in the native encoding (but still report clashes for the original, unmangled names).
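The native-encoding lookup (point 2) and the injectivity check above can be sketched as follows. This is a hypothetical Python illustration, not dplyr's implementation: `to_native()` stands in for R's `enc2native()`, with latin-1 assumed as the non-UTF-8 native encoding, and unrepresentable characters serialised as `<U+xxxx>` the way R prints them.

```python
# Hypothetical sketch (not dplyr's code): simulate the lossy
# character -> native conversion on a non-UTF-8 locale and the
# duplicate check it requires.

def to_native(name, native_encoding="latin-1"):
    """Mimic R's enc2native(): characters not representable in the
    native encoding are serialised as <U+xxxx>."""
    return "".join(
        ch if ch.encode(native_encoding, "ignore") else "<U+%04X>" % ord(ch)
        for ch in name
    )

def check_duplicates(names):
    """The conversion must be injective for the names involved:
    report clashes using the original, unmangled names."""
    seen = {}
    for original in names:
        key = to_native(original)
        if key in seen:
            raise ValueError(
                f"duplicate column names after native conversion: "
                f"{seen[key]!r} and {original!r}"
            )
        seen[key] = original

# "\u5E78\u798F" (CJK) is not representable in latin-1, so it mangles:
print(to_native("\u5E78\u798F"))   # <U+5E78><U+798F>
print(to_native("caf\u00E9"))      # café survives: latin-1 can hold it

check_duplicates(["\u5E78\u798F", "x"])   # fine: still injective
try:
    # the serialised form collides with the real UTF-8 name
    check_duplicates(["\u5E78\u798F", "<U+5E78><U+798F>"])
except ValueError as e:
    print(e)
```

The second `check_duplicates()` call shows the artificial corner case: a column literally named `<U+5E78><U+798F>` collides with the mangled CJK name, which is exactly the clash the check must report.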

krlmlr commented 7 years ago

Base R mostly behaves the same: foreign UTF-8 characters and their <U+xxxx> counterparts are treated interchangeably. Only [ misbehaves.

x <- list(1)
names(x) <- "\u5E78\u798F"
x
#> $`<U+5E78><U+798F>`
#> [1] 1
x$"\u5E78\u798F"
#> [1] 1
x$"<U+5E78>\u798F"
#> [1] 1
x$"<U+5E78><U+798F>"
#> [1] 1
x["\u5E78\u798F"]
#> $`<U+5E78><U+798F>`
#> [1] 1
x["<U+5E78>\u798F"]
#> $<NA>
#> NULL
x["<U+5E78><U+798F>"]
#> $<NA>
#> NULL
x[["\u5E78\u798F"]]
#> [1] 1
x[["<U+5E78>\u798F"]]
#> [1] 1
x[["<U+5E78><U+798F>"]]
#> [1] 1
lionel- commented 7 years ago

@krlmlr does this happen because the [[ implementation translates the vector names to native before matching?

krlmlr commented 7 years ago

I guess that both column names and index are reencoded; the example x[["<U+5E78>\u798F"]] seems to confirm that. The behavior of [ is most likely a bug.

lionel- commented 7 years ago

So, in some sense, R's support for datasets containing Unicode code points is already broken on systems with a non-UTF-8 locale.

e.g. if you have vector names mixing the serialised Unicode code point and its UTF-8 representation, indexing by the serialised form might match the translated UTF-8 name instead.

krlmlr commented 7 years ago

"Broken" in this context probably means "works as well as possible given the circumstances", except that [ doesn't work (which may also be a feature of sorts).

lionel- commented 7 years ago

"works as well as possible given the circumstances"

Exactly. And maybe that's what dplyr should aim for as well, either with your deserialisation patch or the symbol table idea (for the symbol → string conversion issue).

krlmlr commented 7 years ago

The principal point of this ticket is that we don't seem to need the full roundtrip (i.e., we can skip the symbol → string part).

lionel- commented 7 years ago

The point of tidyeval is to simplify all NSE matters. This means translating from strings to symbols and back all the time. Most verbs expect symbols, so that's what you have to give them. And at the end we often need to overwrite or create columns, so we have to translate back to a string.

Making the system more complicated in all tidyverse packages just to avoid symbol → string is not the right approach if we have an alternative that works in 99% of cases (probably even closer to 100% than that).

krlmlr commented 7 years ago

we have to translate back to a string

We don't if we always keep the original string, but I agree that this makes the system more complicated.

I only planned to avoid symbol → string, but your comment got me thinking. We should really strive for simplicity. We could always transform sequences of the form <U+xxxx> to their UTF-8 equivalents (given that they really represent a valid code point), in both symbols (i.e., symbol("x")) and names (i.e., names2(x)). Can we always invoke the translation in names2(), and also provide a C entry point?
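The `<U+xxxx>` → UTF-8 translation proposed above amounts to a small deserialiser. A hedged Python sketch of the idea (the real implementation would live in R/C; `deserialise_unicode` is an illustrative name, and the 4-to-8-hex-digit pattern is an assumption about how the escapes appear):

```python
import re

# Match serialised escapes of the form <U+xxxx> (4-8 hex digits assumed).
_SERIALISED = re.compile(r"<U\+([0-9A-Fa-f]{4,8})>")

def deserialise_unicode(name):
    """Rewrite every <U+xxxx> escape back to the character it represents,
    provided the hex digits form a valid Unicode scalar value."""
    def repl(m):
        code = int(m.group(1), 16)
        if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return m.group(0)   # not a valid scalar value: keep the escape
        return chr(code)
    return _SERIALISED.sub(repl, name)

print(deserialise_unicode("<U+5E78><U+798F>"))  # the original CJK name
print(deserialise_unicode("plain_name"))        # untouched
print(deserialise_unicode("<U+D800>"))          # surrogate: left as-is
```

The validity check matters: only escapes that "really represent a valid code point" are rewritten, so a column that happens to contain an out-of-range or surrogate escape is left alone.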

lionel- commented 7 years ago

There doesn't seem to be any point in deserialising strings in all functions dealing with strings (because why stop at names?). We know things can go wrong only for string → symbol on systems with a non-UTF-8 locale, so we should only try to fix it for symbol → string, and only on those systems.

lionel- commented 7 years ago

In fact I think that if we do something for this issue, it should necessarily be a palliative, as a general solution can only be implemented in R itself.

In that spirit, the parallel symbol table idea seems to be the right one. We only add symbols to it when we detect a translation problem. This way we have something that works according to expectations for virtually all cases. You'd have to create an artificial case to make this solution misbehave.
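The parallel symbol table might look roughly like the following Python sketch (hypothetical class and method names; latin-1 assumed as the native encoding). The key property is that an entry is recorded only when the native translation actually mangles the name, and lookups fall back to the table on the way back:

```python
# Hypothetical sketch of the parallel symbol table idea (not rlang's code).

class SymbolTable:
    def __init__(self, native_encoding="latin-1"):
        self._encoding = native_encoding
        self._table = {}   # mangled native name -> original UTF-8 string

    def _mangle(self, name):
        # mimic serialisation of unrepresentable characters as <U+xxxx>
        return "".join(
            ch if ch.encode(self._encoding, "ignore") else "<U+%04X>" % ord(ch)
            for ch in name
        )

    def as_symbol(self, name):
        """string -> symbol: record an entry only when we detect a
        translation problem."""
        mangled = self._mangle(name)
        if mangled != name:
            self._table[mangled] = name
        return mangled

    def as_string(self, symbol):
        """symbol -> string: recover the original where a loss was
        recorded; pass everything else through untouched."""
        return self._table.get(symbol, symbol)

tbl = SymbolTable()
sym = tbl.as_symbol("\u5E78\u798F")
print(sym)                  # <U+5E78><U+798F>
print(tbl.as_string(sym))   # roundtrip recovered
print(tbl.as_string("x"))   # no entry: passes through
```

As noted above, the one way to make this misbehave is artificial: a column whose literal name is `<U+5E78><U+798F>` would alias the recorded entry, which is precisely the contrived case the comment concedes.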

krlmlr commented 7 years ago

It looks like we need to fix names2(), because this is where lookup can fail. We might also need to create a replacement for ls(). I wouldn't go as far as fixing all string handling functions, because character vectors are not usually involved in lossy transitions unless explicitly asked for (e.g., enc2native()).

Fixing names2() and env_list() eliminates the need for a parallel symbol table.

krlmlr commented 7 years ago

env_names() perhaps.

krlmlr commented 7 years ago

Closing in favor of #2471.