Closed krlmlr closed 7 years ago
Base R mostly behaves the same: foreign UTF-8 characters and their <U+xxxx>
counterparts are treated interchangeably. Only [
misbehaves.
x <- list(1)
names(x) <- "\u5E78\u798F"
x
#> $`<U+5E78><U+798F>`
#> [1] 1
x$"\u5E78\u798F"
#> [1] 1
x$"<U+5E78>\u798F"
#> [1] 1
x$"<U+5E78><U+798F>"
#> [1] 1
x["\u5E78\u798F"]
#> $`<U+5E78><U+798F>`
#> [1] 1
x["<U+5E78>\u798F"]
#> $<NA>
#> NULL
x["<U+5E78><U+798F>"]
#> $<NA>
#> NULL
x[["\u5E78\u798F"]]
#> [1] 1
x[["<U+5E78>\u798F"]]
#> [1] 1
x[["<U+5E78><U+798F>"]]
#> [1] 1
@krlmlr does this happen because the [[
implementation translates the vector names to native before matching?
I guess that both column names and index are reencoded; the example x[["<U+5E78>\u798F"]]
seems to confirm that. The behavior of [
is most likely a bug.
so, in some sense, R support for datasets containing unicode points is already broken in systems with a non-UTF8 locale.
e.g. if you have vector names mixing the serialised unicode point and its UTF-8 representation, indexing the unicode point might match the translated UTF-8 name instead.
"Broken" in this context probably means "works as well as possible given the circumstances", except that [
doesn't work (which may also be a feature of sorts).
"works as well as possible given the circumstances"
Exactly. And maybe that's what dplyr should aim for as well, either with your deserialisation patch or the symbol table idea (for the symbol → string conversion issue).
The principal point of this ticket is that we don't seem to need the full roundtrip (i.e., we can skip the symbol → string part).
The point of tidyeval is to simplify all NSE matters. This means translating from strings to symbols and back all the time. Most verbs expect symbols, so that's what you got to give them. And at the end we often need to overwrite or create columns, so we have to translate back to a string.
Making the system more complicated in all tidyverse packages just to avoid string → symbol
is not the right approach if we have an alternative that will work in 99% of cases (probably even closer to 1 than that).
we have to translate back to a string
We don't if we always keep the original string, but I agree that this makes the system more complicated.
I only planned to avoid symbol → string, but your comment got me thinking. We should really strive for simplicity. We could always transform sequences of the form <U+xxxx>
to their UTF-8 equivalents (given that they really represent a valid code point), in both symbols (i.e., symbol("x")
) and names (i.e., names2(x)
). Can we always invoke the translation in names2()
, and also provide a C entry point?
There doesn't seem to be any point in deserialising strings in all functions dealing with strings (because, why stop at names). We know things can go wrong only for string → symbol
on systems with non-UTF8 locale, so we should only try to fix it for symbol → string
and only on these systems.
In fact I think that if we do something for this issue, it should necessarily be a palliative, as a general solution can only be implemented in R itself.
In that spirit, the parallel symbol table idea seems to be the right one. We only add symbols to it when we detect a translation problem. This way we have something that works according to expectations for virtually all cases. You'd have to create an artificial case to make this solution misbehave.
It looks like we need to fix names2()
, because this is where lookup can fail. We might also need to create a replacement for ls()
. I wouldn't go as far as fixing all string handling functions, because character vectors are not usually involved in lossy transitions unless explicitly asked for (e.g., enc2native()
).
Fixing names2()
and env_list()
eliminates the need for a parallel symbol table.
env_names()
perhaps.
Closing in favor of #2471.
With the move to
rlang
, symbols serve as containers for column names, at least temporarily. Unfortunately, the roundtrip character -> symbol -> character is lossy for foreign UTF-8 characters that are not representable in the current locale (#1950, https://github.com/hadley/rlang/pull/43). The following approach will provide almost full support for such column names with only limited effort:This only requires that the conversion character -> symbol is lossless and injective for the data frames involved. We should therefore check for duplicate column names in the native encoding (but still report clashes for the original, unmangled names).