Thanks @jmaspons. I did try that once, but it really mucked up lots of languages - lots of name:<language> values are not necessarily UTF-8, even when they should be. It'd be great if that would work ...
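One way such values could be screened is with base R's validUTF8(); a minimal sketch, with made-up tag values:
vals <- c ("M\u00fcnchen", "M\xfcnchen") # valid UTF-8 bytes vs latin1 bytes
validUTF8 (vals)
#> [1]  TRUE FALSE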
I just set UTF-8 for all text that comes from user input: tags and usernames. If something is possible, such as adding non-ascii characters, users will do it.
As the responses from overpass servers and the OSM db use UTF-8, no other encodings should be possible: https://github.com/ropensci/osmdata/blob/88ba93917dc7735cd664853cb6cf2eafa232b5e8/tests/testthat/fixtures/osm-multi.osm#L1
I took this as an opportunity to learn more about Encoding. It's a vectorized function, so any call to Encoding() <- is applied, in our case, to the columns of a data.frame element-by-element. That's potentially inefficient, so I just wanted to ensure we weren't inadvertently greatly reducing efficiency here.
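A minimal sketch of the distinction, reusing the latin1 example string from ?Encoding: Encoding() <- only re-labels the declared encoding of each element, whereas enc2utf8() (like iconv()) converts the underlying bytes.
x <- "fran\xE7ais"
Encoding (x) <- "latin1"
y <- x
Encoding (y) <- "UTF-8" # re-labels only; the latin1 bytes are left untouched
validUTF8 (y)
#> [1] FALSE
z <- enc2utf8 (x) # converts the bytes from the declared latin1 encoding
z
#> [1] "français"
Encoding (z)
#> [1] "UTF-8"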
First, a function adapted from ?Encoding to construct a character string:
make_char <- function (keep_utf = FALSE) {
x <- "fran\xE7ais"
Encoding(x) <- "latin1"
x <- iconv(x, "latin1", "UTF-8")
if (!keep_utf) Encoding(x) <- "unknown"
return (x)
}
make_char ()
#> [1] "français"
This code then shows that enc2utf8 is around 30% more efficient than Encoding() <-, presumably because of more efficient vectorization(?):
x <- make_char ()
y <- z <- rep (x, 1e6)
convy <- function (y) {
Encoding (y) <- "UTF-8"
y
}
convz <- function (z) {
enc2utf8 (z)
}
bench::mark (convy(y), convz(z)) [, 1:6]
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 convy(y) 51.7ms 52ms 19.2 7.63MB 8.23
#> 2 convz(z) 32.6ms 36ms 27.4 7.63MB 11.0
Then to the real business, starting with a function to make a data.frame with columns having different encodings:
make_df <- function (ncols = 20, nrows = 100) {
xf <- make_char (FALSE)
xt <- make_char (TRUE)
index <- runif (ncols) > 0.5
dat <- lapply (index, function (i) rep (make_char (i), nrows))
data.frame (do.call (cbind, dat))
}
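A quick sanity check of that construction (the output varies between runs, because the column encodings are chosen at random):
x <- make_df (ncols = 4, nrows = 2)
vapply (x, function (i) Encoding (i [1]), character (1))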
Then the rest of this compares four different ways of writing the setenc_utf8() function (noting the final class() <- lines, because the lapply call strips all class info):
# Apply `Encoding` directly to all columns:
setenc_utf8_a <- function(x) {
cl <- class (x)
char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
x [char_cols] <- lapply (x [char_cols], function (y) {
Encoding (y) <- "UTF-8"
y
})
class (x) <- cl
return (x)
}
# Apply `enc2utf8` directly to all columns:
setenc_utf8_b <- function(x) {
cl <- class (x)
char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
x [char_cols] <- lapply (x [char_cols], function (y) {
enc2utf8 (y)
})
class (x) <- cl
return (x)
}
# Apply `Encoding` only to columns identified as having different encodings:
setenc_utf8_c <- function(x) {
cl <- class (x)
char_cols <- which (vapply (x, function (i) {
is.character (i) & any (Encoding (i) != "UTF-8")
}, FUN.VALUE = logical (1)))
x [char_cols] <- lapply (x [char_cols], function (y) {
Encoding (y) <- "UTF-8"
y
})
class (x) <- cl
return (x)
}
# Apply `enc2utf8` only to columns identified as having different encodings:
setenc_utf8_d <- function(x) {
cl <- class (x)
char_cols <- which (vapply (x, function (i) {
is.character (i) & any (Encoding (i) != "UTF-8")
}, FUN.VALUE = logical (1)))
x [char_cols] <- lapply (x [char_cols], function (y) {
enc2utf8 (y)
})
class (x) <- cl
return (x)
}
... then the results:
x <- make_df ()
bench::mark (
setenc_utf8_a (x),
setenc_utf8_b (x),
setenc_utf8_c (x),
setenc_utf8_d (x),
check = TRUE
) [, 1:6]
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 setenc_utf8_a(x) 202µs 212µs 4187. 18.58KB 12.7
#> 2 setenc_utf8_b(x) 150µs 156µs 6274. 9.47KB 14.5
#> 3 setenc_utf8_c(x) 223µs 236µs 3899. 33.38KB 8.25
#> 4 setenc_utf8_d(x) 199µs 205µs 4747. 33.38KB 10.4
And the second version, just applying enc2utf8 to all columns regardless of current encoding, is the most efficient by at least 25%. The next bits show that relative efficiencies scale in highly non-linear ways:
nrows <- round (10 ^ ((2:8) / 2))
# Compare version "b" to the other 3 for different `nrows`:
compare1 <- function (nrows = 100) {
x <- make_df (nrows = nrows)
res <- as.numeric (bench::mark (
setenc_utf8_a (x),
setenc_utf8_b (x),
setenc_utf8_c (x),
setenc_utf8_d (x),
check = TRUE
)$median)
res <- res [-2] / res [2]
names (res) <- c ("a", "c", "d")
res
}
out <- lapply (nrows, function (i) compare1 (nrows = i))
out <- data.frame (do.call (rbind, out))
out$nrows <- nrows
out <- tidyr::pivot_longer (out, !nrows)
library (ggplot2)
ggplot (out, aes (x = nrows, y = value, colour = name)) +
geom_line () +
scale_x_log10 () +
theme (
legend.position = "inside",
legend.position.inside = c (0.1, 0.9),
legend.title = element_blank ()
)
Created on 2024-07-24 with reprex v2.1.1
Those are relative inefficiencies of all ways other than the best, which is simply applying enc2utf8() to every column. The nrows axis is logarithmically scaled, so using enc2utf8 becomes much more efficient with increasing numbers of rows.
enc2utf8() instead of Encoding() <- in lapply calls over data frame columns

https://rlang.r-lib.org/reference/as_utf8_character.html
Same code as above for setenc_utf8_a() and setenc_utf8_b(), but with an additional version:
setenc_utf8_c <- function(x) {
cl <- class (x)
char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
x [char_cols] <- lapply (x [char_cols], function (y) {
rlang::as_utf8_character (y)
})
class (x) <- cl
return (x)
}
The final plot then looks like this, comparing both with base R enc2utf8():
So rlang then becomes even more efficient than base R for large numbers of rows. I don't think it's worth adding an extra dependency just for this case, and suspect we'd never really notice the difference in practice, but wanted it recorded here for future reference regardless.
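For completeness, a direct vector-level comparison could be sketched like this (it reuses make_char() from above, assumes rlang is installed, and the timings will of course vary by machine):
x <- rep (make_char (), 1e6)
bench::mark (
    base = enc2utf8 (x),
    rlang = rlang::as_utf8_character (x)
) [, 1:6]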
Thanks for the analysis! I will also use the insights in osmapiR :smile:
I tried an alternative that turned out to be the slowest:
setenc_utf8_mat <- function(x) {
cl <- class (x)
char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
m <- as.matrix(x [char_cols])
m <- enc2utf8(m)
x [char_cols] <- m
class (x) <- cl
return (x)
}
Another alternative could be to look for options on the Rcpp side ... I will have a look.
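A hedged sketch of what that might look like - not osmdata code, just an illustration via Rcpp::cppFunction() of the C-level API, and note that it only re-labels each element as UTF-8 rather than re-encoding the bytes:
Rcpp::cppFunction ('
CharacterVector mark_utf8_cpp (CharacterVector x) {
    CharacterVector out = clone (x); // do not modify the input in place
    for (R_xlen_t i = 0; i < out.size (); ++i) {
        SEXP s = STRING_ELT (out, i);
        if (s != NA_STRING) {
            // re-create each CHARSXP with a declared UTF-8 encoding
            SET_STRING_ELT (out, i, Rf_mkCharLenCE (CHAR (s), LENGTH (s), CE_UTF8));
        }
    }
    return out;
}
')
Encoding (mark_utf8_cpp (make_char ()))
#> [1] "UTF-8"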
I committed the changes in #347
Add code to ensure that class attributes are not lost in the lapply calls over data frame columns

I haven't found any case where the input isn't a pure data.frame. Furthermore, the code doesn't change the input but only the items of the input (df columns or list items), and so shouldn't change the class of the input:
x <- data.frame(1:2, LETTERS[1:2])
class(x) <- c("myClass", "data.frame")
char_cols <- which (vapply (x, is.character, FUN.VALUE = logical (1)))
x [char_cols] <- lapply (x [char_cols], function (y) {
enc2utf8 (y)
})
class(x)
#> [1] "myClass" "data.frame"
Created on 2024-07-25 with reprex v2.1.0
Some values contain UTF-8 characters, but osmdata doesn't set the encoding. This can be problematic (e.g. for data in a package during R CMD check).
I'll try to do a PR
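A minimal illustration of the kind of value involved (the string is made up, and the final conversion step assumes a UTF-8 locale):
x <- "caf\u00e9"
Encoding (x)
#> [1] "UTF-8"
Encoding (x) <- "unknown" # mimic a value returned without a declared encoding
Encoding (x)
#> [1] "unknown"
Encoding (enc2utf8 (x))
#> [1] "UTF-8"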