[FEATURE] Set encoding to UTF8 for returned data

jmaspons commented 4 months ago

library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
q <- opq_osm_id(id = "13433228", type = "relation", out = "tags")
d <- osmdata_data_frame(q)
d
#>   osm_type   osm_id border_type boundary historic:admin_level historic:boundary
#> 1 relation 13433228     comarca historic                    7    administrative
#>       name  name:be   name:br  name:ca  name:de  name:en   name:eo  name:es
#> 1 Rosselló Русільён Rousilhon Rosselló Rosselló Rosselló Rusiljono Rosellón
#>      name:fr   name:oc   name:ru    name:uk     type wikidata   wikipedia
#> 1 Roussillon Rosselhon Руссильон Руссільйон boundary   Q15476 ca:Rosselló

sapply(d, Encoding)
#>             osm_type               osm_id          border_type 
#>            "unknown"            "unknown"            "unknown" 
#>             boundary historic:admin_level    historic:boundary 
#>            "unknown"            "unknown"            "unknown" 
#>                 name              name:be              name:br 
#>            "unknown"            "unknown"            "unknown" 
#>              name:ca              name:de              name:en 
#>            "unknown"            "unknown"            "unknown" 
#>              name:eo              name:es              name:fr 
#>            "unknown"            "unknown"            "unknown" 
#>              name:oc              name:ru              name:uk 
#>            "unknown"            "unknown"            "unknown" 
#>                 type             wikidata            wikipedia 
#>            "unknown"            "unknown"            "unknown"

^{Created on 2024-07-23 with reprex v2.1.0}

Some values contain UTF8 characters, but osmdata doesn't set the encoding. This can be problematic (e.g. for data on a package when R CMD check)

I'll try to do a PR

mpadge commented 4 months ago

Thanks @jmaspons. I did try that once, but it really mucked up lots of languages - lots ofname:<language> are not necessarily UTF-8, even when they should be. It'd be great if that would work ...

jmaspons commented 4 months ago

I just set UTF-8 for all text that come from user input: tags and usernames. If something is possible, such as adding non-ascii characters, users will do it.

As the responses from overpass servers and the OSM db use UTF-8, no other encodings should be possible: https://github.com/ropensci/osmdata/blob/88ba93917dc7735cd664853cb6cf2eafa232b5e8/tests/testthat/fixtures/osm-multi.osm#L1

mpadge commented 4 months ago

I took this as an opportunity to learn more about Encoding. It's a vectorized function, so any call to Encoding() <- actually converts, in our case, columns of a data.frame element-by-element. That's potentially inefficient, so I just wanted to ensure we weren't inadvertently greatly reducing efficiency here.

First a function adapted from ?Encoding to construct a character string:

make_char <- function (keep_utf = FALSE) {
    x <- "fran\xE7ais"
    Encoding(x) <- "latin1"
    x <- iconv(x, "latin1", "UTF-8")
    if (!keep_utf) Encoding(x) <- "unknown"
    return (x)
}
make_char ()
#> [1] "français"

This code then shows that enc2utf8 is around 30% more efficient than Encoding() <-, presumably because of more efficient vectorization(?):

x <- make_char ()
y <- z <- rep (x, 1e6)
convy <- function (y) {
    Encoding (y) <- "UTF-8"
    y
}
convz <- function (z) {
    enc2utf8 (z)
}
bench::mark (convy(y), convz(z)) [, 1:6]
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 convy(y)     51.7ms     52ms      19.2    7.63MB     8.23
#> 2 convz(z)     32.6ms     36ms      27.4    7.63MB    11.0

Then to the real business, starting with a function to make a data.frame with columns having different encodings:

make_df <- function (ncols = 20, nrows = 100) {
    xf <- make_char (FALSE)
    xt <- make_char (TRUE)
    index <- runif (ncols) > 0.5
    dat <- lapply (index, function (i) rep (make_char (i), nrows))
    data.frame (do.call (cbind, dat))
}

Then the rest of this compares four different ways of writing the setenc_utf8() function (noting the final class() <- lines, because the lapply call strips all class info):

# Apply `Encoding` directly to all columns:
setenc_utf8_a <- function(x) {
    cl <- class (x)
    char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
    x [char_cols] <- lapply (x [char_cols], function (y) {
        Encoding (y) <- "UTF-8"
        y
    })
    class (x) <- cl
    return (x)
}
# Apply `enc2utf8` directly to all columns:
setenc_utf8_b <- function(x) {
    cl <- class (x)
    char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
    x [char_cols] <- lapply (x [char_cols], function (y) {
        enc2utf8 (y)
    })
    class (x) <- cl
    return (x)
}
# Apply `Encoding` only to columns identified as having different encodings:
setenc_utf8_c <- function(x) {
    cl <- class (x)
    char_cols <- which (vapply (x, function (i) {
        is.character (i) & any (Encoding (i) != "UTF-8")
    }, FUN.VALUE = logical (1)))
    x [char_cols] <- lapply (x [char_cols], function (y) {
        Encoding (y) <- "UTF-8"
        y
    })
    class (x) <- cl
    return (x)
}
# Apply `enc2utf8` only to columns identified as having different encodings:
setenc_utf8_d <- function(x) {
    cl <- class (x)
    char_cols <- which (vapply (x, function (i) {
        is.character (i) & any (Encoding (i) != "UTF-8")
    }, FUN.VALUE = logical (1)))
    x [char_cols] <- lapply (x [char_cols], function (y) {
        enc2utf8 (y)
    })
    class (x) <- cl
    return (x)
}

... then the results:

x <- make_df ()
bench::mark (
    setenc_utf8_a (x),
    setenc_utf8_b (x),
    setenc_utf8_c (x),
    setenc_utf8_d (x),
    check = TRUE
) [, 1:6]
#> # A tibble: 4 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 setenc_utf8_a(x)    202µs    212µs     4187.   18.58KB    12.7 
#> 2 setenc_utf8_b(x)    150µs    156µs     6274.    9.47KB    14.5 
#> 3 setenc_utf8_c(x)    223µs    236µs     3899.   33.38KB     8.25
#> 4 setenc_utf8_d(x)    199µs    205µs     4747.   33.38KB    10.4

And the second version, just applying setenc_utf8 to all columns regardless of current encoding, is the most efficient by at least 25%. The next bits show that relative efficiencies scale in highly non-linear ways:

nrows <- round (10 ^ ((2:8) / 2))
# Compare version "b" to the other 3 for different `nrows`:
compare1 <- function (nrows = 100) {
    x <- make_df (nrows = nrows)
    res <- as.numeric (bench::mark (
        setenc_utf8_a (x),
        setenc_utf8_b (x),
        setenc_utf8_c (x),
        setenc_utf8_d (x),
        check = TRUE
    )$median)
    res <- res [-2] / res [2]
    names (res) <- c ("a", "c", "d")
    res
}
out <- lapply (nrows, function (i) compare1 (nrows = i))
out <- data.frame (do.call (rbind, out))
out$nrows <- nrows
out <- tidyr::pivot_longer (out, !nrows)

library (ggplot2)
ggplot (out, aes (x = nrows, y = value, colour = name)) +
    geom_line () +
    scale_x_log10 () +
    theme (
        legend.position = "inside",
        legend.position.inside = c (0.1, 0.9),
        legend.title = element_blank ()
    )

^{Created on 2024-07-24 with reprex v2.1.1}

Those are relative inefficiencies of all ways other than the best, which is simply applying enc2utf8() to every column. The nrows axis is logarithmically scaled, so using enc2utf8 becomes much more efficient with increasing numbers of rows.

Conclusion

Use enc2utf8() instead of Encoding() <-
Add code to ensure that class attributes are not lost in the lapply calls over data frame columns

mpadge commented 4 months ago

Update

https://rlang.r-lib.org/reference/as_utf8_character.html

Same code as above for setenc_utf8_a() and setenc_utf8_b(), but with additional:

setenc_utf8_c <- function(x) {
    cl <- class (x)
    char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
    x [char_cols] <- lapply (x [char_cols], function (y) {
        rlang::as_utf8_character (y)
    })
    class (x) <- cl
    return (x)
}

The final plot then looks like this, comparing both with base R enc2utf8():

So rlang because then even more efficient than base R for large numbers of rows. I don't think it's worth adding an extra dependency just for this case, and suspect we'd never really notice the difference in practice, but wanted it recorded here for future reference regardless.

jmaspons commented 4 months ago

Thanks for the analysis! I will also use the insights in osmapiR :smile:

I tried an alternative that resulted the slowest:

setenc_utf8_mat <- function(x) {
    cl <- class (x)
    char_cols <- vapply (x, is.character, FUN.VALUE = logical (1))
    m <- as.matrix(x [char_cols])
    m <- enc2utf8(m)
    x [char_cols] <- m 
    class (x) <- cl
    return (x)
}

Another alternative could be to look for options in the Rcpp side ... I will have a look.

jmaspons commented 4 months ago

I commit the changes in #347

Add code to ensure that class attributes are not lost in the lapply calls over data frame columns

I haven't found any case where the input isn't a pure data.frame. Furthermore, the code doesn't change the input but the items of the input (df columns or list items) and ~~shouldn't~~ don't change the class of the input:

x <- data.frame(1:2, LETTERS[1:2])
class(x) <- c("myClass", "data.frame")
char_cols <- which (vapply (x, is.character, FUN.VALUE = logical (1)))
x [char_cols] <- lapply (x [char_cols], function (y) {
    enc2utf8 (y)
})
class(x)
#> [1] "myClass"    "data.frame"

^{Created on 2024-07-25 with reprex v2.1.0}

ropensci / osmdata

[FEATURE] Set encoding to UTF8 for returned data #346

Conclusion

Update