Closed psychelzh closed 2 years ago
It looks like this isn't an across.() issue. I'll look into it further, though, and see what's causing it.
data <- data.frame(x = runif(1e6), y = runif(1e6))
bench::mark(
  dplyr = data |>
    dplyr::mutate(dplyr::across(.fns = identity)),
  tidytable = data |>
    tidytable::mutate.(tidytable::across.(.fns = identity)),
  check = FALSE
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 1.69ms 2.04ms 452. 19.5MB 10.6
#> 2 tidytable 5.24ms 5.91ms 150. 19.7MB 158.
For some reason this only occurs with as.character() when overwriting an existing column. I originally thought it might have something to do with string operations, but paste0() doesn't show the same gap: it takes about the same time in dplyr and tidytable.
pacman::p_load(tidytable, dplyr)
data <- data.frame(x = runif(1e6)) %>%
  mutate.(across.(.fns = ~ round(.x, 4)))
bench::mark(
  dplyr_as.character = data |>
    dplyr::mutate(x = as.character(x)),
  tidytable_as.character = data |>
    mutate.(x = as.character(x)),
  tidytable_new_as.character = data |>
    mutate.(new_x = as.character(x)),
  tidytable_double = data |>
    mutate.(x = x * 2),
  dplyr_paste0 = data |>
    dplyr::mutate(x = paste0(x, "_")),
  tidytable_paste0 = data |>
    mutate.(x = paste0(x, "_")),
  check = FALSE, iterations = 10
)
#> # A tibble: 6 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr_as.character 1.1ms 1.35ms 710. 1.21MB 0
#> 2 tidytable_as.character 437.15ms 442.32ms 2.20 15.28MB 2.63
#> 3 tidytable_new_as.character 1.56ms 1.85ms 528. 340.67KB 0
#> 4 tidytable_double 2.99ms 3.48ms 278. 15.28MB 185.
#> 5 dplyr_paste0 580.36ms 594.94ms 1.68 15.28MB 1.12
#> 6 tidytable_paste0 590.97ms 599.06ms 1.67 22.91MB 1.11
The good news: I found a way to fix it by using tidytable's internal fast_copy(), so it's an easy fix. I've been meaning to make that change to mutate.() anyway.
As I was digging into this more, I noticed that fast_copy() didn't fix the issue when chaining more mutate.() calls. So I opened a data.table issue.
https://github.com/Rdatatable/data.table/issues/5408
There's a pretty good explanation there of what is occurring. Basically, since tidytable always "prints" the data.table object, it looks slower even though it's just materializing the result at an earlier point than dplyr does. If you follow it up with a second string operation, you'll see that dplyr and tidytable have more or less the same performance.
pacman::p_load(tidytable, dplyr)
data <- data.frame(x = runif(1e6)) %>%
  mutate.(across.(.fns = ~ round(.x, 4)))
bench::mark(
  dplyr = data |>
    dplyr::mutate(x = as.character(x),
                  x = paste0(x, "_")),
  tidytable = data |>
    mutate.(x = as.character(x),
            x = paste0(x, "_")),
  check = FALSE, iterations = 10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 591ms 599ms 1.67 16.5MB 2.51
#> 2 tidytable 565ms 565ms 1.77 30.5MB 23.0
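The same effect can be seen without either package. This is only a rough sketch in base R, assuming R 3.5 or later, where as.character() on a plain numeric vector returns a deferred (ALTREP) result and the strings are only built when something actually needs them. The variable names are just for illustration and the timings will vary by machine.

```r
# Base R only; assumes R >= 3.5 (deferred string conversion via ALTREP).
x <- round(runif(1e6), 4)

# The conversion call returns quickly because the strings are not built yet.
t_convert <- system.time(y <- as.character(x))[["elapsed"]]

# The first operation that needs every string pays the conversion cost.
t_first_use <- system.time(z <- paste0(y, "_"))[["elapsed"]]

# Compare the two elapsed times (in seconds).
print(c(convert = t_convert, first_use = t_first_use))
```

So a benchmark that stops right after as.character() mostly measures who materializes the strings first, not who converts faster.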
If you have any questions let me know.
Thanks for your explanation. But what can I do if there are no further operations other than as.character()? The situation is that I want to unnest() the nested data.frames in one column, but the column types of these data.frames are not always the same. So I converted all the columns to character before unnest().
I got it now. So actually as.character() in mutate() did nothing at all, but in mutate.() it "completed" the real task! Thank you very much!
> The situation is that I want to unnest() the nested data.frames in one column, but the column types of these data.frames are not always the same.
It might be worth trying unnest.() without the mutate.(across.()) call. unnest.() uses bind_rows.(), which uses data.table::rbindlist() in the background. rbindlist() automatically does type conversion as far as I know. Here's a simple example.
Note that column "y" is character in df1 and double in df2.
library(tidytable, warn.conflicts = FALSE)
df1 <- tidytable(x = 1, y = "y")
df2 <- tidytable(x = 1, y = 2)
nested_df <- tidytable(id = 1, data = list(df1, df2))
nested_df %>%
  unnest.(data)
#> # A tidytable: 2 × 3
#> id x y
#> <dbl> <dbl> <chr>
#> 1 1 1 y
#> 2 1 1 2
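For reference, the type conversion can also be checked by calling data.table::rbindlist() directly. This is just a minimal sketch with made-up table names (dt1, dt2, combined), assuming a reasonably recent data.table; it isn't the code tidytable runs internally.

```r
library(data.table)

# Same shape as df1/df2 above: "y" is character in one table, double in the other.
dt1 <- data.table(x = 1, y = "y")
dt2 <- data.table(x = 1, y = 2)

# rbindlist() binds the rows and coerces "y" to the common type (character).
combined <- rbindlist(list(dt1, dt2))
str(combined)
```

If the columns ever need to end up as character regardless of what they contain, converting after the unnest.() step is still an option.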
See this example:
Created on 2022-06-20 by the reprex package (v2.0.1)