markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/

`as.character()` inside `mutate.()` is slow #506

Closed. psychelzh closed this issue 2 years ago.

psychelzh commented 2 years ago

See this example:

data <- data.frame(x = runif(1e6))
bench::mark(
  dplyr = data |> 
    dplyr::mutate(dplyr::across(.fns = as.character)),
  tidytable = data |> 
    tidytable::mutate.(tidytable::across.(.fns = as.character)),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        3.66ms   4.38ms   214.       10.7MB        0
#> 2 tidytable     2.35s    2.35s     0.425    34.6MB        0

Created on 2022-06-20 by the reprex package (v2.0.1)

markfairbanks commented 2 years ago

It looks like this isn't an across.() issue. I'll look into it further though and see what's causing it.

data <- data.frame(x = runif(1e6), y = runif(1e6))
bench::mark(
  dplyr = data |> 
    dplyr::mutate(dplyr::across(.fns = identity)),
  tidytable = data |> 
    tidytable::mutate.(tidytable::across.(.fns = identity)),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        1.69ms   2.04ms      452.    19.5MB     10.6
#> 2 tidytable    5.24ms   5.91ms      150.    19.7MB    158.

markfairbanks commented 2 years ago

For some reason this only occurs with as.character() when overwriting an existing column. I originally thought it might have something to do with string operations, but paste0() performs about the same in tidytable as it does in dplyr.

pacman::p_load(tidytable, dplyr)

data <- data.frame(x = runif(1e6)) %>%
  mutate.(across.(.fns = ~ round(.x, 4)))

bench::mark(
  dplyr_as.character = data |>
    dplyr::mutate(x = as.character(x)),
  tidytable_as.character = data |> 
    mutate.(x = as.character(x)),
  tidytable_new_as.character = data |>
    mutate.(new_x = as.character(x)),
  tidytable_double = data |> 
    mutate.(x = x * 2),
  dplyr_paste0 = data |>
    dplyr::mutate(x = paste0(x, "_")),
  tidytable_paste0 = data |>
    mutate.(x = paste0(x, "_")),
  check = FALSE, iterations = 10
)
#> # A tibble: 6 × 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_as.character            1.1ms   1.35ms    710.      1.21MB     0   
#> 2 tidytable_as.character     437.15ms 442.32ms      2.20   15.28MB     2.63
#> 3 tidytable_new_as.character   1.56ms   1.85ms    528.    340.67KB     0   
#> 4 tidytable_double             2.99ms   3.48ms    278.     15.28MB   185.  
#> 5 dplyr_paste0               580.36ms 594.94ms      1.68   15.28MB     1.12
#> 6 tidytable_paste0           590.97ms 599.06ms      1.67   22.91MB     1.11

markfairbanks commented 2 years ago

The good news: I found a way to fix it by using tidytable's internal fast_copy(), so it's an easy fix. I've been meaning to make that change to mutate.() anyway.

markfairbanks commented 2 years ago

As I was digging into this more I noticed that fast_copy() didn't fix the issue when chaining more mutate.() calls. So I opened a data.table issue.

https://github.com/Rdatatable/data.table/issues/5408

There's a pretty good explanation there of what is occurring. Basically, because tidytable always "prints" the data.table object, it looks slower even though it is just materializing the result at an earlier point than dplyr does. If you follow it up with a second string operation, you'll see that dplyr and tidytable have more or less the same performance.

pacman::p_load(tidytable, dplyr)

data <- data.frame(x = runif(1e6)) %>%
  mutate.(across.(.fns = ~ round(.x, 4)))

bench::mark(
  dplyr = data |>
    dplyr::mutate(x = as.character(x),
                  x = paste0(x, "_")),
  tidytable = data |> 
    mutate.(x = as.character(x),
            x = paste0(x, "_")),
  check = FALSE, iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         591ms    599ms      1.67    16.5MB     2.51
#> 2 tidytable     565ms    565ms      1.77    30.5MB    23.0

If you have any questions let me know.

psychelzh commented 2 years ago

Thanks for your explanation. But what can I do if there are no further operations other than as.character()? The situation is that I want to unnest() the nested data.frames in one column, but the column types of these data.frames are not always the same. So I converted all the columns to character before unnest().

I got it now. So actually as.character() in mutate() did nothing at all, but in mutate.() it "completed" the real task! Thank you very much!
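
The deferred work described above can be observed in base R alone, without dplyr or tidytable. The following is a minimal sketch, assuming R 3.5.0 or later, where as.character() on a numeric vector returns a deferred (ALTREP) string conversion. The variable names are illustrative and timings will vary by machine, so no specific numbers are claimed.

```r
# Minimal base-R sketch (assumes R >= 3.5.0, where as.character() on a
# numeric vector is a deferred ALTREP conversion).
set.seed(1)
x <- round(runif(1e6), 4)

# Step 1: the call itself typically returns almost immediately, because the
# strings have not been built yet.
t_convert <- system.time(y <- as.character(x))

# Step 2: touching every element forces the strings to be created, which is
# where the real cost is expected to show up.
t_force <- system.time(n <- nchar(y))

# Compare elapsed seconds for the two steps.
print(c(convert = t_convert[["elapsed"]], force = t_force[["elapsed"]]))
```

If the second step dominates, that matches the explanation in this thread: a benchmark that stops right after as.character() may only be timing the cheap, deferred part.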

markfairbanks commented 2 years ago

> The situation is that I want to unnest() the nested data.frames in one column, but the column types of these data.frames are not always the same.

It might be worth trying unnest.() without the mutate.(across.()) call. unnest.() uses bind_rows.() which uses data.table::rbindlist() in the background. rbindlist() automatically does type conversion as far as I know. Here's a simple example.

Note that column "y" is character in df1 and double in df2.

library(tidytable, warn.conflicts = FALSE)

df1 <- tidytable(x = 1, y = "y")
df2 <- tidytable(x = 1, y = 2)

nested_df <- tidytable(id = 1, data = list(df1, df2))

nested_df %>%
  unnest.(data)
#> # A tidytable: 2 × 3
#>      id     x y    
#>   <dbl> <dbl> <chr>
#> 1     1     1 y    
#> 2     1     1 2
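
To see the underlying behaviour without tidytable, here is a small sketch that calls data.table::rbindlist() directly on two tables whose y columns differ in type. It assumes the data.table package is installed; dt1, dt2 and combined are illustrative names, not part of the original example.

```r
library(data.table)

# Same shapes as df1 and df2 above: y is character in one, double in the other.
dt1 <- data.table(x = 1, y = "y")
dt2 <- data.table(x = 1, y = 2)

# rbindlist() stacks the tables; the mixed y column is coerced to character.
combined <- rbindlist(list(dt1, dt2))

# Inspect the resulting column types.
str(combined)
```

If this holds for the real nested data, the explicit conversion of every column to character before unnesting may not be needed.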