tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Performance slowdown using across within mutate #6985

Open nirguk opened 7 months ago

nirguk commented 7 months ago

I believe this is an unexplored performance issue, seemingly relating to dplyr::expand_across

Benchmarked over 1,000 iterations of processing the ames data, there is a marked difference between direct mutation and indirect mutation facilitated by across, both when using where() selection and when using explicit all_of(c(...)) selection. The slowdown in the latter case (direct listing through all_of(c(...))) suggests that the issue is not related to checking column properties, as where() has to.

I think the performance issue is significant, with direct mutation approx 3x faster than that mediated by across

# A tibble: 4 × 9
  expression                                             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 acrosswhere_func(ames_narrow)                       3.69ms   4.26ms      219.    1.75MB     7.70   966    34      4.42s
2 across_all_of_func(ames_narrow)                      3.3ms   3.83ms      256.   64.73KB     8.20   969    31      3.78s
3 direct_mutate_func(ames_narrow)                      1.1ms   1.26ms      766.   48.59KB     8.52   989    11      1.29s
4 direct_mutate_with_class_detect_func(ames_narrow)   1.22ms   1.36ms      722.   71.12KB     8.77   988    12      1.37s

I came across #6897 and considered whether this was related; however, I believe it is something else. Here, when using across, I use the anonymous function syntax as advised.

First a reprex, and then my session info:

library(bench)
library(tidyverse)
library(modeldata)
options("lifecycle_verbosity"="error")

(ames_narrow <- ames |> select(1:5))

num_op <- mean
char_op <- identity

acrosswhere_func <- function(a){
  mutate(a,
         across(where(is.numeric),\(x){num_op(x)}),
         across(where(is.character)|where(is.factor),\(x){char_op(x)}))
}

across_all_of_func <- function(a){
  mutate(a,
         across(all_of(c("Lot_Frontage","Lot_Area")),\(x){num_op(x)}),
         across(all_of(c("MS_SubClass","MS_Zoning","Street")),\(x){char_op(x)}))
}

direct_mutate_func <- function(a){
  mutate(a,
         Lot_Frontage = num_op(Lot_Frontage),
         Lot_Area = num_op(Lot_Area),
         MS_SubClass =  char_op(MS_SubClass),
         MS_Zoning =  char_op(MS_Zoning),
         Street = char_op(Street))
}

direct_mutate_with_class_detect_func <- function(a){

  l <- map_lgl(a,\(x)is.numeric(x))
  numnames <- names(l[l])
  l <- map_lgl(a,\(x){is.character(x)|is.factor(x)})
  catnames <- names(l[l])

  mutate(a,
         Lot_Frontage = num_op(Lot_Frontage),
         Lot_Area = num_op(Lot_Area),
         MS_SubClass =  char_op(MS_SubClass),
         MS_Zoning =  char_op(MS_Zoning),
         Street = char_op(Street))
}

b1 <- mark(acrosswhere_func(ames_narrow),
           across_all_of_func(ames_narrow),
           direct_mutate_func(ames_narrow),
           direct_mutate_with_class_detect_func(ames_narrow),iterations = 1000L)

select(b1,1:9)

session info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] modeldata_1.2.0 lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
 [8] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0 bench_1.1.3    

loaded via a namespace (and not attached):
 [1] rstudioapi_0.15.0 magrittr_2.0.3    hms_1.1.3         tidyselect_1.2.0  munsell_0.5.0     timechange_0.2.0 
 [7] colorspace_2.1-0  R6_2.5.1          rlang_1.1.3       fansi_1.0.4       tools_4.2.2       grid_4.2.2       
[13] gtable_0.3.4      utf8_1.2.3        cli_3.6.2         withr_2.5.0       lifecycle_1.0.3   tzdb_0.4.0       
[19] vctrs_0.6.5       glue_1.6.2        stringi_1.7.8     compiler_4.2.2    pillar_1.9.0      generics_0.1.3   
[25] scales_1.2.1      profmem_0.6.0     pkgconfig_2.0.3 

etiennebacher commented 7 months ago

Hi, I'm not a dplyr dev (or a tidyverse dev at all), but I'm not sure what you expect here. across() simply has to do more operations since it must evaluate the tidy selection passed in .cols and there are probably other checks and steps that need to be done. Note that the across() call with where() is the slowest because it must evaluate the condition on all columns and retain only those where this condition is true.
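To make the extra work concrete, here is a minimal sketch of what the where() selection amounts to and what the across() call ends up computing. The toy tibble `df` and the variable names are made up for illustration, and this is not how dplyr implements across() internally; it only shows that the selection step has to visit every column before the per-column mutation can run.

```r
library(dplyr)

# Hypothetical toy data: two numeric columns and one character column.
df <- tibble(a = c(1, 2, 3), b = c(4, 5, 6), g = c("x", "y", "z"))

# Roughly what where(is.numeric) has to work out: the predicate is
# evaluated on every column, and only the matching names are kept.
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]

# The across() version: selection first, then the function per column.
via_across <- mutate(df, across(where(is.numeric), \(x) mean(x)))

# The hand-written version: no selection step, one expression per column.
direct <- mutate(df, a = mean(a), b = mean(b))

# Both routes should produce the same columns and values.
same_result <- identical(as.data.frame(via_across), as.data.frame(direct))
```

The results match, so the timing gap in the benchmark is a fixed cost of selecting and expanding the columns, not a difference in the computation applied to the data.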

Moreover, this timing difference barely scales with the number of rows and columns in the data (except for where() that increases with the number of columns). On my machine, the difference is always 3-4ms. I don't think this overhead is important, but if it is in your case maybe you should consider alternative packages like data.table that are built for performance.
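For anyone who wants to check the scaling claim on their own machine, a rough sketch follows. The helper names `time_ms` and `make_df`, the row counts, and the repetition count are arbitrary choices for illustration, and the timing uses base R's proc.time() rather than bench, so the numbers will be noisier than in the benchmark above.

```r
library(dplyr)

# Hypothetical helper: average wall-clock time of fn() in milliseconds.
time_ms <- function(fn, reps = 20) {
  start <- proc.time()[["elapsed"]]
  for (i in seq_len(reps)) fn()
  (proc.time()[["elapsed"]] - start) / reps * 1000
}

# Hypothetical helper: a small table with n rows.
make_df <- function(n) {
  tibble(a = runif(n), b = runif(n), g = sample(letters, n, replace = TRUE))
}

row_counts <- c(1e2, 1e4, 1e5)

results <- do.call(rbind, lapply(row_counts, function(n) {
  df <- make_df(n)
  data.frame(
    rows = n,
    across_ms = time_ms(function() {
      mutate(df, across(all_of(c("a", "b")), \(x) mean(x)))
    }),
    direct_ms = time_ms(function() {
      mutate(df, a = mean(a), b = mean(b))
    })
  )
}))

# If the overhead is a fixed cost, this column should stay roughly flat
# as the row count grows.
results$overhead_ms <- results$across_ms - results$direct_ms
print(results)
```

If `overhead_ms` stays roughly constant while `rows` grows by orders of magnitude, that supports the view that the cost comes from setting up across() and not from the data size.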