tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

Faster `arrange()` #364

Closed · markfairbanks closed 2 years ago

markfairbanks commented 2 years ago

`arrange()` can use `setorder()` to sort faster.

This would have even more benefit in a longer pipe chain where an implicit copy has already been made, since in that case we wouldn't need to call `copy()` at all.

Note: when `arrange()` contains an expression rather than a bare column name we would still need to use `order()`, since `setorder()` only accepts column names. Ex: `arrange(df, x - 1)`
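A minimal sketch of that dispatch logic (not dtplyr's actual implementation; `arrange_dt()` is a hypothetical helper): sort by reference with `setorderv()` when every argument is a bare column name, and fall back to `order()` in `i` for expressions.

```r
library(data.table)

# Hypothetical helper, for illustration only: pick setorderv() when all
# arguments are bare column names, otherwise fall back to i = order(...).
arrange_dt <- function(df, ...) {
  dots <- as.list(substitute(list(...)))[-1]  # unevaluated arguments
  if (all(vapply(dots, is.name, logical(1)))) {
    # All bare columns: setorderv() sorts by reference.
    # (copy() here stands in for dtplyr's copy-on-modify handling.)
    setorderv(copy(df), vapply(dots, as.character, character(1)))
  } else {
    # Expressions like x - 1 aren't column names, so evaluate them in
    # the data and sort via order() in i.
    df[do.call(order, lapply(dots, eval, envir = df))]
  }
}

df <- data.table(x = c(3, 1, 2))
arrange_dt(df, x)      # dispatches to setorderv()
arrange_dt(df, x - 1)  # falls back to order()
```

Both calls return the rows sorted ascending by `x`; the difference is only which sorting path is taken.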

library(stringi)
library(data.table)

rand_strings <- stri_rand_strings(100, 4)

bench::press(
  data_size = c(10000, 1000000, 10000000),
  {
    df <- data.table(a = sample(rand_strings, data_size, TRUE),
                     b = round(runif(data_size), 4))
    bench::mark(
      order = df[order(a, b)],
      setorder = setorder(copy(df), a, b),
      check = FALSE, iterations = 30
    )
  }
)
#> Running with:
#>   data_size
#> 1     10000
#> 2   1000000
#> 3  10000000
#> # A tibble: 6 × 7
#>   expression data_size      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 order          10000   1.06ms   1.38ms    703.      2.01MB     0   
#> 2 setorder       10000   1.01ms   1.23ms    800.    371.95KB     0   
#> 3 order        1000000  56.75ms  66.49ms     15.0    19.09MB     9.97
#> 4 setorder     1000000  45.15ms   51.5ms     19.0    27.67MB    34.9 
#> 5 order       10000000 676.79ms 690.23ms      1.43  190.75MB     4.91
#> 6 setorder    10000000 560.98ms 648.71ms      1.55  276.58MB     3.88

# With implicit copy (from filter/slice/select/summarize)
bench::press(
  data_size = c(10000, 1000000, 10000000),
  {
    df <- data.table(a = sample(rand_strings, data_size, TRUE),
                     b = round(runif(data_size), 4))
    bench::mark(
      order = df[, .(a, b)][order(a, b)],
      setorder = setorder(df[, .(a, b)], a, b),
      check = FALSE, iterations = 30
    )
  }
)
#> Running with:
#>   data_size
#> 1     10000
#> 2   1000000
#> 3  10000000
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 6 × 7
#>   expression data_size      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 order          10000   1.65ms   1.97ms    488.     756.3KB     0   
#> 2 setorder       10000   1.35ms   1.58ms    622.     488.4KB    21.4 
#> 3 order        1000000   73.8ms  80.88ms     11.7     49.7MB     1.30
#> 4 setorder     1000000   59.8ms  67.33ms     14.7       43MB     2.26
#> 5 order       10000000 766.43ms 914.88ms      1.09     496MB     1.71
#> 6 setorder    10000000 628.64ms 733.18ms      1.35   429.2MB     1.67

Created on 2022-06-22 by the reprex package (v2.0.1)

markfairbanks commented 2 years ago

Might be a good idea to wait until #210 is fixed, though.

Edit: Needs #366