Closed markfairbanks closed 2 years ago
arrange() can utilize setorder() to be faster.
arrange()
setorder()
This would have even more benefits in a longer pipe chain with an implicit copy where we wouldn't need to call copy() outright.
copy()
Note: In the case of a transmute expression inside arrange() we would need to still use order(). Ex: arrange(df, x - 1)
order()
arrange(df, x - 1)
library(stringi) library(data.table) rand_strings <- stri_rand_strings(100, 4) bench::press( data_size = c(10000, 1000000, 10000000), { df <- data.table(a = sample(rand_strings, data_size, TRUE), b = round(runif(data_size), 4)) bench::mark( order = df[order(a, b)], setorder = setorder(copy(df), a, b), check = FALSE, iterations = 30 ) } ) #> Running with: #> data_size #> 1 10000 #> 2 1000000 #> 3 10000000 #> # A tibble: 6 × 7 #> expression data_size min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> #> 1 order 10000 1.06ms 1.38ms 703. 2.01MB 0 #> 2 setorder 10000 1.01ms 1.23ms 800. 371.95KB 0 #> 3 order 1000000 56.75ms 66.49ms 15.0 19.09MB 9.97 #> 4 setorder 1000000 45.15ms 51.5ms 19.0 27.67MB 34.9 #> 5 order 10000000 676.79ms 690.23ms 1.43 190.75MB 4.91 #> 6 setorder 10000000 560.98ms 648.71ms 1.55 276.58MB 3.88 # With implicit copy (from filter/slice/select/summarize) bench::press( data_size = c(10000, 1000000, 10000000), { df <- data.table(a = sample(rand_strings, data_size, TRUE), b = round(runif(data_size), 4)) bench::mark( order = df[, .(a, b)][order(a, b)], setorder = setorder(df[, .(a, b)], a, b), check = FALSE, iterations = 30 ) } ) #> Running with: #> data_size #> 1 10000 #> 2 1000000 #> 3 10000000 #> Warning: Some expressions had a GC in every iteration; so filtering is disabled. #> # A tibble: 6 × 7 #> expression data_size min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> #> 1 order 10000 1.65ms 1.97ms 488. 756.3KB 0 #> 2 setorder 10000 1.35ms 1.58ms 622. 488.4KB 21.4 #> 3 order 1000000 73.8ms 80.88ms 11.7 49.7MB 1.30 #> 4 setorder 1000000 59.8ms 67.33ms 14.7 43MB 2.26 #> 5 order 10000000 766.43ms 914.88ms 1.09 496MB 1.71 #> 6 setorder 10000000 628.64ms 733.18ms 1.35 429.2MB 1.67
Created on 2022-06-22 by the reprex package (v2.0.1)
Might be a good idea to wait on fixing #210 though
Edit: Needs #366
arrange()
can utilizesetorder()
to be faster.This would have even more benefits in a longer pipe chain with an implicit copy where we wouldn't need to call
copy()
outright.Note: In the case of a transmute expression inside
arrange()
we would need to still useorder()
. Ex:arrange(df, x - 1)
Created on 2022-06-22 by the reprex package (v2.0.1)