pola-rs / r-polars

Polars R binding
https://pola-rs.github.io/r-polars/

Poor performance of `$sort()` #366

Open etiennebacher opened 11 months ago

etiennebacher commented 11 months ago

There's something wrong with polars' sorting performance. I don't know if it's due to the mismatch between our rust-polars version and the latest one, a bug somewhere, or something else:

library(polars)

dat <- data.frame(
  a = sample(letters, 5*1e7, TRUE),
  b = sample(letters, 5*1e7, TRUE),
  c = sample(letters, 5*1e7, TRUE)
)
dat_pl <- pl$DataFrame(dat)

bench::mark(
  dplyr = dplyr::arrange(dat, a, b, c),
  polars = dat_pl$sort("a", "b", "c"),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         7.21s    7.21s    0.139     2.29GB    0.278
#> 2 polars       28.02s   28.02s    0.0357   558.2KB    0
sorhawell commented 11 months ago

EDIT: updated benchmark and comments.

Touché! There is something to this issue.

It appears dplyr relies on a radix sort.

I followed the polars code backwards a bit hastily and arrived at a variation of unstable quicksort in the rayon library for the parallel case, and something similar for the non-parallel case. The stable sort is probably also quicksort-related, I guess.

Radix sort is very powerful when the number of unique values is low, e.g. 26 letters in this case. When sorting random floats or anything with few repeated values, quicksort is faster.
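Not how dplyr/vctrs or data.table actually implement it, but a toy base-R sketch of why low cardinality favors counting/radix approaches: no comparisons at all, one pass to count and one to rebuild.

# Toy counting sort for single letters: O(n + k) with k = 26,
# versus O(n log n) for a comparison sort.
counting_sort_letters <- function(x) {
  counts <- tabulate(match(x, letters), nbins = length(letters))
  rep(letters, times = counts)
}

set.seed(1)
x <- sample(letters, 1e6, TRUE)
identical(counting_sort_letters(x), sort(x))  # TRUE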

It is a fair issue that this is not stated in our docs, nor do we have recommendations on how to get a radix sort with polars. I don't know if that is even possible? I should ask on their Discord.

Here's a benchmark favoring quicksort, on floats and strings, because there are few or no repeated values.

In the findings below, polars is faster than dplyr for letters as well, but that makes sense since n is only 1e7. The complexity of quicksort is roughly O(n log n), while radix sort is O(n) in the best case, so radix sort gets relatively stronger as the data size increases; eventually any radix implementation should overtake a quicksort implementation for random single-letter combinations.

library(polars)
n = 1e7
dat_dbl <- data.frame(
  a = runif(n),
  b = runif(n),
  c = runif(n)
)
dat_dbl_pl <- pl$DataFrame(dat_dbl)

dat_str <- dat_dbl
dat_str[] <- lapply(dat_str, as.character)
dat_str_pl <- pl$DataFrame(dat_str) # yikes not fast conversion here, that's on r-polars

dat_letters <- data.frame(
  a = sample(letters,n,TRUE),
  b = sample(letters,n,TRUE),
  c = sample(letters,n,TRUE)
)
dat_letters_pl <- pl$DataFrame(dat_letters)

bench::mark(
  dplyr_dbl = dplyr::arrange(dat_dbl, a, b, c),
  polars_dbl = dat_dbl_pl$sort("a", "b", "c"),

  dplyr_str = dplyr::arrange(dat_str, a, b, c),
  polars_str = dat_str_pl$sort("a", "b", "c"),

  dplyr_letters = dplyr::arrange(dat_letters, a, b, c),
  dat_letters_pl =  dat_letters_pl$sort("a", "b", "c"),

  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 6 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_dbl         4.35s    4.35s    0.230   596.28MB   0.230 
#> 2 polars_dbl        1.25s    1.25s    0.799   515.48KB   0     
#> 3 dplyr_str        33.66s   33.66s    0.0297  900.74MB   0.0297
#> 4 polars_str        6.17s    6.17s    0.162     4.45KB   0     
#> 5 dplyr_letters     8.28s    8.28s    0.121   468.38MB   0.121 
#> 6 dat_letters_pl     6.4s     6.4s    0.156     4.45KB   0
Created on 2023-08-20 with [reprex v2.0.2](https://reprex.tidyverse.org/)
sorhawell commented 11 months ago

r-polars does have `approx_unique`, which is quite cheap, e.g. ~30 ms for `dat_str_pl`. That could be used first to decide on a sorting strategy.
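A hedged sketch of that idea, assuming `approx_unique` is available as an expression method as mentioned above; the 1% threshold is arbitrary, and method names such as `$to_data_frame()` follow the r-polars API at the time of this thread and may differ in newer releases.

choose_sort <- function(df_r, df_pl, cols) {
  # cheap approximate distinct count of the first key column
  est <- df_pl$select(pl$col(cols[1])$approx_unique())$to_data_frame()[[1]]

  if (est / nrow(df_r) < 0.01) {
    # few distinct keys: a radix-style sort (dplyr::arrange via vctrs) tends to win
    dplyr::arrange(df_r, dplyr::across(dplyr::all_of(cols)))
  } else {
    # mostly distinct keys: polars' comparison-based sort tends to win
    do.call(df_pl$sort, as.list(cols))
  }
}

# e.g. choose_sort(dat_str, dat_str_pl, c("a", "b", "c"))

Note the two branches return different classes (a data.frame vs a polars DataFrame); a real helper would standardize the output.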

etiennebacher commented 11 months ago

Thanks for the info! There's an issue in the polars repo about using radix sort, but no concrete decision has been made yet: https://github.com/pola-rs/polars/issues/10077

jangorecki commented 8 months ago

Just for context.

dplyr's radix sort implementation (it lives in the vctrs package) is derived from data.table's radix sort, which was earlier contributed to base R. If a derived version were implemented for polars, one might want to be aware of license incompatibilities; dplyr handles that in its LICENSE.note file.

The radix sort implementation in data.table is, in my (probably a little biased) opinion, a masterpiece. It was implemented by Matt Dowle and Arun Srinivasan and has evolved over the years since its first implementation in 2014. A benchmark against Intel TBB is in the 2018 slides.

The header of the src/forder.c file provides a good summary, and potentially a roadmap, for anyone willing to dive deep into implementing an optimized radix sort:

  Inspired by :
  icount in do_radixsort in src/main/sort.c @ rev 51389.
  base::sort.list(method="radix") turns out not to be a radix sort, but a counting sort :
    http://r.789695.n4.nabble.com/method-radix-in-sort-list-isn-t-actually-a-radix-sort-tp3309470p3309470.html
  Wish granted for R to allow negatives (simple change) : https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15644
  Terdiman and Herf (LSB radix for floating point and other efficiency tricks) :
  http://codercorner.com/RadixSortRevisited.htm
  http://stereopsis.com/radix.html

  Previous version of this file was promoted into base R, see ?base::sort.
  Denmark useR! presentation https://github.com/Rdatatable/data.table/wiki/talks/useR2015_Matt.pdf
  Stanford DSC presentation https://github.com/Rdatatable/data.table/wiki/talks/DSC2016_ParallelSort.pdf
  JSM presentation https://github.com/Rdatatable/data.table/wiki/talks/JSM2018_Matt.pdf
  Techniques used :
    skewed groups are split in parallel
    finds unique bytes to save 256 sweeping
    skips already-grouped yet unsorted
    recursive group gathering for cache efficiency
    reuses R's global character cache (truelength on CHARSXP)
    compressed column can cross byte boundaries to use spare bits (e.g. 2 16-level columns in one byte)
    just the remaining part of key is reordered as the radix progresses
    columnar byte-key for within-radix MT cache efficiency

  Only forder() and dtwiddle() functions are meant for use by other C code in data.table, hence all other functions here are static.
  The coding techniques deployed here are for efficiency. The static functions are recursive or called repetitively and we wish to minimise
  overhead. They reach outside themselves to place results in the end result directly rather than returning many small pieces of memory.
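As a point of reference for what that header builds on (and skipping every data.table-specific optimisation it lists), here is a minimal byte-wise LSD radix sort for non-negative integers sketched in base R; it is purely illustrative and far from forder()'s actual implementation.

# LSD radix sort, 8 bits per pass: each pass is a stable counting sort on one
# byte, so after the final pass the whole vector is sorted.
radix_sort_int <- function(x, bits = 8L) {
  k      <- bitwShiftL(1L, bits)   # buckets per pass (256)
  passes <- ceiling(31 / bits)     # enough passes for any non-negative int
  for (p in seq_len(passes) - 1L) {
    digit  <- bitwAnd(bitwShiftR(x, p * bits), k - 1L) + 1L
    counts <- tabulate(digit, nbins = k)
    pos    <- cumsum(c(1L, counts[-k]))   # next free slot per bucket
    out    <- integer(length(x))
    for (i in seq_along(x)) {             # stable placement pass
      out[pos[digit[i]]] <- x[i]
      pos[digit[i]]      <- pos[digit[i]] + 1L
    }
    x <- out
  }
  x
}

y <- sample.int(1e6, 1e5)
identical(radix_sort_int(y), sort(y))  # TRUE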
etiennebacher commented 8 months ago

Thanks! r-polars is "just" a wrapper around rust-polars, so I don't think we'll implement this directly here. It could be helpful for the rust-polars devs, though.