rnabioco / valr

Genome Interval Arithmetic in R
http://rnabioco.github.io/valr/
Other
88 stars 25 forks source link

replace arrange with radix sort #358

Closed kriemo closed 4 years ago

kriemo commented 4 years ago

Switching to base R sorting provides a performance gain over arrange() and future proofs the code against a regression in dev dplyr that results in a ~30x slow down. This PR also replaces the sorting code in bed_random() withbed_sort() to avoid redundancy

Current master performance (CRAN dplyr)

library(valr)
library(microbenchmark)
n <- 1e6
seed_x <- 1010486
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

x <- bed_random(genome, n = n, seed = seed_x)

microbenchmark(bed_sort(x),
               bed_random(genome, n = n, seed = seed_x),
               times = 2,
               unit = 's')
#> Unit: seconds
#>                                      expr        min         lq       mean
#>                               bed_sort(x) 0.03779551 0.03779551 0.03914011
#>  bed_random(genome, n = n, seed = seed_x) 0.68720326 0.68720326 0.69683440
#>      median         uq        max neval cld
#>  0.03914011 0.04048472 0.04048472     2  a
#>  0.69683440 0.70646555 0.70646555     2   b

With these changes that use base r radix sort (ported from data.table since r 3.3.0)

#> Unit: seconds
#>                                      expr        min         lq       mean
#>                               bed_sort(x) 0.02048263 0.02048263 0.02155842
#>  bed_random(genome, n = n, seed = seed_x) 0.15447826 0.15447826 0.16032385
#>      median         uq        max neval cld
#>  0.02155842 0.02263422 0.02263422     2  a
#>  0.16032385 0.16616943 0.16616943     2   b

Created on 2020-03-22 by the reprex package (v0.3.0)

kriemo commented 4 years ago

ok to merge?