r-lib / slider

Sliding Window Functions
https://slider.r-lib.org

Speed of slide_mean() #156

Closed MattCowgill closed 3 years ago

MattCowgill commented 3 years ago

Hi @DavisVaughan, first: I love {slider}, thank you for making it.

I'm keen to replace various other functions in my code with their {slider} equivalents. One problem I have is that zoo::rollmeanr() is faster (for me at least) than slider::slide_mean(). Here is an example:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

size <- 100000
x <- tibble(num = rnorm(size, mean = 10, sd = 2),
            letters = sample(letters, size, replace = TRUE))

f_slider <- function(data) {
  data %>%
    group_by(letters) %>%
    mutate(mean = slider::slide_mean(x = num,
                                     before = 11L,
                                     complete = TRUE))
}

f_zoo <- function(data) {
  data %>%
    group_by(letters) %>%
    mutate(mean = zoo::rollmeanr(num, 12, fill = NA))
}

bench::mark(f_slider(x),
            f_zoo(x))
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f_slider(x)  133.2ms    134ms      7.45    6.49MB       0 
#> 2 f_zoo(x)      21.7ms     22ms     44.0    30.06MB     235.

Created on 2021-06-09 by the reprex package (v2.0.0)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.5 (2021-03-31)
#>  os       macOS Big Sur 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Melbourne
#>  date     2021-06-09
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.2)
#>  bench         1.1.1   2020-01-13 [1] CRAN (R 4.0.2)
#>  cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.5)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.0.2)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
#>  dplyr       * 1.0.6   2021-05-05 [1] CRAN (R 4.0.5)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.0.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.0.2)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
#>  knitr         1.33    2021-04-24 [1] CRAN (R 4.0.2)
#>  lattice       0.20-41 2020-04-02 [1] CRAN (R 4.0.5)
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
#>  pillar        1.6.1   2021-05-16 [1] CRAN (R 4.0.5)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.0.2)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.2)
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.8     2021-05-07 [1] CRAN (R 4.0.2)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  slider        0.2.1   2021-03-23 [1] CRAN (R 4.0.2)
#>  stringi       1.6.2   2021-05-17 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  styler        1.4.1   2021-03-30 [1] CRAN (R 4.0.2)
#>  tibble        3.1.2   2021-05-16 [1] CRAN (R 4.0.2)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.0.2)
#>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.0.2)
#>  warp          0.2.0   2020-10-21 [1] CRAN (R 4.0.2)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.0.5)
#>  xfun          0.23    2021-05-15 [1] CRAN (R 4.0.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#>  zoo           1.8-9   2021-03-09 [1] CRAN (R 4.0.2)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```

I'm not sure whether the problem is on my end (is there something in the example above I should change?) or whether slide_mean() is simply a bit slower than rollmeanr().

Thanks again

MattCowgill commented 3 years ago

A friend of mine ran the same code above and got different results, with slide_mean() slightly faster than zoo::rollmeanr(). That makes me wonder if this is some weird M1 Mac issue.

DavisVaughan commented 3 years ago

Interesting. Here is what I get on my 2018 Intel Mac running Mojave:

library(dplyr)

size <- 100000
x <- tibble(num = rnorm(size, mean = 10, sd = 2),
            letters = sample(letters, size, replace = TRUE))

f_slider <- function(data) {
  data %>%
    group_by(letters) %>%
    mutate(mean = slider::slide_mean(x = num, 
                                     before = 11L,
                                     complete = TRUE))
}

f_zoo <- function(data) {
  data %>%
    group_by(letters) %>%
    mutate(mean = zoo::rollmeanr(num, 12, fill = NA))
}

bench::mark(f_slider(x),
            f_zoo(x))
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f_slider(x)   8.67ms   9.98ms      98.8    7.13MB     12.0
#> 2 f_zoo(x)     28.53ms   28.7ms      33.9   30.06MB    170.

It is possible that this has to do with how efficiently your machine handles long doubles, but I'm not entirely sure.

DavisVaughan commented 3 years ago

Could you try some benchmarks with slide_max() against rollmax()? That doesn't use long doubles.

And then again with slide_sum() against rollsum()? That uses long doubles, but in a slightly simpler way.
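The comparison suggested above could be sketched roughly as follows. It runs on a plain numeric vector so that dplyr's grouping overhead stays out of the timings. The vector size and the 12-element window mirror the earlier reprex; the seed, the `res_max` / `res_sum` names, and `check = FALSE` (timings only, no comparison of the two results) are assumptions made for this illustration.

```r
# Compare slider and zoo on a right-aligned window of 12 elements.
set.seed(123)
num <- rnorm(100000, mean = 10, sd = 2)

# Rolling max: per the comment above, slide_max() does not use long doubles.
res_max <- bench::mark(
  slider = slider::slide_max(num, before = 11L, complete = TRUE),
  zoo    = zoo::rollmaxr(num, k = 12L, fill = NA),
  check  = FALSE
)

# Rolling sum: per the comment above, slide_sum() uses long doubles,
# but in a simpler way than slide_mean().
res_sum <- bench::mark(
  slider = slider::slide_sum(num, before = 11L, complete = TRUE),
  zoo    = zoo::rollsumr(num, k = 12L, fill = NA),
  check  = FALSE
)

print(res_max)
print(res_sum)
```

If the slowdown shows up for the sum but not for the max, that would point toward long double handling rather than the sliding logic itself.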

MattCowgill commented 3 years ago

Hi @DavisVaughan, I upgraded to the native arm64 build of R 4.1.0, and slide_mean() is now extremely fast for me. Thank you. Perhaps something about the Rosetta emulation on M1 Macs running x86 R slows slider down in that situation.

library(tidyverse)
size <- 100000
x <- tibble(num = rnorm(size, mean = 10, sd = 2),
            letters = sample(1L:26L, size, replace = TRUE)
)
f_slider <- function(data) {
  data %>%
    group_by(letters) %>%
    mutate(mean = slider::slide_mean(x = num, 
                                     before = 11L,
                                     complete = TRUE))
}
f_zoo <- function(data) {
  data %>%
    group_by(letters) %>%
    mutate(mean = zoo::rollmeanr(num, k = 12L, fill = NA))
}

bench::mark(f_slider(x),
            f_zoo(x))
#> # A tibble: 3 x 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f_slider(x)      4.8ms   4.98ms     199.     6.34MB     35.3
#> 2 f_zoo(x)       12.32ms  12.99ms      77.4   29.54MB    294. 

Thanks again for a great package

DavisVaughan commented 3 years ago

It does seem that Rosetta 2 emulation uses extended-precision 80-bit long doubles (as Intel Macs do), whereas native ARM supports only 64-bit long doubles (i.e. they are the same as a typical double).

My Mac also uses 80-bit long doubles since it is Intel, but is pretty fast, so maybe there is something strange going on in the Rosetta 2 emulation as you mentioned.
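To see which of these cases a given R session falls into, base R reports the build architecture and some properties of the C `long double` type. This is a small diagnostic sketch, not something from the thread; the `info` list is a name chosen for illustration, and the interpretation in the comments follows the `.Machine` and `capabilities()` documentation.

```r
# Collect the build architecture and long double properties of this R session.
info <- list(
  # "x86_64" for an Intel build (including one run under Rosetta 2),
  # "aarch64" for a native Apple silicon build.
  arch = R.version$arch,
  # TRUE only if R was built to use a long double type longer than double.
  long_double = unname(capabilities("long.double")),
  # Bytes in a C long double; documented as possibly 8 on CPUs such as ARM,
  # where long double is identical to double.
  sizeof_long_double = .Machine$sizeof.longdouble,
  # Significand bits of long double; only present when long doubles are in use,
  # otherwise NULL.
  long_double_digits = .Machine$longdouble.digits,
  # Significand bits of an ordinary double, for comparison.
  double_digits = .Machine$double.digits
)

str(info)
```

An x86 build running under Rosetta 2 should report the same values as a real Intel Mac here, which is consistent with the slowdown coming from the emulation of 80-bit arithmetic rather than from a different numeric type.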

Search "Rosetta 2" here: https://stardot.org.uk/forums/viewtopic.php?t=22495

But there's one gotcha, which nobody (except me) ever seems to mention: ARM currently has no hardware support for floating-point arithmetic with a better precision than 64-bits ('double') whereas x86 has 80-bit floats ('long double'). I can't be alone in having applications which need better than 64-bit precision, typically because many calculations get chained and losing half-an-LSB at each step isn't acceptable. One such application, FIRBBC (which synthesises Finite Impulse Response filters), simply doesn't work reliably with 64-bit floats. So unless and until ARM supports something better than 64-bit floats it can't compete with x86 in some critical applications. It's ironic that Acorn's own early designs for a floating-point ARM coprocessor did support 80-bit floats, but that didn't survive integration with the main CPU. Admittedly Apple's Rosetta 2 emulation, which runs x86 code on the M1, does properly support 80-bit long doubles (in itself an impressive feat) and is a partial solution, but speed is obviously impacted quite significantly.

And https://developer.apple.com/forums/thread/673482