shabbychef / fromo

Fast Robust Moments in R with Rcpp

Parallel computation #4

Open shabbychef opened 8 years ago

shabbychef commented 8 years ago

So RcppParallel is not cutting it for me. There has to be another way to speed these things up.

AndreMikulec commented 5 years ago

Steven Pav,

I tried reading (I partially understood)

Numerically Stable, Single-Pass,
Parallel Statistics Algorithms
Janine Bennett
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.214.8508&rep=rep1&type=pdf

and I noticed

// preallocate the Binomial Coefficients, for efficiency
// this is R code used to generate C code. [ducks] 
https://github.com/shabbychef/fromo/blob/dev/src/common.h

What is it (magic/secret/genius ideas) that makes "fromo" so much faster than everything else (besides what I read in the /README.html)?

Does "preallocate the Binomial Coefficients" make fromo faster?

Thanks, Andre

shabbychef commented 5 years ago

For the most part the speed advantage of fromo comes from:

  1. using the right algorithm.
  2. computing only what is needed.

From what I can tell, both the roll and RcppRoll packages recompute statistics entirely anew for each movement of the 'rolling window'. This is incredibly wasteful, and results in runtimes that grow with window size. On the other hand, RollingWindow and fromo reuse the computations, and only require a constant amount of work per observation. Here is a demonstration of that, on the running (err, 'rolling') standard deviation calculation:

library(roll)
library(RcppRoll)
library(RollingWindow)
library(fromo)
library(microbenchmark)
set.seed(1234)
xm <- matrix(rnorm(1e4),ncol=1)

options(width=180) 
microbenchmark(sd(xm),
               roll::roll_sd(xm,10,min_obs=3),
               roll::roll_sd(xm,100,min_obs=3),
               roll::roll_sd(xm,1000,min_obs=3),
               RcppRoll::roll_sd(xm,10),
               RcppRoll::roll_sd(xm,100),
               RcppRoll::roll_sd(xm,1000),
               RollingWindow::RollingStd(xm,10),
               RollingWindow::RollingStd(xm,100),
               RollingWindow::RollingStd(xm,1000),
               fromo::running_sd(xm,10,min_df=3,restart_period=1e5),
               fromo::running_sd(xm,100,min_df=3,restart_period=1e5),
               fromo::running_sd(xm,1000,min_df=3,restart_period=1e5))
Unit: microseconds
                                                            expr       min         lq       mean     median         uq        max neval  cld
                                                          sd(xm)    66.785    87.4030   105.2299   103.7165   116.2375    196.383   100 a   
                              roll::roll_sd(xm, 10, min_obs = 3)   312.784   381.2520   438.4409   398.2765   440.4605   1904.501   100 a   
                             roll::roll_sd(xm, 100, min_obs = 3)  1546.780  1593.7530  1646.9366  1618.1175  1657.4775   2425.043   100 a   
                            roll::roll_sd(xm, 1000, min_obs = 3) 13009.747 13149.5125 13507.1326 13292.4655 13472.2305  16581.365   100   c 
                                       RcppRoll::roll_sd(xm, 10)  1203.263  1337.2380  2370.5459  1410.0520  1671.3985  70737.574   100 a   
                                      RcppRoll::roll_sd(xm, 100)  4965.116  5492.6245  6054.9059  5768.3595  6676.6210   9114.550   100  b  
                                     RcppRoll::roll_sd(xm, 1000) 34780.605 36923.9845 44344.0407 38597.1805 41755.9890 118114.424   100    d
                               RollingWindow::RollingStd(xm, 10)   217.765   241.7990   356.7557   279.2195   313.1160   2029.249   100 a   
                              RollingWindow::RollingStd(xm, 100)   216.596   243.5740   391.0940   274.9170   318.5725   1981.543   100 a   
                             RollingWindow::RollingStd(xm, 1000)   207.704   240.8450   330.0146   269.1445   316.8970   1807.642   100 a   
   fromo::running_sd(xm, 10, min_df = 3, restart_period = 1e+05)   128.086   140.0570   164.3064   160.3235   178.6625    269.923   100 a   
  fromo::running_sd(xm, 100, min_df = 3, restart_period = 1e+05)   128.517   142.3745   173.3638   159.4400   177.8725   1120.217   100 a   
 fromo::running_sd(xm, 1000, min_df = 3, restart_period = 1e+05)   130.263   139.3750   193.9401   152.9830   179.6595   2102.280   100 a   
sessionInfo()
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-2.1 fromo_0.2.0.001        RollingWindow_0.2     
[4] RcppRoll_0.2.2         roll_1.0.7            

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0          assertthat_0.2.0    dplyr_0.7.8        
 [4] R6_2.3.0            grid_3.3.2          plyr_1.8.4         
 [7] gtable_0.2.0        magrittr_1.5        scales_1.0.0       
[10] RcppParallel_4.3.20 ggplot2_3.1.0       pillar_1.2.1       
[13] rlang_0.3.0.1       lazyeval_0.2.1      bindrcpp_0.2.2     
[16] glue_1.3.0          purrr_0.2.5         munsell_0.5.0      
[19] pkgconfig_2.0.2     colorspace_1.3-2    tidyselect_0.2.5   
[22] bindr_0.1.1         tibble_1.4.2       
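
To make the 'constant amount of work per observation' point concrete, here is a stripped-down R sketch of the update idea. This is not fromo's implementation (the real code does Welford-style updates in C++, and uses restart_period to periodically recompute from scratch to control accumulated rounding error); it is just the shape of the computation:

# hypothetical illustration: keep a running sum and sum of squares,
# adding the newest observation and dropping the one leaving the window.
sketch_running_sd <- function(x, window) {
  n   <- length(x)
  out <- rep(NA_real_, n)
  s1  <- 0    # sum of the current window
  s2  <- 0    # sum of squares of the current window
  for (i in seq_len(n)) {
    s1 <- s1 + x[i]
    s2 <- s2 + x[i]^2
    if (i > window) {       # constant work per step: one add, one drop
      s1 <- s1 - x[i - window]
      s2 <- s2 - x[i - window]^2
    }
    m <- min(i, window)
    if (m > 1) out[i] <- sqrt((s2 - s1^2 / m) / (m - 1))
  }
  out
}

The cost per step does not depend on the window size, which is why the fromo and RollingWindow timings above are flat in the window. The catch is that the s2 - s1^2/m form can cancel catastrophically, which is the numerical problem the Welford updates (and the restart_period) are there to address.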

For some statistics, recomputing for every window is far far simpler than an algorithm that reuses computations. The running median is one such statistic, and fromo only offers an approximate median, based on moments and the Cornish Fisher approximation (which could be terrible).
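
By comparison, a naive rolling median that recomputes on every window is only a few lines of R (at the cost of work per step that grows with the window):

# naive rolling median: recompute from scratch at each window position
roll_median_naive <- function(x, window) {
  sapply(seq_along(x), function(i) {
    if (i < window) NA_real_ else median(x[(i - window + 1):i])
  })
}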

As for the second point, computing only what is required: I shifted a lot of options into template parameters. This means that when the user has set na_rm=FALSE, the code does not have to check the na_rm parameter at every step of the loop to see whether the value must be checked for NA; because na_rm is a template parameter, the compiler never emits that check. The downside is that compilation takes much longer than I would like, and the resulting object is much larger than I would like: every choice of input parameters gets routed to the most slimmed-down version of the code required, but that's a lot of code.
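
As a hand-written R analogy of what the template parameter buys (in fromo the specialization happens at compile time in C++; this snippet only illustrates the idea of hoisting the check out of the loop):

# one generic loop pays for the na_rm branch on every element:
sum_generic <- function(x, na_rm = FALSE) {
  acc <- 0
  for (xi in x) {
    if (na_rm && is.na(xi)) next   # checked at every iteration
    acc <- acc + xi
  }
  acc
}

# two specialized loops, each containing only the work it needs;
# in C++ the templates generate these specializations automatically:
sum_keep_na <- function(x) {
  acc <- 0
  for (xi in x) acc <- acc + xi
  acc
}
sum_drop_na <- function(x) {
  acc <- 0
  for (xi in x) if (!is.na(xi)) acc <- acc + xi
  acc
}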

Finally, it is not at all clear to me that the 'summarizing' moments (as opposed to the running moments) that fromo produces are computed by the optimal algorithm. I suspect that a single pass to compute the mean, followed by a second pass to compute the higher-order centered sums (even with Kahan's compensated summation), will be faster than Welford's method, which is what is currently used. (This should be addressed by #28.) Some timings indicate there are still improvements to be made:

library(moments)

xm <- matrix(rnorm(1e6),ncol=1)

options(width=180) 
microbenchmark(sd(xm),
               moments::skewness(xm),
               moments::kurtosis(xm),
               fromo::skew4(xm),
               fromo::kurt5(xm))
Unit: milliseconds
                  expr        min         lq      mean     median         uq       max neval  cld
                sd(xm)   5.689236   6.860844  13.31640   8.083677   9.343979  82.86736   100 a   
 moments::skewness(xm)  90.922670  97.368995 112.33575 102.015251 106.746956 208.96283   100   c 
 moments::kurtosis(xm)  91.224914  96.869382 117.62271 102.156984 119.950991 242.23187   100   c 
      fromo::skew4(xm)  79.295108  82.418505  86.34846  84.377049  87.876435 111.49282   100  b  
      fromo::kurt5(xm) 147.876266 155.700950 163.47877 160.916013 164.667530 211.34752   100    d
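
For reference, here is a sketch in R of the two-pass scheme contemplated above; the real thing (per #28) would be in C++ with Kahan compensation on the inner sums, which plain R largely does not need since sum() already accumulates in extended precision:

# hypothetical two-pass skewness: one pass for the mean,
# a second pass for the centered sums, then standardize.
two_pass_skew <- function(x) {
  n  <- length(x)
  mu <- mean(x)          # first pass
  d  <- x - mu           # second pass: centered sums
  m2 <- sum(d^2) / n
  m3 <- sum(d^3) / n
  m3 / m2^1.5
}
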
AndreMikulec commented 5 years ago

@shabbychef,

Thanks for the awesome answer.

My mind is blown by the performance. Your answer has made me think more about global optimization. Sometime in the near future, I may try your methods (copy, modify slightly, and paste) and try to make a rolling Sortino ratio.

Again, thanks for the great answer.

shabbychef commented 5 years ago

I am afraid that the Sortino ratio likely falls into the category of 'hard to create an update formula', like the median, and you might be better off using the naive algorithm of recomputing on every window. The reason is that it will be difficult to determine which old observations fall below the new, updated mean in order to compute the new downside semivariation.
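
If it helps, a hypothetical naive version (the function name and the use of downside semideviation below a fixed threshold are mine, not anything in fromo) could look like:

# hypothetical helper: recompute the Sortino on each window.
# 'thresh' is the minimum acceptable return.
naive_rolling_sortino <- function(x, window, thresh = 0) {
  sapply(seq_along(x), function(i) {
    if (i < window) return(NA_real_)
    w  <- x[(i - window + 1):i]
    dn <- sqrt(mean(pmin(w - thresh, 0)^2))   # downside semideviation
    if (dn == 0) return(NA_real_)             # nothing below the threshold
    (mean(w) - thresh) / dn
  })
}

Note the work per step grows with the window size, which is exactly the cost profile of roll and RcppRoll in the timings above.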

I also have to say that, as a person who has studied the Sharpe ratio extensively (cf. the Short Sharpe Course, sharperat.io, the SharpeR package, etc.), I do not find the Sortino ratio very useful. I can think of no statistical question in performance testing that is answered by the Sortino. If you find that you must take skewness into account, I would suggest Yong Bao's skewness correction to the Sharpe, or Mertens' form of the standard error for testing, or even Bailey & Lopez de Prado's weird Frankenstein heuristic, before resorting to the Sortino.

AndreMikulec commented 5 years ago

@shabbychef

Sortino Ratio likely falls into the category of 
'hard to create an update formula'

Yes, I had some idea that a one-to-one copy and paste would not work. Maybe I will (inefficiently) do multiple scans: one with the below-threshold observations (the threshold typically being zero) zeroed out, and one without. This would make the program half as fast (or 2x as slow, whichever way one wants to look at it). Also, the program would use twice as much memory.

I do not find the Sortino ratio very useful.  
I can think of no statistical question in performance 
testing which is answered by the Sortino. 

Mathematically, yes, better ratios exist.

However, my interests are different and more 'emotional.' I am trying to avoid large, quick drawdowns, e.g. big crashes (like stock market crashes), while not caring about large upswings (like stock market booms).

Thanks, Andre

shabbychef commented 5 years ago

Well, while we are off topic, I might suggest some work I did on generalizing the Sharpe that can be couched in terms of running moments, and so could be computed via fromo functions.
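
As a basic example of the kind of thing that can be built directly from fromo's running moments (not the generalization I mean above, and assuming a running_mean counterpart to the running_sd used earlier), a windowed Sharpe-type ratio is just a ratio of running moments:

# rough sketch: windowed signal-to-noise as a ratio of running moments
# (250-observation window; the data here are made up for illustration)
library(fromo)
set.seed(2018)
x   <- rnorm(1e4, mean = 5e-4, sd = 0.01)
win <- 250
snr <- fromo::running_mean(x, win) / fromo::running_sd(x, win)

(I believe fromo also exposes a running_sharpe that does this in one call.)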

AndreMikulec commented 5 years ago

@shabbychef

Thanks, I am reading that pdf now.