A note about klaR::rda:
In my original comparison, I manually performed model selection over a grid and ignored the model's internal model selection, which uses Nelder-Mead by default (simulated annealing is an option). It makes sense to do both.
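For concreteness, here's a minimal sketch of the two strategies (hypothetical helper names; not the actual benchmark code). With klaR::rda, leaving gamma and lambda unspecified triggers the internal optimizer; I'm assuming the cross-validated error is exposed via the fitted object's error.rate component.

library(klaR)

# Sketch only: manual grid search over RDA's tuning parameters (gamma, lambda).
# Assumes the fitted object reports its cross-validated error in $error.rate.
rda_grid_sketch <- function(x, y, num_tuning = 5) {
  tuning <- expand.grid(gamma = seq(0, 1, length.out = num_tuning),
                        lambda = seq(0, 1, length.out = num_tuning))
  cv_error <- mapply(function(g, l) {
    fit <- klaR::rda(x, grouping = y, gamma = g, lambda = l, crossval = TRUE)
    fit$error.rate["crossval"]
  }, tuning$gamma, tuning$lambda)
  tuning[which.min(cv_error), ]
}

# Built-in model selection: gamma/lambda left unspecified, so klaR::rda
# optimizes them internally (Nelder-Mead by default; simAnn = TRUE switches
# to simulated annealing)
rda_builtin_sketch <- function(x, y) {
  klaR::rda(x, grouping = y)
}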
I did a test run for 10 reps with p = 50, 100, 500. For p = 500, on average we are 14.511 times faster than klaR::rda's grid model selection and 24.942 times faster than klaR::rda's (default) Nelder-Mead model selection.
Here are the results:
Timing results for p = 50
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    302.9344    318.9448    332.2021    327.6307    344.7057    377.9277    10     b
rda_nelder_mead(x = data$x, y = data$y)                     213.4164    217.7904    227.6632    225.4409    230.3000    265.1779    10     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  378.5714    395.3718    405.7496    402.4990    417.7557    436.0770    10     c
Timing results for p = 100
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    711.4289    720.7940    751.3559    730.7232    782.5405    841.2667    10     c
rda_nelder_mead(x = data$x, y = data$y)                     503.7918    513.5158    530.2411    533.8257    541.0169    554.6028    10     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  542.1044    555.7146    567.0436    562.0882    571.8840    614.2932    10     b
Timing results for p = 500
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    9280.9039   9811.4476   9938.9336   10002.6009  10125.9678  10576.8093  10     b
rda_nelder_mead(x = data$x, y = data$y)                     14464.8607  14971.4669  17083.8312  17317.1881  19138.0878  20228.6897  10     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  657.8144    671.9081    684.9347    674.3166    683.6875    734.3979    10     a
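(For reference, the quoted speedups are just ratios of the mean runtimes at p = 500:)

# Speedups at p = 500, computed from the mean column of the table above
9938.9336 / 684.9347   # rda_grid vs. hdrda_grid:        ~14.51x
17083.8312 / 684.9347  # rda_nelder_mead vs. hdrda_grid: ~24.94x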
I extended the small run to p = 3000 on my Mac and fit a polynomial to the average runtimes; a quadratic fit best in a rough analysis. Bottom line: p = 10K, as I originally set it, along with 100 reps is way too long. The model fit for rda_grid model selection at p = 10K suggests that a single rep would take 73.1 minutes, so 100 reps would take over 100 hours.
Below is the code with the results. I'll post a table of ETAs for various values of p afterwards.
library(dplyr)

# Mean model-selection runtimes (milliseconds) from the runs above
p <- c(50, 100, 500, 1000, 1500, 2000, 2500, 3000)
rda_grid <- c(332.2021, 751.3559, 9717.9715, 48016.703, 110732.902, 184367.979, 304004.983, 407329.452)
rda_nelder_mead <- c(227.6632, 530.2411, 15780.5566, 45167.128, 128249.991, 249724.798, 338583.316, 716525.601)
df <- data_frame(p, rda_grid, rda_nelder_mead)

# Quadratic fit of runtime as a function of the feature dimension p
lm_out <- lm(rda_grid ~ p + I(p^2), data=df)
summary(lm_out)
Call:
lm(formula = rda_grid ~ p + I(p^2), data = df)
Residuals:
1 2 3 4 5 6 7 8
2101 1690 -3704 -2650 1384 -5102 12977 -6695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.385e+03 5.241e+03 -0.455 0.668
p 1.018e+01 9.277e+00 1.097 0.323
I(p^2) 4.288e-02 3.089e-03 13.879 3.49e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7337 on 5 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9977
F-statistic: 1531 on 2 and 5 DF, p-value: 1.073e-07
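As a sanity check, evaluating the fitted quadratic at p = 10K reproduces the per-rep estimate quoted above:

# Predicted rda_grid model-selection time for a single rep at p = 10K
predict(lm_out, data_frame(p = 10000)) / 1000 / 60  # ~73.1 minutes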
Fit:
# Observed mean runtimes (points) with the fitted quadratic (line)
plot(p, rda_grid / 1000, ylab="Model Selection Time (seconds)")
lines(p, predict(lm_out, df) / 1000)
The goal now is to figure out the range of p. Obviously, the predicted runtimes will be off because the AWS instance will be a lot faster than my MBP, though not so much faster as to lead to a bad decision.
Here are the predicted runtimes and cumulative runtimes (in hours) for p from 500 to 10K for 50 reps.
> predicted_runtime <- data_frame(p=seq(500, 10000, by=500))
> predicted_runtime %>% mutate(
+ runtime=predict(lm_out, predicted_runtime) / 1000 / 60 / 60 * 50,
+ cumulative_runtime=cumsum(runtime)
+ )
# A tibble: 20 x 3
p runtime cumulative_runtime
<dbl> <dbl> <dbl>
1 500 0.1864171 0.1864171
2 1000 0.7037046 0.8901216
3 1500 1.5187408 2.4088624
4 2000 2.6315257 5.0403881
5 2500 4.0420592 9.0824473
6 3000 5.7503414 14.8327887
7 3500 7.7563723 22.5891610
8 4000 10.0601518 32.6493128
9 4500 12.6616801 45.3109929
10 5000 15.5609570 60.8719499
11 5500 18.7579825 79.6299324
12 6000 22.2527568 101.8826892
13 6500 26.0452797 127.9279689
14 7000 30.1355513 158.0635201
15 7500 34.5235715 192.5870917
16 8000 39.2093405 231.7964321
17 8500 44.1928580 275.9892902
18 9000 49.4741243 325.4634145
19 9500 55.0531393 380.5165537
20 10000 60.9299029 441.4464566
I'm expecting the AWS instance to be roughly twice as fast as the numbers above. With that in mind, p = 3000 seems reasonable to run today. If the results are compelling enough, maybe we'll extend it to p = 5000. Beyond p = 5000, the numbers get absurd.
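(If the assumed ~2x speedup holds, and that's a guess rather than a measurement, the ETA table above can be rescaled directly:)

# Rescale the predicted runtimes under the assumed ~2x AWS speedup
predicted_runtime %>% mutate(
  runtime_aws=predict(lm_out, predicted_runtime) / 1000 / 60 / 60 * 50 / 2,
  cumulative_runtime_aws=cumsum(runtime_aws)
)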
It turns out that klaR::rda's built-in model selection method (Nelder-Mead) is way too slow, and it's causing the comparisons to take far too long: p = 1000 alone took ~6 hours. And because it's so slow, we won't be comparing apples to apples.
I'm going to stop the current run and remove the Nelder-Mead method.
Here are the timing comparisons so far for the current run.
Timing results for p = 25
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    166.7981    173.4888    177.4155    177.4222    181.4293    187.4239    50     b
rda_nelder_mead(x = data$x, y = data$y)                     121.2069    123.9678    127.1385    125.5889    128.8486    144.1196    50     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  401.4286    406.2318    425.3230    411.5172    419.4768    640.6287    50     c
Timing results for p = 50
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    329.6158    333.4684    338.0469    336.9806    341.1416    352.9748    50     b
rda_nelder_mead(x = data$x, y = data$y)                     246.4574    248.5073    259.3514    250.3034    252.8912    462.3342    50     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  486.7384    494.8719    514.8512    497.7813    501.9355    713.9620    50     c
Timing results for p = 75
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    522.6468    532.2576    555.4924    540.1379    547.7288    749.6618    50     b
rda_nelder_mead(x = data$x, y = data$y)                     457.3512    463.4883    501.7416    466.3876    475.4662    680.7694    50     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  667.8398    678.2728    696.7978    681.6296    686.2834    883.9513    50     c
Timing results for p = 100
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    768.8455    774.9565    808.2714    779.6194    786.5972    990.6382    50     a
rda_nelder_mead(x = data$x, y = data$y)                     792.5481    799.3014    908.2929    986.4841    993.3737    1185.5456   50     b
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  768.2032    777.7098    831.4271    783.0207    970.0916    984.4053    50     a
Timing results for p = 200
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    2253.2383   2324.1852   2387.6815   2365.2438   2430.2737   2648.539    50     b
rda_nelder_mead(x = data$x, y = data$y)                     3759.9025   3973.2389   4058.2901   4007.4673   4170.6094   4388.560    50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  816.3057    825.2693    844.4118    832.7054    841.7421    1039.984    50     a
Timing results for p = 300
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    4437.3806   4683.521    4777.2619   4794.5154   4886.7757   5090.799    50     b
rda_nelder_mead(x = data$x, y = data$y)                     10563.5592  10742.756   10880.5716  10834.3666  11006.7033  11497.388   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  844.3956    857.224     870.0757    864.3084    872.6231    1076.736    50     a
Timing results for p = 400
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    7214.0599   7386.8681   7480.669    7457.9337   7569.9636   7757.649    50     b
rda_nelder_mead(x = data$x, y = data$y)                     22243.2571  22481.7749  22762.961   22719.9730  22914.2272  24108.446   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  879.0242    896.4008    912.978     903.8256    915.1915    1129.542    50     a
Timing results for p = 500
Unit: milliseconds
expr                                                        min          lq           mean         median       uq           max          neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    10931.1523   11132.5595   11320.5194   11289.37     11430.5520   11937.234    50     b
rda_nelder_mead(x = data$x, y = data$y)                     91627.8542   110765.0159  121049.9614  111228.56    123169.7214  239535.627   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  912.0222     929.1886     953.6447     947.63       959.7757     1203.488     50     a
Timing results for p = 600
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    15469.4857  16053.2008  16486.3637  16492.3145  16865.1229  17606.979   50     b
rda_nelder_mead(x = data$x, y = data$y)                     67700.0798  68387.1059  77334.8987  68840.2263  92768.8431  94225.259   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  925.7054    959.7151    986.7141    985.6813    999.2199    1203.994    50     a
Timing results for p = 700
Unit: milliseconds
expr                                                        min          lq           mean         median       uq           max          neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    20297.8502   22416.466    23248.295    23211.53     23857.322    25249.517    50     b
rda_nelder_mead(x = data$x, y = data$y)                     105023.0024  106006.785   125815.642   143516.62    144211.554   146250.634   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  964.7427     1000.499     1017.917     1017.72      1030.805     1090.096     50     a
Timing results for p = 800
Unit: milliseconds
expr                                                        min          lq           mean         median       uq           max          neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    28739.8793   30218.735    31671.361    31745.941    32765.859    34933.687    50     b
rda_nelder_mead(x = data$x, y = data$y)                     152367.4051  210230.413   201202.730   210705.453   211447.844   233604.329   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  984.7188     1031.325     1054.522     1055.294     1077.625     1191.814     50     a
Timing results for p = 900
Unit: seconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    36.403024   39.695899   42.278286   41.926580   43.643404   52.235457   50     b
rda_nelder_mead(x = data$x, y = data$y)                     295.438161  296.139261  304.318613  296.562225  297.306392  410.772806  50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  1.015093    1.038894    1.122133    1.106662    1.135982    1.734365    50     a
Timing results for p = 1000
Unit: seconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    40.114664   46.654238   47.978619   48.422800   49.598851   54.821087   50     b
rda_nelder_mead(x = data$x, y = data$y)                     398.166142  399.484582  413.853325  400.147146  401.033161  538.301514  50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  1.041241    1.094524    1.157984    1.159823    1.199198    1.457924    50     a
I ran the timing comparisons on an AWS instance, but I couldn't SSH into it. It turns out that my local SSH config was corrupted. After fixing that and rebooting, I logged in without a problem.
The results include up to p = 5000, where HDRDA is ~645x faster than RDA. :)
Quick and dirty linear model and fitted lines:
# Combine the microbenchmark results into one data frame: convert times from
# nanoseconds to seconds and keep only the p values that are multiples of 500
timings <- bind_rows(lapply(timing_results, as.data.frame), .id="p") %>%
  mutate(p=as.integer(p),
         expr=as.character(expr),
         Classifier=ifelse(startsWith(expr, "hdrda"), "HDRDA", "RDA"),
         time=time / 1e9) %>%
  select(-expr) %>%
  filter(p %% 500 == 0) %>%
  tbl_df
lm_out <- lm(time ~ poly(p, 2) * Classifier, data=timings)
summary(lm_out)
Call:
lm(formula = time ~ poly(p, 2) * Classifier, data = timings)
Residuals:
Min 1Q Median 3Q Max
-253.843 -0.590 -0.037 1.740 289.387
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.050e+00 1.762e+00 1.163 0.245
poly(p, 2)1 2.547e+01 7.881e+01 0.323 0.747
poly(p, 2)2 4.877e-01 7.881e+01 0.006 0.995
ClassifierRDA 5.796e+02 2.492e+00 232.552 <2e-16 ***
poly(p, 2)1:ClassifierRDA 2.119e+04 1.115e+02 190.143 <2e-16 ***
poly(p, 2)2:ClassifierRDA 4.764e+03 1.115e+02 42.741 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 55.73 on 1994 degrees of freedom
Multiple R-squared: 0.9849, Adjusted R-squared: 0.9849
F-statistic: 2.604e+04 on 5 and 1994 DF, p-value: < 2.2e-16
# Observed runtimes with fitted curves (red = RDA, blue = HDRDA)
plot(timings$p, timings$time, ylab="Model Selection Time (seconds)")
lines(unique(timings$p), predict(lm_out, data_frame(p=unique(timings$p), Classifier="RDA")), col="red")
lines(unique(timings$p), predict(lm_out, data_frame(p=unique(timings$p), Classifier="HDRDA")), col="blue")
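As an aside, the ~645x figure quoted earlier can be recovered from the same timings data frame; a sketch, assuming tidyr is available:

library(tidyr)

# Per-p speedup of HDRDA over RDA, as a ratio of mean runtimes
timings %>%
  group_by(p, Classifier) %>%
  summarize(mean_time=mean(time)) %>%
  spread(Classifier, mean_time) %>%
  mutate(speedup=RDA / HDRDA)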
I was curious whether the obvious heteroscedasticity was causing the HDRDA coefficients to be statistically insignificant, although the fit already appears pretty good with a residual standard error of 55.73.
It turns out that by weighting each observation by the inverse standard deviation at each pair of p and Classifier, the fit does improve, with a residual standard error of 5.404, and the HDRDA linear coefficient is now significant. The quadratic coefficient for HDRDA remains insignificant, which makes sense because HDRDA's runtime is linear in the number of features p.
The actual fit hardly changes when I zoom into the HDRDA points. Blue is OLS, Green is WLS.
Here's the code and summary for the WLS fit:
# Same preprocessing as above, plus WLS weights: the inverse of the runtime
# standard deviation within each (p, Classifier) group
timings <- bind_rows(lapply(timing_results, as.data.frame), .id="p") %>%
  mutate(p=as.integer(p),
         expr=as.character(expr),
         Classifier=ifelse(startsWith(expr, "hdrda"), "HDRDA", "RDA"),
         time=time / 1e9) %>%
  select(-expr) %>%
  filter(p %% 500 == 0) %>%
  tbl_df %>%
  group_by(p, Classifier) %>%
  mutate(wls_weights=1/sd(time)) %>%
  ungroup
lm_out <- lm(time ~ poly(p, 2) * Classifier, data=timings, weights=wls_weights)
summary(lm_out)
Call:
lm(formula = time ~ poly(p, 2) * Classifier, data = timings,
weights = wls_weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-19.7943 -0.6407 -0.1533 1.1144 22.2777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0362 0.1323 15.394 < 2e-16 ***
poly(p, 2)1 25.3263 6.2405 4.058 5.13e-05 ***
poly(p, 2)2 -0.6585 5.1516 -0.128 0.898
ClassifierRDA 579.4187 1.2934 447.983 < 2e-16 ***
poly(p, 2)1:ClassifierRDA 21193.8551 62.1210 341.171 < 2e-16 ***
poly(p, 2)2:ClassifierRDA 4815.3086 36.0825 133.453 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.404 on 1994 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9904
F-statistic: 4.104e+04 on 5 and 1994 DF, p-value: < 2.2e-16
Previously, I coded up a timing comparison of simdiag::hdrda vs. klaR::rda. We'll bring that back with a more exhaustive comparison and also include PenalizedLDA::PenalizedLDA from Witten and Tibshirani (2011). I'll also add the diagonal classifiers but will likely forgo including them in the paper; we'll need a remark that loosely justifies this. Frankly, the diagonal classifiers will be much faster because they are simpler, but at the cost of classification accuracy.
In the Witten and Tibshirani (2011) paper, they also perform a timing comparison with 4 populations, varying the feature dimension:
p=20
p=200
p=2000
p=20000
They perform the timing comparison over 25 repetitions and report the mean and standard deviation of the runtimes. We'll do something similar but with a lot more repetitions.
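A rough sketch of the planned setup, assuming hypothetical helpers generate_data(), rda_grid(), and hdrda_grid() that mirror the benchmark calls above (none of this is the final benchmark code):

library(microbenchmark)

# Sketch: time each classifier's model selection across feature dimensions,
# mirroring Witten and Tibshirani (2011) but with more repetitions.
# PenalizedLDA::PenalizedLDA would be added as another expression here.
feature_dims <- c(20, 200, 2000, 20000)
num_reps <- 50
grid_size <- 5

timing_results <- lapply(feature_dims, function(p) {
  data <- generate_data(n = 100, p = p)  # placeholder for the simulation setup
  microbenchmark(
    rda_grid(x = data$x, y = data$y, num_tuning = grid_size),
    hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size),
    times = num_reps
  )
})
names(timing_results) <- feature_dims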