A note about klaR::rda:
In my original comparison, I manually performed model selection over a grid and ignored the model's internal model selection, which uses Nelder-Mead by default (simulated annealing is an option). It makes sense to do both.
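For concreteness, here's a minimal sketch of the two strategies (hypothetical helper names; not the actual benchmark code). With klaR::rda, leaving gamma and lambda unspecified triggers the internal optimizer; I'm assuming the cross-validated error is exposed via the fitted object's error.rate component.

library(klaR)

# Sketch only: manual grid search over RDA's tuning parameters (gamma, lambda).
# Assumes the fitted object reports its cross-validated error in $error.rate.
rda_grid_sketch <- function(x, y, num_tuning = 5) {
  tuning <- expand.grid(gamma = seq(0, 1, length.out = num_tuning),
                        lambda = seq(0, 1, length.out = num_tuning))
  cv_error <- mapply(function(g, l) {
    fit <- klaR::rda(x, grouping = y, gamma = g, lambda = l, crossval = TRUE)
    fit$error.rate["crossval"]
  }, tuning$gamma, tuning$lambda)
  tuning[which.min(cv_error), ]
}

# Built-in model selection: gamma/lambda left unspecified, so klaR::rda
# optimizes them internally (Nelder-Mead by default; simAnn = TRUE switches
# to simulated annealing)
rda_builtin_sketch <- function(x, y) {
  klaR::rda(x, grouping = y)
}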
I did a test run for 10 reps with p = 50, 100, 500. For p = 500, on average we are 14.511 times faster than klaR::rda's grid model selection and 24.942 times faster than klaR::rda's (default) Nelder-Mead model selection.
Here are the results:
Timing results for p = 50
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    302.9344    318.9448    332.2021    327.6307    344.7057    377.9277    10     b
rda_nelder_mead(x = data$x, y = data$y)                     213.4164    217.7904    227.6632    225.4409    230.3000    265.1779    10     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  378.5714    395.3718    405.7496    402.4990    417.7557    436.0770    10     c
Timing results for p = 100
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    711.4289    720.7940    751.3559    730.7232    782.5405    841.2667    10     c
rda_nelder_mead(x = data$x, y = data$y)                     503.7918    513.5158    530.2411    533.8257    541.0169    554.6028    10     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  542.1044    555.7146    567.0436    562.0882    571.8840    614.2932    10     b
Timing results for p = 500
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    9280.9039   9811.4476   9938.9336   10002.6009  10125.9678  10576.8093  10     b
rda_nelder_mead(x = data$x, y = data$y)                     14464.8607  14971.4669  17083.8312  17317.1881  19138.0878  20228.6897  10     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  657.8144    671.9081    684.9347    674.3166    683.6875    734.3979    10     a
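(For reference, the quoted speedups are just ratios of the mean runtimes at p = 500:)

# Speedups at p = 500, computed from the mean column of the table above
9938.9336 / 684.9347   # rda_grid vs. hdrda_grid:        ~14.51x
17083.8312 / 684.9347  # rda_nelder_mead vs. hdrda_grid: ~24.94x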
I extended the small run to p = 3000 on my Mac and fit a polynomial to the average runtimes; a quadratic fit best in a rough analysis. Bottom line: p = 10K, as I originally set it, along with 100 reps is way too long. The model fit for rda_grid model selection at p = 10K suggests that a single rep would take 73.1 minutes, so 100 reps would take over 100 hours.
Below is the code with the results. I'll post a table of ETAs for various values of p afterwards.
library(dplyr)

# Mean model-selection runtimes (milliseconds) from the runs above
p <- c(50, 100, 500, 1000, 1500, 2000, 2500, 3000)
rda_grid <- c(332.2021, 751.3559, 9717.9715, 48016.703, 110732.902, 184367.979, 304004.983, 407329.452)
rda_nelder_mead <- c(227.6632, 530.2411, 15780.5566, 45167.128, 128249.991, 249724.798, 338583.316, 716525.601)
df <- data_frame(p, rda_grid, rda_nelder_mead)

# Quadratic fit of runtime as a function of the feature dimension p
lm_out <- lm(rda_grid ~ p + I(p^2), data=df)
summary(lm_out)
Call:
lm(formula = rda_grid ~ p + I(p^2), data = df)
Residuals:
1 2 3 4 5 6 7 8
2101 1690 -3704 -2650 1384 -5102 12977 -6695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.385e+03 5.241e+03 -0.455 0.668
p 1.018e+01 9.277e+00 1.097 0.323
I(p^2) 4.288e-02 3.089e-03 13.879 3.49e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7337 on 5 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9977
F-statistic: 1531 on 2 and 5 DF, p-value: 1.073e-07
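As a sanity check, evaluating the fitted quadratic at p = 10K reproduces the per-rep estimate quoted above:

# Predicted rda_grid model-selection time for a single rep at p = 10K
predict(lm_out, data_frame(p = 10000)) / 1000 / 60  # ~73.1 minutes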
Fit:
# Observed mean runtimes (points) with the fitted quadratic (line)
plot(p, rda_grid / 1000, ylab="Model Selection Time (seconds)")
lines(p, predict(lm_out, df) / 1000)
The goal now is to figure out the range of p. Obviously, the predicted runtimes will be off because the AWS instance will be a lot faster than my MBP, though not so much faster as to lead to a bad decision.
Here are the predicted runtimes and cumulative runtimes (in hours) for p from 500 to 10K for 50 reps.
> predicted_runtime <- data_frame(p=seq(500, 10000, by=500))
> predicted_runtime %>% mutate(
+ runtime=predict(lm_out, predicted_runtime) / 1000 / 60 / 60 * 50,
+ cumulative_runtime=cumsum(runtime)
+ )
# A tibble: 20 x 3
p runtime cumulative_runtime
<dbl> <dbl> <dbl>
1 500 0.1864171 0.1864171
2 1000 0.7037046 0.8901216
3 1500 1.5187408 2.4088624
4 2000 2.6315257 5.0403881
5 2500 4.0420592 9.0824473
6 3000 5.7503414 14.8327887
7 3500 7.7563723 22.5891610
8 4000 10.0601518 32.6493128
9 4500 12.6616801 45.3109929
10 5000 15.5609570 60.8719499
11 5500 18.7579825 79.6299324
12 6000 22.2527568 101.8826892
13 6500 26.0452797 127.9279689
14 7000 30.1355513 158.0635201
15 7500 34.5235715 192.5870917
16 8000 39.2093405 231.7964321
17 8500 44.1928580 275.9892902
18 9000 49.4741243 325.4634145
19 9500 55.0531393 380.5165537
20 10000 60.9299029 441.4464566
I'm expecting the AWS instance to be roughly twice as fast as the numbers above. With that in mind, p = 3000 seems reasonable to run today. If the results are compelling enough, maybe we'll extend it to p = 5000. Beyond p = 5000, the numbers get absurd.
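(If the assumed ~2x speedup holds, and that's a guess rather than a measurement, the ETA table above can be rescaled directly:)

# Rescale the predicted runtimes under the assumed ~2x AWS speedup
predicted_runtime %>% mutate(
  runtime_aws=predict(lm_out, predicted_runtime) / 1000 / 60 / 60 * 50 / 2,
  cumulative_runtime_aws=cumsum(runtime_aws)
)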
It turns out that klaR::rda's built-in model selection method (Nelder-Mead) is way too slow, and it's causing the comparisons to take far too long: p = 1000 alone took ~6 hours. And because it's so slow, we won't be comparing apples to apples.
I'm going to stop the current run and remove the Nelder-Mead method.
Here are the timing comparisons so far for the current run.
Timing results for p = 25
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    166.7981    173.4888    177.4155    177.4222    181.4293    187.4239    50     b
rda_nelder_mead(x = data$x, y = data$y)                     121.2069    123.9678    127.1385    125.5889    128.8486    144.1196    50     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  401.4286    406.2318    425.3230    411.5172    419.4768    640.6287    50     c
Timing results for p = 50
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    329.6158    333.4684    338.0469    336.9806    341.1416    352.9748    50     b
rda_nelder_mead(x = data$x, y = data$y)                     246.4574    248.5073    259.3514    250.3034    252.8912    462.3342    50     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  486.7384    494.8719    514.8512    497.7813    501.9355    713.9620    50     c
Timing results for p = 75
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    522.6468    532.2576    555.4924    540.1379    547.7288    749.6618    50     b
rda_nelder_mead(x = data$x, y = data$y)                     457.3512    463.4883    501.7416    466.3876    475.4662    680.7694    50     a
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  667.8398    678.2728    696.7978    681.6296    686.2834    883.9513    50     c
Timing results for p = 100
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    768.8455    774.9565    808.2714    779.6194    786.5972    990.6382    50     a
rda_nelder_mead(x = data$x, y = data$y)                     792.5481    799.3014    908.2929    986.4841    993.3737    1185.5456   50     b
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  768.2032    777.7098    831.4271    783.0207    970.0916    984.4053    50     a
Timing results for p = 200
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    2253.2383   2324.1852   2387.6815   2365.2438   2430.2737   2648.539    50     b
rda_nelder_mead(x = data$x, y = data$y)                     3759.9025   3973.2389   4058.2901   4007.4673   4170.6094   4388.560    50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  816.3057    825.2693    844.4118    832.7054    841.7421    1039.984    50     a
Timing results for p = 300
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    4437.3806   4683.521    4777.2619   4794.5154   4886.7757   5090.799    50     b
rda_nelder_mead(x = data$x, y = data$y)                     10563.5592  10742.756   10880.5716  10834.3666  11006.7033  11497.388   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  844.3956    857.224     870.0757    864.3084    872.6231    1076.736    50     a
Timing results for p = 400
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    7214.0599   7386.8681   7480.669    7457.9337   7569.9636   7757.649    50     b
rda_nelder_mead(x = data$x, y = data$y)                     22243.2571  22481.7749  22762.961   22719.9730  22914.2272  24108.446   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  879.0242    896.4008    912.978     903.8256    915.1915    1129.542    50     a
Timing results for p = 500
Unit: milliseconds
expr                                                        min          lq           mean         median       uq           max          neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    10931.1523   11132.5595   11320.5194   11289.37     11430.5520   11937.234    50     b
rda_nelder_mead(x = data$x, y = data$y)                     91627.8542   110765.0159  121049.9614  111228.56    123169.7214  239535.627   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  912.0222     929.1886     953.6447     947.63       959.7757     1203.488     50     a
Timing results for p = 600
Unit: milliseconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    15469.4857  16053.2008  16486.3637  16492.3145  16865.1229  17606.979   50     b
rda_nelder_mead(x = data$x, y = data$y)                     67700.0798  68387.1059  77334.8987  68840.2263  92768.8431  94225.259   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  925.7054    959.7151    986.7141    985.6813    999.2199    1203.994    50     a
Timing results for p = 700
Unit: milliseconds
expr                                                        min          lq           mean         median       uq           max          neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    20297.8502   22416.466    23248.295    23211.53     23857.322    25249.517    50     b
rda_nelder_mead(x = data$x, y = data$y)                     105023.0024  106006.785   125815.642   143516.62    144211.554   146250.634   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  964.7427     1000.499     1017.917     1017.72      1030.805     1090.096     50     a
Timing results for p = 800
Unit: milliseconds
expr                                                        min          lq           mean         median       uq           max          neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    28739.8793   30218.735    31671.361    31745.941    32765.859    34933.687    50     b
rda_nelder_mead(x = data$x, y = data$y)                     152367.4051  210230.413   201202.730   210705.453   211447.844   233604.329   50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  984.7188     1031.325     1054.522     1055.294     1077.625     1191.814     50     a
Timing results for p = 900
Unit: seconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    36.403024   39.695899   42.278286   41.926580   43.643404   52.235457   50     b
rda_nelder_mead(x = data$x, y = data$y)                     295.438161  296.139261  304.318613  296.562225  297.306392  410.772806  50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  1.015093    1.038894    1.122133    1.106662    1.135982    1.734365    50     a
Timing results for p = 1000
Unit: seconds
expr                                                        min         lq          mean        median      uq          max         neval  cld
rda_grid(x = data$x, y = data$y, num_tuning = grid_size)    40.114664   46.654238   47.978619   48.422800   49.598851   54.821087   50     b
rda_nelder_mead(x = data$x, y = data$y)                     398.166142  399.484582  413.853325  400.147146  401.033161  538.301514  50     c
hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size)  1.041241    1.094524    1.157984    1.159823    1.199198    1.457924    50     a
I ran the timing comparisons on an AWS instance, but I couldn't SSH into it. It turns out that my local SSH config was corrupted. After fixing that and rebooting, I logged in without a problem.
The results include up to p = 5000, where HDRDA is ~645x faster than RDA. :)
Quick and dirty linear model and fitted lines:
# Combine the microbenchmark results into one data frame: convert times from
# nanoseconds to seconds and keep only the p values that are multiples of 500
timings <- bind_rows(lapply(timing_results, as.data.frame), .id="p") %>%
  mutate(p=as.integer(p),
         expr=as.character(expr),
         Classifier=ifelse(startsWith(expr, "hdrda"), "HDRDA", "RDA"),
         time=time / 1e9) %>%
  select(-expr) %>%
  filter(p %% 500 == 0) %>%
  tbl_df
lm_out <- lm(time ~ poly(p, 2) * Classifier, data=timings)
summary(lm_out)
Call:
lm(formula = time ~ poly(p, 2) * Classifier, data = timings)
Residuals:
Min 1Q Median 3Q Max
-253.843 -0.590 -0.037 1.740 289.387
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.050e+00 1.762e+00 1.163 0.245
poly(p, 2)1 2.547e+01 7.881e+01 0.323 0.747
poly(p, 2)2 4.877e-01 7.881e+01 0.006 0.995
ClassifierRDA 5.796e+02 2.492e+00 232.552 <2e-16 ***
poly(p, 2)1:ClassifierRDA 2.119e+04 1.115e+02 190.143 <2e-16 ***
poly(p, 2)2:ClassifierRDA 4.764e+03 1.115e+02 42.741 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 55.73 on 1994 degrees of freedom
Multiple R-squared: 0.9849, Adjusted R-squared: 0.9849
F-statistic: 2.604e+04 on 5 and 1994 DF, p-value: < 2.2e-16
# Observed runtimes with fitted curves (red = RDA, blue = HDRDA)
plot(timings$p, timings$time, ylab="Model Selection Time (seconds)")
lines(unique(timings$p), predict(lm_out, data_frame(p=unique(timings$p), Classifier="RDA")), col="red")
lines(unique(timings$p), predict(lm_out, data_frame(p=unique(timings$p), Classifier="HDRDA")), col="blue")
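As an aside, the ~645x figure quoted earlier can be recovered from the same timings data frame; a sketch, assuming tidyr is available:

library(tidyr)

# Per-p speedup of HDRDA over RDA, as a ratio of mean runtimes
timings %>%
  group_by(p, Classifier) %>%
  summarize(mean_time=mean(time)) %>%
  spread(Classifier, mean_time) %>%
  mutate(speedup=RDA / HDRDA)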
I was curious whether the obvious heteroscedasticity was causing the HDRDA coefficients to be statistically insignificant, although the fit already appears pretty good with a residual standard error of 55.73.
It turns out that by weighting each observation by the inverse standard deviation at each pair of p and Classifier, the fit does improve, with a residual standard error of 5.404, and the HDRDA linear coefficient is now significant. The quadratic coefficient for HDRDA remains insignificant, which makes sense because HDRDA's runtime is linear in the number of features p.
The actual fit hardly changes when I zoom into the HDRDA points. Blue is OLS, Green is WLS.
Here's the code and summary for the WLS fit:
# Same preprocessing as above, plus WLS weights: the inverse of the runtime
# standard deviation within each (p, Classifier) group
timings <- bind_rows(lapply(timing_results, as.data.frame), .id="p") %>%
  mutate(p=as.integer(p),
         expr=as.character(expr),
         Classifier=ifelse(startsWith(expr, "hdrda"), "HDRDA", "RDA"),
         time=time / 1e9) %>%
  select(-expr) %>%
  filter(p %% 500 == 0) %>%
  tbl_df %>%
  group_by(p, Classifier) %>%
  mutate(wls_weights=1/sd(time)) %>%
  ungroup
lm_out <- lm(time ~ poly(p, 2) * Classifier, data=timings, weights=wls_weights)
summary(lm_out)
Call:
lm(formula = time ~ poly(p, 2) * Classifier, data = timings,
weights = wls_weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-19.7943 -0.6407 -0.1533 1.1144 22.2777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0362 0.1323 15.394 < 2e-16 ***
poly(p, 2)1 25.3263 6.2405 4.058 5.13e-05 ***
poly(p, 2)2 -0.6585 5.1516 -0.128 0.898
ClassifierRDA 579.4187 1.2934 447.983 < 2e-16 ***
poly(p, 2)1:ClassifierRDA 21193.8551 62.1210 341.171 < 2e-16 ***
poly(p, 2)2:ClassifierRDA 4815.3086 36.0825 133.453 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.404 on 1994 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9904
F-statistic: 4.104e+04 on 5 and 1994 DF, p-value: < 2.2e-16
Previously, I coded up a timing comparison of simdiag::hdrda vs. klaR::rda. We'll bring that back with a more exhaustive comparison and also include PenalizedLDA::PenalizedLDA from Witten and Tibshirani (2011). I'll also add the diagonal classifiers but will likely forgo including them in the paper; we'll need a remark that loosely justifies this. Frankly, the diagonal classifiers will be much faster because they are simpler, but at the cost of classification accuracy.
In the Witten and Tibshirani (2011) paper, they also perform a timing comparison with 4 populations, varying the feature dimension:
p=20
p=200
p=2000
p=20000
They perform the timing comparison over 25 repetitions and report the mean and standard deviation of the runtimes. We'll do something similar but with a lot more repetitions.
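A rough sketch of the planned setup, assuming hypothetical helpers generate_data(), rda_grid(), and hdrda_grid() that mirror the benchmark calls above (none of this is the final benchmark code):

library(microbenchmark)

# Sketch: time each classifier's model selection across feature dimensions,
# mirroring Witten and Tibshirani (2011) but with more repetitions.
# PenalizedLDA::PenalizedLDA would be added as another expression here.
feature_dims <- c(20, 200, 2000, 20000)
num_reps <- 50
grid_size <- 5

timing_results <- lapply(feature_dims, function(p) {
  data <- generate_data(n = 100, p = p)  # placeholder for the simulation setup
  microbenchmark(
    rda_grid(x = data$x, y = data$y, num_tuning = grid_size),
    hdrda_grid(x = data$x, y = data$y, num_tuning = grid_size),
    times = num_reps
  )
})
names(timing_results) <- feature_dims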