Closed by ramhiser 9 years ago
In some HDRDA sims, I noticed that the prediction of a large number of test observations was painfully slow after model selection. The model selection itself is quite fast, so it is counterintuitive that the seemingly easier prediction step is the bottleneck.
Reproducible example from sims:
library(sparsediscrim)

sample_sizes <- rep(25, 3)
test_sizes <- rep(10000, 3)
i <- 1
p <- 100
mu <- 0.5
rho <- rep(0.1, 3)
contamination_prob <- 0
block_size <- 100
# NOTE: the original snippet uses uncontaminated_var and contaminated_var
# without defining them; the values below are assumed placeholders.
uncontaminated_var <- 1
contaminated_var <- 100
set.seed(i)
mu1 <- rep(0, p)
mu2 <- c(rep(mu, block_size), rep(0, p - block_size))
mu3 <- -mu2
num_blocks <- p / block_size
# generate_contaminated() comes from the sims code that produced this example
train_data <- generate_contaminated(n=sample_sizes,
                                    mu=cbind(mu1, mu2, mu3),
                                    block_size=block_size,
                                    num_blocks=num_blocks,
                                    rho=rho,
                                    uncontaminated_var=uncontaminated_var,
                                    contaminated_var=contaminated_var,
                                    contamination_prob=contamination_prob)
test_data <- generate_contaminated(n=test_sizes,
                                   mu=cbind(mu1, mu2, mu3),
                                   block_size=block_size,
                                   num_blocks=num_blocks,
                                   rho=rho,
                                   uncontaminated_var=uncontaminated_var,
                                   contaminated_var=contaminated_var,
                                   contamination_prob=contamination_prob)
train_x <- train_data$x
train_y <- train_data$y
test_x <- test_data$x
test_y <- test_data$y

num_classes <- nlevels(train_y)
prior_probs <- rep(1, num_classes) / num_classes

hdrda_ridge_errors <- try({
  cv_out <- hdrda_cv(x = train_x,
                     y = train_y,
                     prior = prior_probs)
  hdrda_ridge <- list(lambda = cv_out$lambda, gamma = cv_out$gamma)
  mean(predict(cv_out, test_x)$class != test_y)
})
It turns out the culprit is in sparsediscrim:::predict.hdrda. The line quad_forms <- diag(quadform(class_est$W_inv, U1_x)) is slow because a 30,000 x 30,000 matrix is constructed before the diagonal elements are extracted, where 30,000 is the number of test observations.
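To see why this is wasteful: only the diagonal of t(U1_x) %*% W_inv %*% U1_x is needed, and each diagonal entry depends on a single column of U1_x. A minimal sketch of the two computations, using random stand-ins for class_est$W_inv and U1_x (smaller column count so the slow version fits in memory) and assuming quadform(A, X) computes t(X) %*% A %*% X:

# A and X are random stand-ins for class_est$W_inv and U1_x
set.seed(42)
A <- crossprod(matrix(rnorm(72^2), 72, 72))  # 72 x 72, stands in for W_inv
X <- matrix(rnorm(72 * 3000), 72, 3000)      # columns are test observations

# Slow: materializes a 3000 x 3000 matrix, then discards all but the diagonal
quad_slow <- diag(t(X) %*% A %*% X)

# Fast: computes only the diagonal entries
quad_fast <- colSums(X * (A %*% X))

all.equal(quad_slow, quad_fast)  # TRUE, up to floating-point error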
When stepping through predict.hdrda, I computed the following:
Browse[2]> dim(U1_x)
[1] 72 30000
Browse[2]> dim(class_est$W_inv)
[1] 72 72
Browse[2]> quad_forms <- diag(quadform(class_est$W_inv, U1_x))
Browse[2]> foo <- apply(U1_x, 2, function(z) quadform(class_est$W_inv, z))
Browse[2]> summary(quad_forms - foo)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-2.132e-14 -1.776e-15  0.000e+00 -4.460e-18  1.776e-15  3.553e-14
Practically speaking, there is no difference between computing the quadratic form of the entire matrix vs. computing it for each column vector in turn; the differences above are on the order of machine epsilon. The per-vector approach is substantially faster, though. While stepping through, I ran a quick benchmark:
library(microbenchmark)
Browse[2]> microbenchmark(matrix=diag(quadform(class_est$W_inv, U1_x)), vec=apply(U1_x, 2, function(z) quadform(class_est$W_inv, z)), times=20)
Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
 matrix 3924.7077 4400.3033 4499.3932 4433.4784 4668.4003 4856.3770    20   b
    vec  267.7087  278.6358  302.1803  300.0553  316.4827  369.2671    20  a
The vector approach is roughly 15 times faster.
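An alternative that avoids both the huge intermediate matrix and the apply() loop is a colSums-based computation. One possible fix (a sketch, not a tested patch against sparsediscrim) would replace the offending line with:

# Hypothetical replacement for the line in predict.hdrda: computes the
# diagonal of t(U1_x) %*% W_inv %*% U1_x without forming the n x n matrix
# or looping over columns.
quad_forms <- colSums(U1_x * (class_est$W_inv %*% U1_x))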