Closed — kbenoit closed this pull request 4 years ago
Merging #21 into master will decrease coverage by 0.01%. The diff coverage is 100%.
@@ Coverage Diff @@
## master #21 +/- ##
==========================================
- Coverage 87.75% 87.73% -0.02%
==========================================
Files 16 16
Lines 1600 1598 -2
==========================================
- Hits 1404 1402 -2
Misses 196 196
Impacted Files | Coverage Δ | |
---|---|---|
R/textmodel_nb.R | 97.93% <100%> (-0.05%) | :arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0c23a7...207c717. Read the comment docs.
AppVeyor is apparently failing because this package was just updated, and the Windows builds of 0.9.7 are not yet complete: https://cran.r-project.org/web/checks/check_results_naivebayes.html
So the test failures are false positives.
Hi, as an apology for causing problems with the newest update of the naivebayes package :) (the handling of sparse matrices), I have some suggestions for further boosting the performance of your predict function:
# presence_prob <- newdata %*% t(present)
presence_prob <- tcrossprod(newdata, present)
# nonpresence_prob <- matrix(base::colSums(t(nonpresent)),
# nrow = nrow(presence_prob),
# ncol = ncol(presence_prob), byrow = TRUE) - newdata %*% t(nonpresent)
nonpresence_prob <- matrix(base::colSums(t(nonpresent)),
nrow = nrow(presence_prob),
ncol = ncol(presence_prob), byrow = TRUE) - tcrossprod(newdata, nonpresent)
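For anyone unfamiliar with tcrossprod(): it computes x %*% t(y) in one step, without materialising the transpose, and it dispatches to sparse methods for Matrix-package objects. A minimal sketch of the equivalence, using toy matrices rather than the actual newdata/present objects:

```r
# Toy matrices standing in for newdata (documents x features) and
# present (classes x features); not the package's actual objects.
set.seed(1)
newdata <- matrix(rnorm(12), nrow = 3, ncol = 4)
present <- matrix(rnorm(8),  nrow = 2, ncol = 4)

# tcrossprod(x, y) computes x %*% t(y) without forming t(y) explicitly.
all.equal(newdata %*% t(present), tcrossprod(newdata, present))  # TRUE
```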
Furthermore, instead of adding the log prior row-wise to the log-likelihood using apply and then transposing, I would suggest a much more efficient and scalable solution:
# logpos <- t(apply(loglik, 1, "+", log(object$Pc)))
for (ith_col in seq_along(object$Pc))
  loglik[, ith_col] <- loglik[, ith_col] + log(object$Pc[ith_col])
# library(microbenchmark)
# rows <- 100000
# nclass <- 2
# M <- matrix(rnorm(rows * nclass), ncol = nclass, nrow = rows)
#
# M_new <- matrix(nrow = rows, ncol = nclass)
# p <- seq_len(nclass) * 100
#
#
# microbenchmark::microbenchmark(
# current = t(apply(M, 1, "+", p)),
# proposed = for (i in seq_along(p)) M_new[,i] <- M[,i] + p[i]
# )
#
# Unit: milliseconds
# expr min lq mean median uq max neval
# current 67.8745 73.39565 80.387014 75.33025 77.6671 261.8406 100
# proposed 2.3670 2.54470 3.280408 2.59770 2.7047 13.1867 100
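A quick sanity check that the column-wise loop reproduces the apply-and-transpose version (toy loglik and priors, not the package's objects; note the priors enter on the log scale, as in the original apply line):

```r
# Hypothetical stand-ins for the package's objects.
set.seed(7)
loglik <- matrix(rnorm(8), nrow = 4, ncol = 2)
Pc <- c(0.4, 0.6)  # hypothetical class priors

# Original approach: add log priors row-wise via apply, then transpose.
old <- t(apply(loglik, 1, "+", log(Pc)))

# Proposed loop: add each class's log prior column by column.
new <- loglik
for (ith_col in seq_along(Pc))
  new[, ith_col] <- new[, ith_col] + log(Pc[ith_col])

all.equal(old, new)  # TRUE
```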
Also, another apply call can be replaced by the max.col function:
# classpred <- colnames(logpos)[apply(logpos, 1, which.max)]
classpred <- colnames(loglik)[max.col(loglik, "first")]
There is a danger here. If newdata has only one row and loglik is simplified to a vector during the matrix manipulations, then max.col will not work as intended; see max.col(c(1, 2, 3)).
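One defensive fix for the single-row case (a sketch, not code from the PR) is to keep the matrix class with drop = FALSE when subsetting:

```r
# A one-row log-posterior matrix, as could arise when predicting a
# single document (toy values).
loglik <- matrix(c(1, 2, 3), nrow = 1,
                 dimnames = list(NULL, c("a", "b", "c")))

# Default subsetting drops to a vector, and max.col() then coerces it
# to a 3 x 1 matrix: every "row" maximum is column 1.
v <- loglik[1, ]
max.col(v)                        # c(1, 1, 1) -- not what we want

# Keeping the matrix shape with drop = FALSE gives the intended result.
m <- loglik[1, , drop = FALSE]
colnames(m)[max.col(m, "first")]  # "c"
```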
Finally:
# base_lpl <- logpos[, j]
# result[, j] <- 1 / (1 + rowSums(exp(logpos[, -j, drop = FALSE] - base_lpl)))
base_lpl <- loglik[, j]
result[, j] <- 1 / (1 + rowSums(exp(loglik[, -j, drop = FALSE] - base_lpl)))
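For context: each column computed as 1 / (1 + rowSums(exp(loglik[, -j] - loglik[, j]))) is algebraically the softmax exp(l_j) / sum_k exp(l_k), arranged so that only differences of log-posteriors are exponentiated. A self-contained sketch with a toy matrix (not the package's actual logpos):

```r
# Toy log-posterior matrix: 2 documents, 3 classes.
set.seed(42)
loglik <- matrix(rnorm(6, sd = 3), nrow = 2,
                 dimnames = list(NULL, c("A", "B", "C")))

result <- matrix(NA_real_, nrow(loglik), ncol(loglik),
                 dimnames = dimnames(loglik))
for (j in seq_len(ncol(loglik))) {
  base_lpl <- loglik[, j]
  result[, j] <- 1 / (1 + rowSums(exp(loglik[, -j, drop = FALSE] - base_lpl)))
}

# Each row sums to 1, and the result matches a direct softmax
# (the direct form can overflow for large log-posteriors).
rowSums(result)
all.equal(result, exp(loglik) / rowSums(exp(loglik)))  # TRUE
```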
It might also be possible to get rid of the result matrix entirely, I suppose.
Unfortunately, I didn't have time to check whether my suggestions would work out of the box when fully applied, as I am on the go.
I hope this helps!
Best, Michal
@majkamichal that’s awesome, thanks so much! I’ll get this integrated asap! 🙏🙏🙏
@majkamichal even faster way to add the class priors to the rows of the log likelihoods matrix is to use column recycling on the transposed matrix:
library(microbenchmark)
rows <- 100000
nclass <- 2
M <- matrix(rnorm(rows * nclass), ncol = nclass, nrow = rows)
M_new <- matrix(nrow = rows, ncol = nclass)
p <- seq_len(nclass) * 100
microbenchmark::microbenchmark(
current = t(apply(M, 1, "+", p)),
proposed = {
for (i in seq_along(p)) M_new[, i] <- M[, i] + p[i]
M_new
},
mat = t(t(M) + p), # uses column recycling
check = "equal"
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## current 81.378103 92.044730 103.066693 96.044443 109.108772 157.696809 100
## proposed 2.685612 3.117284 3.689511 3.253839 3.536924 9.458964 100
## mat 1.138189 1.237222 1.554418 1.280898 1.406442 7.142249 100
@kbenoit Ahhh, now I remember why I decided to use a for-loop instead of transposition + recycling (I used the latter method in the general predict.naive_bayes function, which has a different design, until I found a more efficient solution). The loop scales a bit better when there is a large number of rows as well as columns (classes), which is why I went for it in the specialized predict functions. However, most classification is done with 2 classes, so the advantage is rarely seen. I will also switch to the transposition method. Thanks!
Solves #3.
Improves efficiency and streamlines the code of textmodel_nb(), including a bug fix affecting the smoothing denominator when the number of classes was > 2. Also adds a vignette that compares performance, which will be expanded later to include other models (svmlin, for instance).