quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Fix Bernoulli textmodel_nb #21

Closed kbenoit closed 4 years ago

kbenoit commented 4 years ago

Solves #3.

Improves efficiency and streamlines the code of textmodel_nb(), including a bug fix affecting the smoothing denominator when the number of classes was > 2.
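For context on where the smoothing constant enters, Bernoulli NB estimates the probability that a feature is present in a document of class k, with the smoothing constant added to both the numerator and (twice, once per binary outcome) the denominator. A minimal sketch of that estimation with illustrative names only, not the package's actual internals:

# sketch of Bernoulli NB presence probabilities with additive smoothing
# x: binary document-feature matrix; y: factor of class labels
bernoulli_probs <- function(x, y, smooth = 1) {
  sapply(levels(y), function(k) {
    xk <- x[y == k, , drop = FALSE]
    # P(feature present | class k): per-class counts, so the denominator
    # depends on the number of documents in class k, not the number of classes
    (colSums(xk) + smooth) / (nrow(xk) + 2 * smooth)
  })
}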

Also adds a vignette comparing performance, which will be expanded later to include other models (svmlin, for instance).

codecov-io commented 4 years ago

Codecov Report

Merging #21 into master will decrease coverage by 0.02%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #21      +/-   ##
==========================================
- Coverage   87.75%   87.73%   -0.02%     
==========================================
  Files          16       16              
  Lines        1600     1598       -2     
==========================================
- Hits         1404     1402       -2     
  Misses        196      196
Impacted Files      Coverage Δ
R/textmodel_nb.R    97.93% <100%> (-0.05%) ↓

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d0c23a7...207c717.

kbenoit commented 4 years ago

AppVeyor is failing, apparently because the naivebayes package was just updated and the Windows builds of 0.9.7 are not yet complete: https://cran.r-project.org/web/checks/check_results_naivebayes.html

So the test failures are false positives rather than problems with this PR.

majkamichal commented 4 years ago

Hi, as an apology for causing problems with the newest update of the naivebayes package :) (the change in handling of sparse matrices), I have some suggestions for further boosting the performance of your predict function:

# presence_prob <- newdata %*% t(present)
# tcrossprod(x, y) computes x %*% t(y) directly, without materializing t(y)
presence_prob <- tcrossprod(newdata, present)

# nonpresence_prob <- matrix(base::colSums(t(nonpresent)),
#                              nrow = nrow(presence_prob),
#                              ncol = ncol(presence_prob), byrow = TRUE) - newdata %*% t(nonpresent)

# note: colSums(t(nonpresent)) is equivalent to rowSums(nonpresent)
nonpresence_prob <- matrix(base::colSums(t(nonpresent)),
                           nrow = nrow(presence_prob),
                           ncol = ncol(presence_prob), byrow = TRUE) - tcrossprod(newdata, nonpresent)

Furthermore, instead of adding the log prior row-wise to the log-likelihood using apply() and then transposing, I would suggest a much more efficient and scalable solution:

# logpos <- t(apply(loglik, 1, "+", log(object$Pc)))

for (ith_col in seq_along(object$Pc))
  loglik[, ith_col] <- loglik[, ith_col] + log(object$Pc[ith_col])  # prior on the log scale, as in the original

# library(microbenchmark)
# rows <- 100000
# nclass <- 2
# M <- matrix(rnorm(rows * nclass), ncol = nclass, nrow = rows)
# 
# M_new <- matrix(nrow = rows, ncol = nclass)
# p <- seq_len(nclass) * 100
# 
# 
# microbenchmark::microbenchmark(
#   current = t(apply(M, 1, "+", p)),
#   proposed = for (i in seq_along(p)) M_new[,i] <- M[,i] + p[i]
# )
#
# Unit: milliseconds
#      expr     min       lq      mean   median      uq      max neval
#   current 67.8745 73.39565 80.387014 75.33025 77.6671 261.8406   100
#  proposed  2.3670  2.54470  3.280408  2.59770  2.7047  13.1867   100

Also, another apply() call can be replaced by the max.col() function:

# classpred <- colnames(logpos)[apply(logpos, 1, which.max)]
classpred <- colnames(loglik)[max.col(loglik, "first")] 

There is a danger here: if newdata had only one row and loglik were simplified to a vector during the matrix manipulations, max.col() would not work as intended. See max.col(c(1, 2, 3)).
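One defensive option is to re-wrap loglik as a matrix whenever subsetting has collapsed it to a vector; a minimal sketch, assuming the collapsed vector keeps the class names:

# guard: ensure loglik is a matrix before calling max.col()
if (!is.matrix(loglik))
  loglik <- matrix(loglik, nrow = 1, dimnames = list(NULL, names(loglik)))
classpred <- colnames(loglik)[max.col(loglik, "first")]

Alternatively, using drop = FALSE in every subsetting step avoids the collapse in the first place.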

Finally:

# base_lpl <- logpos[, j]
# result[, j] <- 1 / (1 + rowSums(exp(logpos[, -j, drop = FALSE] - base_lpl)))

base_lpl <- loglik[, j]
result[, j] <- 1 / (1 + rowSums(exp(loglik[, -j, drop = FALSE] - base_lpl)))

It could be possible to get rid of the result matrix, I suppose.
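For reference, this computes the posterior P(c_j | d) = exp(l_j) / Σ_k exp(l_k) in the numerically stable form 1 / (1 + Σ_{k≠j} exp(l_k − l_j)), which matters because raw exp(loglik) underflows for large negative log-likelihoods. A self-contained check with illustrative values:

# verify the stabilized posterior formula on toy log-likelihoods
loglik <- matrix(c(-1000, -1001, -1003,
                    -500,  -499,  -505), nrow = 2, byrow = TRUE)
post <- matrix(0, nrow(loglik), ncol(loglik))
for (j in seq_len(ncol(loglik))) {
  base_lpl <- loglik[, j]
  post[, j] <- 1 / (1 + rowSums(exp(loglik[, -j, drop = FALSE] - base_lpl)))
}
rowSums(post)  # each row sums to 1; naive exp(loglik) would underflow to 0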

Unfortunately, I didn't have time to check whether my suggestions would work out of the box when fully applied, as I am on the go.

I hope this helps!

Best, Michal

kbenoit commented 4 years ago

@majkamichal that’s awesome, thanks so much! I’ll get this integrated asap! 🙏🙏🙏

kbenoit commented 4 years ago

@majkamichal an even faster way to add the class priors to the rows of the log-likelihood matrix is to use column recycling on the transposed matrix:

library(microbenchmark)
rows <- 100000
nclass <- 2
M <- matrix(rnorm(rows * nclass), ncol = nclass, nrow = rows)
M_new <- matrix(nrow = rows, ncol = nclass)
p <- seq_len(nclass) * 100

microbenchmark::microbenchmark(
  current = t(apply(M, 1, "+", p)),
  proposed = {
    for (i in seq_along(p)) M_new[, i] <- M[, i] + p[i]
    M_new
  },
  mat = t(t(M) + p), # recycling adds p[i] to row i of t(M), i.e. to column i of M
  check = "equal"
)
## Unit: milliseconds
##      expr       min        lq       mean    median         uq        max neval
##   current 81.378103 92.044730 103.066693 96.044443 109.108772 157.696809   100
##  proposed  2.685612  3.117284   3.689511  3.253839   3.536924   9.458964   100
##       mat  1.138189  1.237222   1.554418  1.280898   1.406442   7.142249   100

majkamichal commented 4 years ago

@kbenoit Ahhh, now I remember why I decided to use a for-loop instead of transposition + recycling: I used the latter method in the general predict.naive_bayes() function (which has a different design) until I found a more efficient solution. The loop scales a bit better when there is a large number of rows as well as columns (classes), which is why I went for it in the specialized predict functions. However, most classification is done with 2 classes, where that advantage never shows. I will also switch to the transposition method. Thanks!
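For anyone curious, the scaling claim is easy to probe by rerunning the benchmark above with a larger number of classes (illustrative sizes; timings will vary by machine):

library(microbenchmark)
rows <- 100000
nclass <- 100  # many classes, where the loop reportedly catches up
M <- matrix(rnorm(rows * nclass), ncol = nclass, nrow = rows)
M_new <- matrix(nrow = rows, ncol = nclass)
p <- seq_len(nclass) * 100

microbenchmark::microbenchmark(
  loop = {
    for (i in seq_along(p)) M_new[, i] <- M[, i] + p[i]
    M_new
  },
  recycle = t(t(M) + p),
  check = "equal"
)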