quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

textmodel_NB(x, distribution = "Bernoulli") possible error #780

Closed kbenoit closed 7 years ago

kbenoit commented 7 years ago

See PR #777 and comments.

kbenoit commented 7 years ago

@thomasd2 Further discussion should take place here in this issue, so I am resuming the thread from #777 here.

Multinomial

To replicate Example 13.1 from Introduction to Information Retrieval:

With smoothing, as per the book:

txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
(trainingset <- dfm(txt, tolower = FALSE))
## Document-feature matrix of: 5 documents, 6 features (60% sparse).
## 5 x 6 sparse Matrix of class "dfmSparse"
##     features
## docs Chinese Beijing Shanghai Macao Tokyo Japan
##   d1       2       1        0     0     0     0
##   d2       2       0        1     0     0     0
##   d3       1       0        0     1     0     0
##   d4       1       0        0     0     1     1
##   d5       3       0        0     0     1     1

# fit the model
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
(nb.p261 <- textmodel_NB(trainingset, trainingclass, prior = "docfreq"))

# test for results from p261
identical(nb.p261$PwGc["Y", "Chinese"], 3/7)
## [1] TRUE
identical(nb.p261$PwGc["Y", "Tokyo"], 1/14)
## [1] TRUE
identical(nb.p261$PwGc["Y", "Japan"], 1/14)
## [1] TRUE
identical(nb.p261$PwGc["N", "Chinese"], 2/9)
## [1] TRUE
identical(nb.p261$PwGc["N", "Tokyo"], 2/9)
## [1] TRUE
identical(nb.p261$PwGc["N", "Japan"], 2/9)
## [1] TRUE

Without smoothing, as per your example:

## without smoothing
(nb.p261.nosmooth <- textmodel_NB(trainingset, trainingclass, prior = "docfreq", smooth = 0))
# Fitted Naive Bayes model:
# Call:
#         textmodel_NB(x = trainingset, y = trainingclass, smooth = 0, 
#     prior = "docfreq")
# 
# 
# Training classes and priors:
#    Y    N 
# 0.75 0.25 
# 
#            Likelihoods:       Class Posteriors:
# 6 x 4 Matrix of class "dgeMatrix"
#              Y         N         Y         N
# Chinese  0.625 0.3333333 0.8490566 0.1509434
# Beijing  0.125 0.0000000 1.0000000 0.0000000
# Shanghai 0.125 0.0000000 1.0000000 0.0000000
# Macao    0.125 0.0000000 1.0000000 0.0000000
# Tokyo    0.000 0.3333333 0.0000000 1.0000000
# Japan    0.000 0.3333333 0.0000000 1.0000000

# test for results from p261
identical(nb.p261.nosmooth$PwGc["Y", "Chinese"], 5/8)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["Y", "Tokyo"], 0/8)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["Y", "Japan"], 0/8)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["N", "Chinese"], 1/3)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["N", "Tokyo"], 1/3)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["N", "Japan"], 1/3)
## [1] TRUE

Bernoulli version

Here, the code simply converts each count to 1 if it is greater than 0 (and leaves it 0 otherwise), then runs the multinomial model. There is no worked example, but the math should be the same for the matrix of 0s and 1s.
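
As an illustration, here is a minimal base-R sketch of that conversion applied to the trainingset dfm from the example above; it reproduces the binarized matrix, but it is not the package's internal code:

# binarize the counts: every positive count becomes 1, zeros stay 0
bin <- as.matrix(trainingset)
bin[bin > 0] <- 1
bin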

kbenoit commented 7 years ago

For the terms not worked through in the IIR text, such as "Beijing", which @thomasd2 referred to in #777, the computation is not 4/26 but rather 2/14 = 1/7 with smoothing and 1/8 without smoothing. This is computed correctly:

# Beijing: Pr(Beijing|c) = (1 + 1)/(8 + 6) = 1/7 (smoothed)
identical(nb.p261$PwGc["Y", "Beijing"], 1/7)
## [1] TRUE
# Beijing: Pr(Beijing|c) = (1 + 0)/(8 + 0) = 1/8 (unsmoothed)
identical(nb.p261.nosmooth$PwGc["Y", "Beijing"], 1/8)
## [1] TRUE
thomasd2 commented 7 years ago

What I don't understand is why you add six in the smoothing case. Smoothing simply adds one to each cell, doesn't it? Don't we want the share of words that are "Beijing" across all documents (3) of the yes class:

Pr(Beijing|c) = ((1+1) + (0+1) + (0+1)) / (8 + 3*6) = 4/26

kbenoit commented 7 years ago

You're adding one for each word type (6), which increases the denominator used for each likelihood by 6, while you add only 1 to each numerator (one additional observation per word, per class).

# number of unique word types in the training set
length(unique(featnames(trainingset[1:4, ])))
## [1] 6
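
To make the arithmetic concrete, the same add-one computation can be done by hand from the training counts; this sketch only reproduces the logic described above and is not the internal code of textmodel_NB():

# class Y token counts (documents d1-d3), the smoothing denominator, and the likelihoods
y_counts <- colSums(as.matrix(trainingset)[1:3, ])
sum(y_counts)                                      # 8 tokens in class Y
V <- length(featnames(trainingset))                # 6 word types
(y_counts["Beijing"] + 1) / (sum(y_counts) + V)    # (1 + 1)/(8 + 6) = 1/7
(y_counts["Chinese"] + 1) / (sum(y_counts) + V)    # (5 + 1)/(8 + 6) = 3/7
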
kbenoit commented 7 years ago

There's another good worked example in Jurafsky and Martin 2nd ed. if you want to work through it.

thomasd2 commented 7 years ago

Ok, got it. Apologies - my intuition had been that smoothing adds one to each cell of the dfm and then calculates the likelihood.

kbenoit commented 7 years ago

dfm_smooth() does, but textmodel_NB() collapses the dfm by training class group first, and then does the addition.
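
The difference can be seen in a quick base-R sketch on the example dfm; this is only illustrative and not the package's actual code path:

# collapse the four training documents by class first, then add one
m <- as.matrix(trainingset)[1:4, ]
grouped <- rbind(Y = colSums(m[1:3, ]), N = m[4, ])
PwGc <- (grouped + 1) / rowSums(grouped + 1)
PwGc["Y", "Beijing"]             # 1/7, as in textmodel_NB()

# versus adding one to every cell of the dfm first, then collapsing by class
smoothed_cells <- m + 1
grouped2 <- rbind(Y = colSums(smoothed_cells[1:3, ]), N = smoothed_cells[4, ])
PwGc2 <- grouped2 / rowSums(grouped2)
PwGc2["Y", "Beijing"]            # 4/26, the figure computed above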

thomasd2 commented 7 years ago

OK on the multinomial case. But did I understand correctly that you also closed the Bernoulli issue? (I re-installed tonight from GitHub, version 0.9.9.68.)

# fit quanteda's Bernoulli Naive Bayes, without and with smoothing
nb.bern.nosmooth <- textmodel_NB(trainingset, trainingclass,
                                 distribution = "Bernoulli",
                                 prior = "docfreq", smooth = 0)
nb.bern.smooth <- textmodel_NB(trainingset, trainingclass,
                               distribution = "Bernoulli",
                               prior = "docfreq", smooth = 1)

# recode the dfm as yes/no factors for e1071::naiveBayes()
dBinary <- as.data.frame(trainingset)
dBinary[, ] <- ifelse(dBinary[, ] > 0, "yes", "no")
dBinary[, ] <- lapply(dBinary[, ], as.factor)
dBinary

# compare against e1071's Bernoulli-style Naive Bayes
library(e1071)
(nb.e.nosmooth <- e1071::naiveBayes(dBinary, trainingclass, laplace = 0))
nb.bern.nosmooth
nb.bern.nosmooth$PwGc[1, ]
lapply(nb.e.nosmooth$tables, function(x) x[2, ])
# quanteda solution needs to be multiplied by 2??

(nb.e.smooth <- e1071::naiveBayes(dBinary, trainingclass, laplace = 1))
nb.bern.smooth
nb.bern.smooth$PwGc[1, ]
lapply(nb.e.smooth$tables, function(x) x[2, ])
kbenoit commented 7 years ago

You're right, @thomasd2, I had Bernoulli programmed wrongly, especially the prediction. I think it now matches the algorithm and output in IIR. See the new files in branch fix-776.
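
For reference, the Bernoulli likelihood in IIR Ch. 13 is estimated from document frequencies rather than token counts, P(t|c) = (N_ct + 1)/(N_c + 2), where N_ct is the number of class-c training documents containing term t and N_c is the number of class-c documents. The sketch below applies that formula by hand to class Y; it shows the IIR target values, not the code in the fix-776 branch:

# IIR-style smoothed Bernoulli likelihoods for class Y (documents d1-d3)
bin_train <- as.matrix(trainingset)[1:3, ] > 0   # which class-Y documents contain each term
N_ct <- colSums(bin_train)                       # document frequency per term within Y
N_c <- nrow(bin_train)                           # 3 documents in class Y
(N_ct + 1) / (N_c + 2)                           # e.g. Chinese = 4/5, Tokyo = Japan = 1/5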