Closed by kbenoit 7 years ago
@thomasd2 Further discussion should take place here in this issue, so I am resuming the thread from #777 here.
To replicate Example 13.1 from Introduction to Information Retrieval:
With smoothing, as per the book:
txt <- c(d1 = "Chinese Beijing Chinese",
d2 = "Chinese Chinese Shanghai",
d3 = "Chinese Macao",
d4 = "Tokyo Japan Chinese",
d5 = "Chinese Chinese Chinese Tokyo Japan")
(trainingset <- dfm(txt, tolower = FALSE))
## Document-feature matrix of: 5 documents, 6 features (60% sparse).
## 5 x 6 sparse Matrix of class "dfmSparse"
##     features
## docs Chinese Beijing Shanghai Macao Tokyo Japan
##   d1       2       1        0     0     0     0
##   d2       2       0        1     0     0     0
##   d3       1       0        0     1     0     0
##   d4       1       0        0     0     1     1
##   d5       3       0        0     0     1     1
# fit the model
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
(nb.p261 <- textmodel_NB(trainingset, trainingclass, prior = "docfreq"))
# test for results from p261
identical(nb.p261$PwGc["Y", "Chinese"], 3/7)
## [1] TRUE
identical(nb.p261$PwGc["Y", "Tokyo"], 1/14)
## [1] TRUE
identical(nb.p261$PwGc["Y", "Japan"], 1/14)
## [1] TRUE
identical(nb.p261$PwGc["N", "Chinese"], 2/9)
## [1] TRUE
identical(nb.p261$PwGc["N", "Tokyo"], 2/9)
## [1] TRUE
identical(nb.p261$PwGc["N", "Japan"], 2/9)
## [1] TRUE
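The smoothed likelihoods above can be reproduced by hand in base R, without quanteda: add one to each class-collapsed count and divide by the class token total plus the number of word types V. The count vectors below are simply the column sums of the dfm over the three "Y" and one "N" training documents.

```r
# Hand-computed add-one (Laplace) smoothed likelihoods, base R only
counts_Y <- c(Chinese = 5, Beijing = 1, Shanghai = 1, Macao = 1, Tokyo = 0, Japan = 0)
counts_N <- c(Chinese = 1, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1)
V <- 6  # number of word types in the training set

PwGc_Y <- (counts_Y + 1) / (sum(counts_Y) + V)  # e.g. Chinese: (5+1)/(8+6) = 3/7
PwGc_N <- (counts_N + 1) / (sum(counts_N) + V)  # e.g. Tokyo:   (1+1)/(3+6) = 2/9
```

These match the `PwGc` values tested above.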
Without smoothing, as per your example:
## without smoothing
(nb.p261.nosmooth <- textmodel_NB(trainingset, trainingclass, prior = "docfreq", smooth = 0))
# Fitted Naive Bayes model:
# Call:
# textmodel_NB(x = trainingset, y = trainingclass, smooth = 0,
# prior = "docfreq")
#
#
# Training classes and priors:
# Y N
# 0.75 0.25
#
# Likelihoods: Class Posteriors:
# 6 x 4 Matrix of class "dgeMatrix"
# Y N Y N
# Chinese 0.625 0.3333333 0.8490566 0.1509434
# Beijing 0.125 0.0000000 1.0000000 0.0000000
# Shanghai 0.125 0.0000000 1.0000000 0.0000000
# Macao 0.125 0.0000000 1.0000000 0.0000000
# Tokyo 0.000 0.3333333 0.0000000 1.0000000
# Japan 0.000 0.3333333 0.0000000 1.0000000
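The class priors printed above (Y 0.75, N 0.25) follow directly from prior = "docfreq": they are just the share of labelled training documents in each class, which can be sketched in base R as:

```r
# "docfreq" prior: proportion of labelled training documents per class
trainingclass_labelled <- factor(c("Y", "Y", "Y", "N"))  # d5 is NA and excluded
prop.table(table(trainingclass_labelled))                # N 0.25, Y 0.75
```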
# test for results from p261
identical(nb.p261.nosmooth$PwGc["Y", "Chinese"], 5/8)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["Y", "Tokyo"], 0/8)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["Y", "Japan"], 0/8)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["N", "Chinese"], 1/3)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["N", "Tokyo"], 1/3)
## [1] TRUE
identical(nb.p261.nosmooth$PwGc["N", "Japan"], 1/3)
## [1] TRUE
Here, the code simply converts each count to 1 if > 0 (leaving it 0 otherwise) and runs the multinomial model. There is no worked example in the book, but the math should be the same for the matrix of 0s and 1s.
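That binarization step can be sketched in base R on the training counts from the dfm above (the matrix here is just the dfm rewritten by hand for illustration):

```r
# Binarize counts: any positive count becomes 1, zeros stay 0
counts <- matrix(c(2, 1, 0, 0, 0, 0,
                   2, 0, 1, 0, 0, 0,
                   1, 0, 0, 1, 0, 0,
                   1, 0, 0, 0, 1, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3", "d4"),
                                 c("Chinese", "Beijing", "Shanghai",
                                   "Macao", "Tokyo", "Japan")))
binary <- (counts > 0) * 1L
binary["d1", ]  # Chinese 1, Beijing 1, all others 0
```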
And for the terms not counted in the IIR text, such as "Beijing" that @thomasd2 referred to in #777, the computation is not 4/26 but rather 2/14 = 1/7 with smoothing, and 1/8 without. And this is computed correctly.
# Beijing: Pr(Beijing|c) = (1 + 1)/(8 + 6) = 1/7 (smoothed)
identical(nb.p261$PwGc["Y", "Beijing"], 1/7)
## [1] TRUE
# Beijing: Pr(Beijing|c) = (1 + 0)/(8 + 0) = 1/8 (unsmoothed)
identical(nb.p261.nosmooth$PwGc["Y", "Beijing"], 1/8)
## [1] TRUE
What I don't understand is why you add six in the smoothing case. Smoothing simply adds one to each cell, doesn't it? Don't we want the share of words that are "Beijing" across all three documents of the yes class:
# Beijing: Pr(Beijing|c) = ((1+1) + (0+1) + (0+1)) / (8 + 3*6) = 4/26
You're adding one for each word type (6), which increases the denominator used for each likelihood by 6, while you only add 1 to each numerator (one observation for a word per class).
# number of unique word types in the training set
length(unique(featnames(trainingset[1:4, ])))
## [1] 6
There's another good worked example in Jurafsky and Martin 2nd ed. if you want to work through it.
Ok, got it. Apologies - my intuition had been that smoothing adds one to each cell of the dfm and then calculates the likelihood. dfm_smooth() does that, but textmodel_NB() collapses the dfm by training class group first, and then does the addition.
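The difference between the two orders of operations can be sketched in base R, using the "Beijing" counts for the three Y-class documents (8 tokens in total, V = 6 word types):

```r
# Y-class "Beijing" counts per document; d1-d3 hold 3, 3, 2 tokens = 8 total
beijing_by_doc <- c(d1 = 1, d2 = 0, d3 = 0)
total_tokens <- 8
V <- 6  # word types

# (a) collapse by class first, then add one: what textmodel_NB() does
collapse_then_smooth <- (sum(beijing_by_doc) + 1) / (total_tokens + V)
collapse_then_smooth  # 2/14 = 1/7

# (b) add one to every cell first, then collapse: the dfm_smooth() intuition,
#     which inflates the denominator by one V per document (3 * 6 = 18)
smooth_then_collapse <- sum(beijing_by_doc + 1) /
  (total_tokens + length(beijing_by_doc) * V)
smooth_then_collapse  # 4/26
```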
Ok re the multinomial case - but did I understand it correctly that you closed the Bernoulli issue as well? (I re-installed tonight from github, version 0.9.9.68).
nb.bern.nosmooth <- textmodel_NB(trainingset, trainingclass,
                                 distribution = "Bernoulli",
                                 prior = "docfreq",
                                 smooth = 0)
nb.bern.smooth <- textmodel_NB(trainingset, trainingclass,
                               distribution = "Bernoulli",
                               prior = "docfreq",
                               smooth = 1)

# binary version of the dfm, as yes/no factors, for e1071
dBinary <- as.data.frame(trainingset)
dBinary[, ] <- ifelse(dBinary[, ] > 0, "yes", "no")
dBinary[, ] <- lapply(dBinary[, ], as.factor)
dBinary

library(e1071)
(nb.e.nosmooth <- e1071::naiveBayes(dBinary, trainingclass, laplace = 0))
nb.bern.nosmooth
nb.bern.nosmooth$PwGc[1, ]
lapply(nb.e.nosmooth$tables, function(x) x[2, ])

# quanteda solution needs to be multiplied by 2??
(nb.e.smooth <- e1071::naiveBayes(dBinary, trainingclass, laplace = 1))
nb.bern.smooth
nb.bern.smooth$PwGc[1, ]
lapply(nb.e.smooth$tables, function(x) x[2, ])
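For reference, IIR's smoothed Bernoulli estimator works on document frequencies rather than token counts, P(w|c) = (df_w + 1) / (N_c + 2), where df_w is the number of class documents containing w, N_c the number of class documents, and the +2 reflects the two Bernoulli outcomes. A base-R sketch for the Y class (document frequencies read off the dfm above):

```r
# IIR smoothed Bernoulli likelihoods for the Y class, computed by hand
df_Y <- c(Chinese = 3, Beijing = 1, Shanghai = 1, Macao = 1, Tokyo = 0, Japan = 0)
Nc_Y <- 3  # number of Y-class training documents

(df_Y + 1) / (Nc_Y + 2)  # Chinese 4/5, Beijing 2/5, Tokyo 1/5, ...
```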
You're right @thomasd2, I had Bernoulli programmed wrongly, especially the prediction. I think it matches the algorithm and output in IIR now. See the new files in branch fix-776.
See PR #777 and comments.