Closed kbenoit closed 4 years ago
Thank you for quickly looking into this. Could you please provide a little more detail regarding your comment on applying the dfm_tfidf() for weighting prior to fitting the Naive Bayes classifier within Quanteda.
I'm still a little confused about what weights are initially applied in the dfm() function, prior to the dfm_tfidf() call, that dfm_tfidf() then applies an additional weighting method on top of. Are these just the term frequency (count) weights? (example: https://quanteda.io/reference/dfm_tfidf.html)
If it's questionable to weight by tf-idf prior to fitting the Naive Bayes model, could you provide a minimal working example of how one would estimate the Naive Bayes model using the dfm_tfidf() function?
The "idf" part multiplies the term frequency by the log of the inverse document frequency, so a non-linear transformation takes place. Multinomial Naive Bayes, however, computes proportions of the supplied dfm to form the word likelihoods, and these are no longer genuine word likelihoods once the counts have been weighted by tf-idf. In addition, because of smoothing, terms zeroed out by idf (because they occur in every document) nonetheless reappear after smoothing.
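The interaction described above can be seen in a small sketch. This is illustrative Python, not quanteda's code; the toy matrix, base-10 log, and add-one smoothing value are assumptions (base 10 and smooth = 1 are, as far as I know, quanteda's defaults for dfm_tfidf() and textmodel_nb(), but check the documentation).

```python
import math

# Toy dfm: rows = documents, columns = terms (raw counts).
# Term 0 appears in every document, so its idf is log10(3/3) = 0.
dfm = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 2, 0],
]
n_docs = len(dfm)
n_terms = len(dfm[0])

# idf_j = log10(N / df_j), where df_j = number of documents containing term j
df = [sum(1 for row in dfm if row[j] > 0) for j in range(n_terms)]
idf = [math.log10(n_docs / d) for d in df]

# tf-idf: a non-linear reweighting of the counts
tfidf = [[tf * idf[j] for j, tf in enumerate(row)] for row in dfm]

# Term 0 has been zeroed out by idf in every document...
assert all(row[0] == 0 for row in tfidf)

# ...but add-one smoothing (as in a Naive Bayes fit) gives it mass again,
# so the "word likelihoods" no longer reflect the idf zeroing at all:
smoothed = [[x + 1 for x in row] for row in tfidf]
likelihoods = [[x / sum(row) for x in row] for row in smoothed]
```

This is why feeding tf-idf scores into a multinomial model produces quantities that are not interpretable as word likelihoods.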
1 - no weights are applied in the dfm() function prior to tf-idf; the cells are just counts. To understand tf-idf, you should consult the source cited in the documentation you referenced, or look at one of the many explanations online.
2 - You don't need the dfm_tfidf() function for Naive Bayes at all. Just supply the count-weighted (original) dfm as the input, as per the examples.
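To make the "just use counts" advice concrete, here is a minimal Python sketch of how a multinomial Naive Bayes forms its word likelihoods from raw class-wise counts (the class names and numbers are made up for illustration; quanteda's textmodel_nb() does this in R):

```python
import math

# Term counts aggregated by class (two classes, three terms)
counts_by_class = {
    "pos": [3, 0, 1],
    "neg": [0, 4, 1],
}

def word_likelihoods(counts, smooth=1.0):
    # Multinomial NB: smoothed class-wise proportions of term counts
    totals = [c + smooth for c in counts]
    z = sum(totals)
    return [t / z for t in totals]

params = {k: word_likelihoods(v) for k, v in counts_by_class.items()}

def log_score(doc_counts, probs):
    # Log-likelihood of a document under a class (flat prior assumed)
    return sum(n * math.log(p) for n, p in zip(doc_counts, probs))

# Score a new document against both classes
new_doc = [2, 0, 1]
scores = {k: log_score(new_doc, p) for k, p in params.items()}
```

Because the likelihoods are proportions of counts, substituting tf-idf scores for the counts changes their meaning entirely, which is the concern raised above.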
Thank you for your response. The nonlinear transformation of counts by tf-idf, and the difference between it and Naive Bayes, is now clear. I think I was confused by the implementation shown in this example, which I thought used textmodel_nb() rather than textmodel_svm(): https://github.com/quanteda/quanteda/issues/1646. Thank you for your continued work on Quanteda.
When called on their own, dfm_weight() and dfm_group() will refuse to apply further weights or to sum counts (respectively) if a dfm has already been weighted, unless the force = TRUE argument is specified. Because these calls are inside all of the supervised model functions, and the default is force = FALSE, this halted with an error when a user wanted to train a model on a weighted dfm, say one that had been weighted by dfm_tfidf(). While it's questionable whether one should weight by tf-idf before fitting a multinomial Naive Bayes classifier, this PR nonetheless forces the weight (or group) and lets the supervised model fitting proceed. Caveat emptor.

Solves #15.
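The force-guard behaviour described above can be sketched as follows. This is a hypothetical Python illustration of the pattern, not quanteda's actual implementation; the function name, arguments, and error message are invented for the example.

```python
# Hypothetical sketch: refuse to re-weight an already weighted dfm
# unless the caller explicitly forces it
def apply_weights(dfm, weights, already_weighted, force=False):
    if already_weighted and not force:
        raise ValueError(
            "will not weight a dfm that has already been weighted; "
            "use force = TRUE to override"
        )
    return [[x * w for x, w in zip(row, weights)] for row in dfm]

dfm = [[1.5, 0.0], [0.3, 2.1]]   # pretend these are tf-idf scores

# Without force, the call halts with an error...
try:
    apply_weights(dfm, [2, 2], already_weighted=True)
    refused = False
except ValueError:
    refused = True

# ...with force, the re-weighting proceeds (caveat emptor)
forced = apply_weights(dfm, [2, 2], already_weighted=True, force=True)
```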