quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Fix #15 by forcing dfm_weight and dfm_group #16

Closed kbenoit closed 4 years ago

kbenoit commented 4 years ago

When called on their own, dfm_weight() and dfm_group() will refuse to apply further weights or to sum counts (respectively) if a dfm has already been weighted, unless the force = TRUE argument is specified. Because these calls are made inside all of the supervised model functions, and the default is force = FALSE, fitting halted with an error when a user tried to train a model on a weighted dfm, for instance one that had been weighted by dfm_tfidf(). While it's questionable whether one should weight by tf-idf before fitting a multinomial Naive Bayes classifier, this PR nonetheless forces the weighting (or grouping) and lets the supervised model fitting proceed. Caveat emptor.

Solves #15.
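The behaviour this PR works around can be sketched with a tiny made-up example (the documents and weighting scheme here are purely illustrative):

```r
library(quanteda)

# a toy dfm of raw counts, then tf-idf weighting
dfmat <- dfm(tokens(c(d1 = "a b b c", d2 = "a a b d")))
tfidf <- dfm_tfidf(dfmat)

# re-weighting an already weighted dfm errors unless force = TRUE:
# dfm_weight(tfidf, scheme = "prop")  # error: dfm has already been weighted
wtd <- dfm_weight(tfidf, scheme = "prop", force = TRUE)  # proceeds, caveat emptor
```

Passing force = TRUE internally is what lets the supervised model functions accept a pre-weighted dfm without stopping.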

scottdallman commented 4 years ago

Thank you for quickly looking into this. Could you please provide a little more detail on your comment about applying dfm_tfidf() weighting prior to fitting the Naive Bayes classifier within quanteda?

  1. I'm still a little confused about what weights are initially applied in the dfm() function, prior to the dfm_tfidf() call, that dfm_tfidf() then adds an additional weighting method to - are these just the term frequency weights? (example: https://quanteda.io/reference/dfm_tfidf.html)

  2. If it's questionable to weight by tf-idf prior to fitting the Naive Bayes classifier, could you provide a minimal working example of how one would estimate the Naive Bayes model using the dfm_tfidf() function?

kbenoit commented 4 years ago

The "idf" part multiplies term frequency by the log of the inverse document frequency, meaning that a non-linear transformation of the counts takes place. The multinomial Naive Bayes model, however, computes proportions of the supplied dfm to obtain the word likelihoods, and these are no longer genuine word likelihoods once the counts have been weighted by tf-idf. In addition, terms zeroed out by idf because they occur in every document nonetheless reappear after smoothing.
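To see the zeroing concretely: under the default inverse document frequency scheme, a term occurring in every document gets idf = log10(N/N) = 0, so its tf-idf weight is zero in every cell (toy documents for illustration):

```r
library(quanteda)

# "the" appears in both documents, so its idf is log10(2/2) = 0
dfmat <- dfm(tokens(c(d1 = "the cat", d2 = "the dog")))
w <- dfm_tfidf(dfmat)
as.matrix(w)  # the column for "the" is all zeros
```

Naive Bayes smoothing then adds a constant back to those zeroed cells, which is why the term "reappears" in the fitted likelihoods.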

1 - No weights are applied in the dfm() function prior to tf-idf; the cells are just counts. To understand tf-idf, you should consult the source cited in the documentation you referenced, or look at one of the many explanations available online.

2 - You don't need the dfm_tfidf() function for Naive Bayes at all. Just supply the count-weighted (original) dfm as the input, as per the examples.
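A minimal sketch of that workflow, using made-up texts and labels (only the function names dfm(), tokens(), textmodel_nb(), and predict() come from the packages; everything else is illustrative):

```r
library(quanteda)
library(quanteda.textmodels)

# hypothetical labelled training texts
txts <- c(d1 = "good great fine", d2 = "bad awful poor", d3 = "good nice")
y <- factor(c("pos", "neg", "pos"))

dfmat <- dfm(tokens(txts))      # raw counts, no tf-idf weighting
tmod <- textmodel_nb(dfmat, y)  # multinomial Naive Bayes fit on the counts
predict(tmod, newdata = dfmat)  # class predictions for each document
```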

scottdallman commented 4 years ago

Thank you for your response - the non-linear transformation of counts by tf-idf, and the difference between it and Naive Bayes, is now clear. I think I was confused by the implementation shown in this example, which I thought used textmodel_nb() rather than textmodel_svm(): https://github.com/quanteda/quanteda/issues/1646. Thank you for your continued work on quanteda.