Issue: textmodel_nb() and dfm_tfidf() -- Error: will not group a weighted dfm; use force = TRUE to override #15

Describe the bug

Attempting to use a dfm with the tf-idf weighting scheme (dfm_tfidf()) within textmodel_nb(), but I receive the following error: `Error: will not group a weighted dfm; use force = TRUE to override`

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

packages = c("tm", "quanteda")

# install.packages(packages)
# update.packages(packages)
lapply(packages, require, character.only = TRUE)

# data_corpus <- corpus(data_corpus_inaugural) #, docvars = data.frame(party = names(data_corpus_inaugural)))

dfm = dfm(x = data_corpus_inaugural, 
          tolower = TRUE, 
          stem = TRUE, 
          remove_punct = TRUE, 
          ngrams = 1:2,
          verbose = TRUE)

# remove stopwords after stemming and sparse
dfm = dfm(x = dfm, 
          tolower = FALSE,
          remove = stopwords("english"), 
          # remove = c(stopwords("english"), additional.stopwords), 
          # stem = TRUE, 
          # remove_punct = TRUE, 
          # ngrams = 1:2,
          verbose = TRUE)

dfm = dfm_tfidf(dfm, force = TRUE)

# naive bayes multinomial model

docvars(dfm, "is_prewar") <- docvars(dfm, "Year") < 1945 

# split into training and test sets
train_dfm <- dfm_sample(dfm, size = 40)
test_dfm <- dfm[setdiff(docnames(dfm), docnames(train_dfm)), ] 


# error message here with tf-idf:
# Error: will not group a weighted dfm; use force = TRUE to override
model = textmodel_nb(train_dfm, y = docvars(train_dfm, "Year"))

# Doesn't work with 'force' option either
model = textmodel_nb(train_dfm, y = docvars(train_dfm, "Year", force = TRUE))
predict.model = predict(model, newdata = test_dfm)

# confusion table (in sample)
table(prediction = predict.model, training_data_id = docvars(test_dfm, "training_data_id"))

# predicted.values = data.table(predict.model)
docvars(dfm, "svm_relevant") = predict.model


# top 50 features w/ frequencies
topfeatures(dfm, 50)

Expected behavior

Would like textmodel_nb() to accept dfm_tfidf() object and return

Additional info

kbenoit commented 4 years ago

Reproducible example:

## Loading required package: quanteda
## Package version: 2.0.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##     View

txt <- c(
  d1 = "Chinese Beijing Chinese",
  d2 = "Chinese Chinese Shanghai",
  d3 = "Chinese Macao",
  d4 = "Tokyo Japan Chinese",
  d5 = "Chinese Chinese Chinese Tokyo Japan"
)
trset <- dfm(txt, tolower = FALSE)
trclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

tmod1 <-
  textmodel_nb(trset, y = trclass, prior = "docfreq")

tmod2 <-
  textmodel_nb(dfm_tfidf(trset), y = trclass, prior = "docfreq")
## Error: will not group a weighted dfm; use force = TRUE to override

scottdallman commented 4 years ago

Thank you for quickly looking into this. Could you please provide a little more detail regarding your comment on applying dfm_tfidf() weighting before fitting the Naive Bayes classifier in quanteda?

  1. I'm still a little confused about which weights dfm() applies before the dfm_tfidf() call, i.e. what dfm_tfidf() is adding its weighting on top of. Are these just the term frequency weights? (example: https://quanteda.io/reference/dfm_tfidf.html)

  2. If it's questionable to weight by tf-idf prior to fitting the Naive Bayes, could you provide a minimal working example of how one would estimate a Naive Bayes model using the dfm_tfidf() function?
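
On the first question, a minimal base-R sketch may help frame things. As far as I understand it, dfm() on its own stores raw term counts, and dfm_tfidf() with its default settings (scheme_tf = "count", scheme_df = "inverse", base = 10) multiplies each count by log10(N / document frequency). The code below recomputes that by hand for the five example documents above, without calling quanteda. The count matrix is typed in manually, and the names counts, docfreq, idf and tfidf are illustrative only, not part of any package.

```r
# Raw term counts for the five example documents (typed in by hand;
# this is what an unweighted dfm would hold for these texts).
counts <- matrix(
  c(2, 1, 0, 0, 0, 0,
    2, 0, 1, 0, 0, 0,
    1, 0, 0, 1, 0, 0,
    1, 0, 0, 0, 1, 1,
    3, 0, 0, 0, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    docs = paste0("d", 1:5),
    features = c("Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan")
  )
)

# Document frequency: number of documents containing each feature.
docfreq <- colSums(counts > 0)

# Inverse document frequency, base 10
# (assumed to match the dfm_tfidf() defaults).
idf <- log10(nrow(counts) / docfreq)

# tf-idf: scale each feature column of raw counts by its idf.
tfidf <- sweep(counts, 2, idf, `*`)

round(tfidf, 3)
```

Under these assumptions, "Chinese" appears in all five documents, so its idf is log10(5 / 5) = 0 and its whole column is weighted to zero, while the remaining cells become non-integer values rather than counts.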

kbenoit commented 4 years ago

