quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Issue: textmodel_nb() and dfm_tfidf() -- Error: will not group a weighted dfm; use force = TRUE to override #15

Closed scottdallman closed 4 years ago

scottdallman commented 4 years ago

Describe the bug

Passing a dfm weighted with dfm_tfidf() to textmodel_nb() produces the following error: `Error: will not group a weighted dfm; use force = TRUE to override`

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.


rm(list=ls())
packages = c("tm",
             "dplyr",
             "SnowballC", 
             "data.table",
             "foreign",
             "haven",
             "magrittr",
             "quanteda",
             "textclean",
             "tidytext",
             "tidyverse",
             "topicmodels"
)

# install.packages(packages)
# update.packages(packages)
lapply(packages, require, character.only = TRUE)

# data_corpus <- corpus(data_corpus_inaugural) #, docvars = data.frame(party = names(data_corpus_inaugural)))

dfm = dfm(x = data_corpus_inaugural, 
          tolower = TRUE, 
          stem = TRUE, 
          remove_punct = TRUE, 
          ngrams = 1:2,
          verbose = TRUE
)

# remove stopwords after stemming and sparse
dfm = dfm(x = dfm, 
          tolower = FALSE,
          remove = stopwords("english"), 
          # remove = c(stopwords("english"), additional.stopwords), 
          # stem = TRUE, 
          # remove_punct = TRUE, 
          # ngrams = 1:2,
          verbose = TRUE
)

dfm = dfm_tfidf(dfm, force = TRUE)

#######################################################
# naive bayes multinomial model
#######################################################

set.seed(2216100)  # seed before sampling so the train/test split is reproducible

docvars(dfm, "is_prewar") <- docvars(dfm, "Year") < 1945

train_dfm <- dfm_sample(dfm, size = 40)
test_dfm <- dfm[setdiff(docnames(dfm), docnames(train_dfm)), ]

# error message here with tf-idf:
# Error: will not group a weighted dfm; use force = TRUE to override
model = textmodel_nb(train_dfm, y = docvars(train_dfm, "Year"))

# Doesn't work with 'force' option either
# (note: as written, force = TRUE is passed to docvars(), not to textmodel_nb())
model = textmodel_nb(train_dfm, y = docvars(train_dfm, "Year", force = TRUE))
predict.model = predict(model, newdata = test_dfm)

# confusion table (in sample)
table(prediction = predict.model, training_data_id = docvars(test_dfm, "training_data_id"))

# predicted.values = data.table(predict.model)
docvars(dfm, "svm_relevant") = predict.model

table(predict.model)

# top 50 features w/ frequencies
topfeatures(dfm, 50)

Expected behavior

textmodel_nb() should accept a dfm weighted with dfm_tfidf() instead of stopping with the error above.

System information

Please run sessionInfo() and paste the output.

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xml2_1.2.2           wordcloud_2.6        RColorBrewer_1.1-2   udpipe_0.8.3         jsonlite_1.6         topicmodels_0.2-9   
 [7] forcats_0.4.0        stringr_1.4.0        purrr_0.3.3          readr_1.3.1          tidyr_1.0.0          tibble_2.1.3        
[13] ggplot2_3.2.1        tidyverse_1.3.0      tidytext_0.2.2       textrank_0.3.0       textclean_0.9.3      stm_1.3.5           
[19] sparklyr_1.1.0       spacyr_1.2           readxl_1.3.1         microbenchmark_1.4-7 magrittr_1.5         haven_2.2.0         
[25] foreign_0.8-71       data.table_1.12.8    SnowballC_0.6.0      doParallel_1.0.15    iterators_1.0.12     foreach_1.4.7       
[31] dplyr_0.8.3          e1071_1.7-3          tm_0.7-7             NLP_0.2-0            quanteda_1.5.2      

loaded via a namespace (and not attached):
 [1] nlme_3.1-140        fs_1.3.1            lubridate_1.7.4     httr_1.4.1          rprojroot_1.3-2     tools_3.6.1         backports_1.1.4    
 [8] R6_2.4.0            DBI_1.0.0           lazyeval_0.2.2      colorspace_1.4-1    withr_2.1.2         tidyselect_0.2.5    compiler_3.6.1     
[15] cli_1.1.0           rvest_0.3.5         textshape_1.6.0     forge_0.2.0         slam_0.1-47         scales_1.0.0        digest_0.6.20      
[22] base64enc_0.1-3     pkgconfig_2.0.2     htmltools_0.4.0     dbplyr_1.4.2        htmlwidgets_1.3     rlang_0.4.2         rstudioapi_0.10    
[29] generics_0.0.2      qdapRegex_0.7.2     tokenizers_0.2.1    modeltools_0.2-22   Matrix_1.2-17       Rcpp_1.0.2          munsell_0.5.0      
[36] lifecycle_0.1.0     stringi_1.4.3       grid_3.6.1          crayon_1.3.4        lattice_0.20-38     hms_0.5.3           zeallot_0.1.0      
[43] pillar_1.4.2        igraph_1.2.4.2      codetools_0.2-16    stopwords_1.0       stats4_3.6.1        fastmatch_1.1-0     reprex_0.3.0       
[50] glue_1.3.1          RcppParallel_4.4.4  modelr_0.1.5        lexicon_1.2.1       vctrs_0.2.1         cellranger_1.1.0    gtable_0.3.0       
[57] assertthat_0.2.1    r2d3_0.2.3          syuzhet_1.0.4       broom_0.5.2         janeaustenr_0.1.5   class_7.3-15        ISOcodes_2019.12.22
[64] ellipsis_0.3.0   

Additional info

kbenoit commented 4 years ago

Reproducible example:

library("quanteda.textmodels")
## Loading required package: quanteda
## Package version: 2.0.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

txt <- c(
  d1 = "Chinese Beijing Chinese",
  d2 = "Chinese Chinese Shanghai",
  d3 = "Chinese Macao",
  d4 = "Tokyo Japan Chinese",
  d5 = "Chinese Chinese Chinese Tokyo Japan"
)
trset <- dfm(txt, tolower = FALSE)
trclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

tmod1 <-
  textmodel_nb(trset, y = trclass, prior = "docfreq")

tmod2 <-
  textmodel_nb(dfm_tfidf(trset), y = trclass, prior = "docfreq")
## Error: will not group a weighted dfm; use force = TRUE to override
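
For readers wondering where the word "group" in the error comes from: a multinomial Naive Bayes fit sums the feature counts of all documents in each class and then turns those sums into per-class word probabilities. The base-R sketch below illustrates that step on the four labelled documents (d1 to d4) from the example above. It is not quanteda's implementation; the names `counts`, `labels`, `grouped`, `smooth`, and `pwc` are made up for illustration, and add-one smoothing is assumed.

```r
# Raw counts for the four labelled training documents (d1-d4).
counts <- matrix(
  c(2, 1, 0, 0, 0, 0,
    2, 0, 1, 0, 0, 0,
    1, 0, 0, 1, 0, 0,
    1, 0, 0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("d1", "d2", "d3", "d4"),
    c("Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan")
  )
)
labels <- factor(c("Y", "Y", "Y", "N"))

# "Grouping": add up the counts of all documents that share a class.
grouped <- rowsum(counts, group = labels)

# Add-one (Laplace) smoothing, then normalise each class row to sum to 1.
# This gives an estimate of P(word | class).
smooth <- 1
pwc <- (grouped + smooth) / rowSums(grouped + smooth)

print(round(pwc, 3))
```

The grouping step assumes the cells are counts that can meaningfully be added together. Once the cells have been reweighted (for example by tf-idf), the class totals are no longer word counts, which is presumably why the grouping step refuses a weighted dfm unless it is forced.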

scottdallman commented 4 years ago

Thank you for looking into this so quickly. Could you please provide a little more detail on your comment about applying dfm_tfidf() weighting before fitting the Naive Bayes classifier in quanteda?

  1. I'm still a little confused about which weights dfm() has already applied by the time dfm_tfidf() is called, that is, what dfm_tfidf() is adding a further weighting to. Are these just the term frequency weights? (example: https://quanteda.io/reference/dfm_tfidf.html)

  2. If it's questionable to weight by tf-idf before fitting the Naive Bayes model, could you provide a minimal working example of how one would estimate the Naive Bayes model using the dfm_tfidf() function?
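
As background for question 1, here is a minimal base-R sketch of the two stages, using the toy documents from the reproducible example above. It assumes the documented defaults of dfm_tfidf() (raw counts for term frequency, inverse document frequency, log base 10); the helper names `toks`, `vocab`, `counts`, `docfreq`, `idf`, and `tfidf` are made up for illustration, and this is not quanteda's own code.

```r
txt <- c(
  d1 = "Chinese Beijing Chinese",
  d2 = "Chinese Chinese Shanghai",
  d3 = "Chinese Macao",
  d4 = "Tokyo Japan Chinese",
  d5 = "Chinese Chinese Chinese Tokyo Japan"
)

# Stage 1: plain term counts per document (what an unweighted dfm holds).
toks <- strsplit(txt, " ", fixed = TRUE)
vocab <- unique(unlist(toks))
counts <- t(vapply(
  toks,
  function(d) tabulate(match(d, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Stage 2: tf-idf = count * log10(number of documents / document frequency).
docfreq <- colSums(counts > 0)
idf <- log10(nrow(counts) / docfreq)
tfidf <- sweep(counts, 2, idf, `*`)

print(counts)
print(round(tfidf, 3))
```

Under these assumptions the first stage is a matrix of integer counts with no weighting at all, and the second stage rescales each column by its inverse document frequency. Every entry in the Chinese column becomes zero because that term occurs in all five documents, so the weighted cells are no longer counts.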

kbenoit commented 4 years ago

Answer moved to https://github.com/quanteda/quanteda.textmodels/pull/16#issuecomment-591111085