quanteda / quanteda.textstats

Textual statistics for quanteda
GNU General Public License v3.0

textstat_lexdiv: different results for tokens and dfm objects #55

Open ElisaWirsching opened 1 year ago

ElisaWirsching commented 1 year ago

Describe the bug

I noticed that textstat_lexdiv produces different results depending on whether a tokens or a dfm object is passed to it. When I calculate the TTR by hand, for example, the figures match the output of textstat_lexdiv with a dfm exactly, but differ from the output of the function with a tokens object. Why is this? Is this behavior expected? It is not clear to me from the source code.

Reproducible code

library(quanteda)
library(quanteda.textstats)
library(magrittr)  # provides %>%

data(data_corpus_inaugural)
reagan_corpus <- corpus_subset(data_corpus_inaugural, Year == 1981 | Year == 1985)
reagan_tokens <- tokens(reagan_corpus, remove_punct = TRUE, remove_numbers = FALSE,
                        remove_symbols = FALSE)

# Named reagan_dfm so the object does not shadow the dfm() function
reagan_dfm <- dfm(reagan_tokens, tolower = FALSE)
reagan_dfm %>% textstat_lexdiv(measure = c("TTR", "R"),
                               remove_numbers = FALSE, remove_punct = TRUE,
                               remove_symbols = FALSE, remove_hyphens = FALSE)

# --- Versus: the same call on the tokens object

reagan_tokens %>% textstat_lexdiv(measure = c("TTR", "R"),
                                  remove_numbers = FALSE, remove_punct = TRUE,
                                  remove_symbols = FALSE, remove_hyphens = FALSE)

# --- By hand:

ntype(reagan_dfm) / ntoken(reagan_dfm)       # this is the same as textstat_lexdiv with a dfm
ntype(reagan_tokens) / ntoken(reagan_tokens) # this is the same as textstat_lexdiv with a dfm
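
For reference, the by-hand check above divides the number of distinct types by the number of tokens. The following is a minimal base-R sketch of that arithmetic on a toy token vector. The helper `ttr_by_hand()` and the vector `toy_tokens` are hypothetical names made up for illustration and are not part of quanteda. The sketch also shows that the ratio changes when tokens are lowercased before counting, so preprocessing such as case folding may be worth ruling out. Nothing in this report confirms that it is the cause of the discrepancy.

```r
# Hypothetical helper (not a quanteda function): type-token ratio of a
# character vector of tokens, optionally lowercasing before counting types.
ttr_by_hand <- function(toks, lowercase = FALSE) {
  if (lowercase) toks <- tolower(toks)
  length(unique(toks)) / length(toks)
}

# Toy data, made up for illustration only
toy_tokens <- c("The", "people", "and", "the", "government", "of", "The", "people")

ttr_case_sensitive <- ttr_by_hand(toy_tokens)                    # 6 types / 8 tokens = 0.75
ttr_lowercased     <- ttr_by_hand(toy_tokens, lowercase = TRUE)  # 5 types / 8 tokens = 0.625

print(c(case_sensitive = ttr_case_sensitive, lowercased = ttr_lowercased))
```

If the two textstat_lexdiv calls differ in a similar way, comparing each against a by-hand ratio computed under the same preprocessing would show which step introduces the difference.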

Expected behavior

I would expect both methods to return the same estimates for the TTR.

System information

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda.textstats_0.95 quanteda.corpora_0.9.2  quanteda_3.2.1         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         pillar_1.6.4       compiler_4.0.2     stopwords_2.2     
 [5] forcats_0.5.0      tools_4.0.2        digest_0.6.27      evaluate_0.14     
 [9] lifecycle_1.0.3    tibble_3.1.0       lattice_0.20-41    tidylog_1.0.2     
[13] pkgconfig_2.0.3    rlang_1.0.6        fastmatch_1.1-0    Matrix_1.4-1      
[17] cli_3.6.0          rstudioapi_0.13    yaml_2.2.1         parallel_4.0.2    
[21] xfun_0.30          fastmap_1.1.0      dplyr_1.1.0        knitr_1.37        
[25] generics_0.1.2     vctrs_0.5.2        grid_4.0.2         tidyselect_1.2.0  
[29] nsyllable_1.0.1    glue_1.4.2         R6_2.5.0           pbapply_1.4-2     
[33] fansi_0.4.2        rmarkdown_2.14     pacman_0.5.1       tidyr_1.2.0       
[37] purrr_0.3.4        magrittr_2.0.1     clisymbols_1.2.0   ellipsis_0.3.2    
[41] htmltools_0.5.2    corpus_0.10.1      stringdist_0.9.8   utf8_1.1.4        
[45] stringi_1.5.3      RcppParallel_5.0.3 crayon_1.4.1