quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

textmodel_wordscores with smooth > 0 - should it pass smoothed or original x? #1476

Closed sjankin closed 5 years ago

sjankin commented 5 years ago

Describe the bug

textmodel_wordscores produces different results in v1.2.0 and v1.3.4. This appears to be related to the handling of the smoothing parameter. Without a smoothing parameter (defaulting to 0), prediction results are the same across versions. Specifying a smoothing parameter (Laplace or Jeffreys) produces different prediction results.

Without smoothing

v1.2.0

packageVersion("quanteda")

(ws <- textmodel_wordscores(data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA), scale = "linear"))
predict(ws)

[1] ‘1.2.0’

Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA), scale = "linear")

Scale: linear; 5 reference scores; 37 scored features.
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

v1.3.4


packageVersion("quanteda")

(ws <- textmodel_wordscores(data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA), scale = "linear"))

predict(ws)

[1] ‘1.3.4’

Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA), scale = "linear")

Scale: linear; 5 reference scores; 37 scored features.
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

With a smoothing parameter:

v1.2.0

packageVersion("quanteda")

(ws <- textmodel_wordscores(data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA), scale = "linear", smooth = 1))

predict(ws)

[1] ‘1.2.0’

Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA), scale = "linear", smooth = 1)

Scale: linear; 5 reference scores; 37 scored features.
           R1            R2            R3            R4            R5            V1 
-1.212047e+00 -6.952988e-01  2.222614e-18  6.952988e-01  1.212047e+00 -4.214674e-01 

v1.3.4

packageVersion("quanteda")

(ws <- textmodel_wordscores(data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA), scale = "linear", smooth = 1))

predict(ws)

[1] ‘1.3.4’

Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA), scale = "linear", smooth = 1)

Scale: linear; 5 reference scores; 37 scored features.
           R1            R2            R3            R4            R5            V1 
-1.256893e+00 -7.210249e-01  3.035766e-18  7.210249e-01  1.256893e+00 -4.370617e-01 

System information

v1.2.0

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_1.2.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19       knitr_1.20         bindr_0.1.1        magrittr_1.5       stopwords_0.9.0   
 [6] tidyselect_0.2.4   munsell_0.5.0      colorspace_1.3-2   lattice_0.20-35    R6_2.3.0          
[11] rlang_0.3.0.1      fastmatch_1.1-0    stringr_1.3.1      plyr_1.8.4         dplyr_0.7.6       
[16] tools_3.5.1        grid_3.5.1         data.table_1.11.8  gtable_0.2.0       spacyr_0.9.91     
[21] RcppParallel_4.4.1 yaml_2.2.0         lazyeval_0.2.1     assertthat_0.2.0   tibble_1.4.2      
[26] Matrix_1.2-14      bindrcpp_0.2.2     purrr_0.2.5        ggplot2_3.1.0      glue_1.2.0        
[31] stringi_1.2.4      compiler_3.5.1     pillar_1.2.3       scales_1.0.0       lubridate_1.7.4   
[36] pkgconfig_2.0.1

v1.3.4

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_1.3.4

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19       knitr_1.20         bindr_0.1.1        magrittr_1.5       stopwords_0.9.0   
 [6] tidyselect_0.2.4   munsell_0.5.0      colorspace_1.3-2   lattice_0.20-35    R6_2.3.0          
[11] rlang_0.3.0.1      fastmatch_1.1-0    stringr_1.3.1      plyr_1.8.4         dplyr_0.7.6       
[16] tools_3.5.1        grid_3.5.1         data.table_1.11.8  gtable_0.2.0       spacyr_0.9.91     
[21] RcppParallel_4.4.1 yaml_2.2.0         lazyeval_0.2.1     assertthat_0.2.0   tibble_1.4.2      
[26] Matrix_1.2-14      bindrcpp_0.2.2     purrr_0.2.5        ggplot2_3.1.0      glue_1.2.0        
[31] stringi_1.2.4      compiler_3.5.1     pillar_1.2.3       scales_1.0.0       lubridate_1.7.4   
[36] pkgconfig_2.0.1   
kbenoit commented 5 years ago

Thanks @sjankin, we will investigate ASAP.

kbenoit commented 5 years ago

The difference is that in 1.3.4, the dfm stored in the textmodel_wordscores object is not smoothed, even though the word score coefficients it estimates are based on the smoothed dfm. So when you call predict() on the textmodel_wordscores object without specifying newdata, it uses the original, unsmoothed dfm that was the input to textmodel_wordscores.

> ws_smooth1 <- textmodel_wordscores(data_dfm_lbgexample, smooth = 1,
+                                         c(seq(-1.5, 1.5, .75), NA), scale = "linear")
>     predict(ws_smooth1)
           R1            R2            R3            R4            R5            V1 
-1.256893e+00 -7.210249e-01  3.035766e-18  7.210249e-01  1.256893e+00 -4.370617e-01 
>     predict(ws_smooth1, newdata = data_dfm_lbgexample)
           R1            R2            R3            R4            R5            V1 
-1.256893e+00 -7.210249e-01  3.035766e-18  7.210249e-01  1.256893e+00 -4.370617e-01 
>     predict(ws_smooth1, newdata = dfm_smooth(data_dfm_lbgexample, smoothing = 1))
           R1            R2            R3            R4            R5            V1 
-1.212047e+00 -6.952988e-01  2.222614e-18  6.952988e-01  1.212047e+00 -4.214674e-01 

Why did we do it this way? Because the idea is to smooth the estimated word scores without changing the prediction object. Prediction takes place on the actual dfm, not on a smoothed version: the smoothing exists specifically to avoid zero-count features when fitting the scores (especially for the logit scale). If you want the smoothing to be part of the object itself, then you can call:

> ws_smooth1a <- textmodel_wordscores(dfm_smooth(data_dfm_lbgexample, smoothing = 1),
+                                         c(seq(-1.5, 1.5, .75), NA), scale = "linear")
> predict(ws_smooth1a)
           R1            R2            R3            R4            R5            V1 
-1.212047e+00 -6.952988e-01  2.222614e-18  6.952988e-01  1.212047e+00 -4.214674e-01 

If you think this behaviour should be changed, we are open to arguments as to why, especially if you can point to other R functions in established packages that would follow a similar behaviour.

sjankin commented 5 years ago

That makes sense. Though in practice this led to a pretty dramatic change in behavior, especially when wordscores is just one part of an analytical pipeline. Most previous wordscores results (with smoothing) are now not replicable. Maybe add a note in the documentation and an example (as above) of how to replicate the previous behavior?
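
Something like the following in the examples would probably cover it (just a sketch, reusing your code from above):

# to reproduce the pre-1.3 (smoothed-prediction) behavior in 1.3.4, pass the
# smoothed dfm explicitly at prediction time
ws <- textmodel_wordscores(data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA),
                           scale = "linear", smooth = 1)
predict(ws, newdata = dfm_smooth(data_dfm_lbgexample, smoothing = 1))

# or smooth the dfm before fitting, so that the smoothed counts are stored
# in the object and used by predict() by default
ws_pre <- textmodel_wordscores(dfm_smooth(data_dfm_lbgexample, smoothing = 1),
                               c(seq(-1.5, 1.5, .75), NA), scale = "linear")
predict(ws_pre)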

kbenoit commented 5 years ago

Sure, I'll add something to that effect to the examples. My own practice for maximizing replicability is to specify arguments explicitly whenever possible: for instance, writing newdata = ... rather than relying on the default of taking it from the object, and dfm_smooth(x, smoothing = 1) rather than dfm_smooth(x) or even dfm_smooth(x, 1). That way you know exactly what is happening and do not rely on default values or argument positions, and you are more likely to notice what breaks if and when names, positions, or defaults change.
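
Concretely, a fully explicit version of the smoothing workflow above looks something like this (a sketch only):

# every argument named; nothing left to default values or argument positions
smoothed <- dfm_smooth(x = data_dfm_lbgexample, smoothing = 1)
ws <- textmodel_wordscores(x = smoothed, y = c(seq(-1.5, 1.5, .75), NA),
                           scale = "linear")
predict(ws, newdata = smoothed)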

HOWEVER in this case we should have noticed the change because of failing unit tests!

sjankin commented 5 years ago

If we are to specify the arguments, presumably we'd want to smooth the dfm first and then apply tf-idf weighting. So, from the example above:

> ws_smooth1a <- textmodel_wordscores(dfm_tfidf(dfm_smooth(data_dfm_lbgexample, smoothing = 1)), c(seq(-1.5, 1.5, .75), NA), scale = "linear")
> predict(ws_smooth1a)
Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : 
  no 'dimnames[[.]]': cannot use character indexing
In addition: Warning message:
37 features in newdata not used in prediction. 
kbenoit commented 5 years ago

If you weight a smoothed dfm by tf-idf, then you have an empty dfm!

> head(dfm_tfidf(dfm_smooth(data_dfm_lbgexample, smoothing = 1)), nf = 20)
Document-feature matrix of: 6 documents, 20 features (0% sparse).
6 x 20 sparse Matrix of class "dfm"
    features
docs A B C D E F G H I J K L M N O P Q R S T
  R1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  R2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  R3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  R4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  R5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  V1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

There is no need (or reason) to use tf-idf weighting here. Removing terms that occur in every document, which is what the inverse document frequency weight does, is contrary to the generative model behind wordscores (and to our own AJPS paper).
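
To see why the weighted dfm is empty: after smoothing, every feature has a nonzero count in every document, so its document frequency equals ndoc, and the inverse document frequency weight log10(ndoc / docfreq) = log10(1) = 0 for every feature, which zeroes out the whole matrix. A quick check (sketch):

smoothed <- dfm_smooth(data_dfm_lbgexample, smoothing = 1)
range(docfreq(smoothed))               # every feature occurs in all 6 documents
docfreq(smoothed, scheme = "inverse")  # so the idf weight is 0 for every feature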

sjankin commented 5 years ago

Duh, true!

kbenoit commented 5 years ago

I will write a blog entry soon about the inappropriateness of tf-idf weighting, since I see a lot of use of this weighting without people fully thinking through the consequences of weighting by inverse document frequency. It gets rid of stop words, but it also removes all the features common to a particular topic, like removing economic terms from a policy debate about the economy. For social science text analysis I cannot think of a reason ever to do this. tf-idf is mainly used in information retrieval, or by those training a classifier whose only goal is to maximize an F-measure (and it does not always work well even for that).

sjankin commented 5 years ago

Good point. Thanks!

kbenoit commented 5 years ago

Continuing from https://github.com/quanteda/quanteda/pull/1480#issuecomment-434870266, it would be nice to have a definitive view on this, including whether we should also pass some version of the smoothed training set. @patperry @koheiw @sjankin

sjankin commented 5 years ago

I appreciate @koheiw's comment. My only concern is from the application side: people have been using textmodel_wordscores for some time (I'd say from the start of quanteda), and I'd guess most Wordscores applications have used the smoothing parameter, since that was the official recommendation. Breaking the behavior makes previous results non-replicable. @kbenoit, textmodel_affinity should be a glorious replacement for textmodel_wordscores, so could you implement the correct behavior in the affinity model and keep textmodel_wordscores frozen for consistency?
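
For reference, fitting the affinity model on the same example data looks roughly like this (a sketch only; I am assuming the textmodel_affinity() interface that takes class labels for the reference texts, with NA marking the virgin texts, so the labels and call below are illustrative):

# rough sketch: assumes textmodel_affinity() takes reference class labels
# (here "L" and "R" for R1 and R5) and NA for the texts to be scored
af <- textmodel_affinity(data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
predict(af)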