kbenoit / wordshoal

quanteda implementation of the Lauderdale and Herzog (2016) "Wordshoal" model

SVD failure #11

Closed cschwem2er closed 4 years ago

cschwem2er commented 6 years ago

Hi,

Using the latest versions of wordshoal and quanteda, the following example (dfm for reproduction available here) raises an error:

set.seed(1337)
shoal_15 <- textmodel_wordshoal(dfm15d, groups = docvars(dfm15d, "TopID"),
                                authors = docvars(dfm15d, "AgentID"))

Error in qatd_cpp_wordfish(x, as.integer(dir), 1/(priors^2), tol, disp, : Mat::operator(): index out of bounds 

6. stop(structure(list(message = "Mat::operator(): index out of bounds", call = qatd_cpp_wordfish(x, as.integer(dir), 1/(priors^2), tol, disp, dispersion_floor, abs_err, svd_sparse, residual_floor), cppstack = NULL), .Names = c("message", "call", "cppstack" ... 
5. qatd_cpp_wordfish(x, as.integer(dir), 1/(priors^2), tol, disp, dispersion_floor, abs_err, svd_sparse, residual_floor) 
4. textmodel_wordfish.dfm(groupdfm, tol = c(tol, 1e-08)) 
3. quanteda::textmodel_wordfish(groupdfm, tol = c(tol, 1e-08)) 
2. textmodel_wordshoal.dfm(dfm15d, groups = docvars(dfm15d, "TopID"), authors = docvars(dfm15d, "AgentID")) 
1. textmodel_wordshoal(dfm15d, groups = docvars(dfm15d, "TopID"), authors = docvars(dfm15d, "AgentID"))

This seems to happen during the wordfish SVD for one of the debates.

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2       scales_0.5.0.9000  reshape2_1.4.3     sjmisc_2.7.0       lubridate_1.7.2    stm_1.3.3         
 [7] wordshoal_0.3      quanteda_1.0.2     forcats_0.2.0      stringr_1.2.0      dplyr_0.7.4        purrr_0.2.4       
[13] readr_1.1.1        tidyr_0.8.0        tibble_1.4.2       ggplot2_2.2.1.9000 tidyverse_1.2.1   

loaded via a namespace (and not attached):
 [1] stringdist_0.9.4.6  network_1.13.0      tidyselect_0.2.3    sjlabelled_1.0.7    haven_1.1.1         lattice_0.20-35    
 [7] snakecase_0.8.1     colorspace_1.3-2    yaml_2.1.16         rlang_0.1.6.9003    pillar_1.1.0        foreign_0.8-69     
[13] glue_1.2.0          withr_2.1.1.9000    modelr_0.1.1        readxl_1.0.0        bindr_0.1           plyr_1.8.4         
[19] munsell_0.4.3       gtable_0.2.0        cellranger_1.1.0    prediction_0.2.0    rvest_0.3.2         psych_1.7.8        
[25] knitr_1.19          parallel_3.4.3      broom_0.4.3         Rcpp_0.12.15        spacyr_0.9.6        RcppParallel_4.3.20
[31] jsonlite_1.5        fastmatch_1.1-0     mnormt_1.5-5        stopwords_0.9.0     hms_0.4.1           stringi_1.1.6      
[37] ggrepel_0.7.0       grid_3.4.3          cli_1.0.0           tools_3.4.3         magrittr_1.5        lazyeval_0.2.1     
[43] crayon_1.3.4        pkgconfig_2.0.1     Matrix_1.2-12       data.table_1.10.4-3 xml2_1.2.0          assertthat_0.2.0   
[49] httr_1.3.1          rstudioapi_0.7      R6_2.2.2            nlme_3.1-131        compiler_3.4.3

amatsuo commented 6 years ago

The issue seems to be caused by one specific "debate" in the data that quanteda cannot handle. After cleaning up the data a bit, I found the debate that causes the error in textmodel_wordfish(). Here is the replication code:

load("~/Downloads/wordshoal_svd_error.RData")
library(stringi)
library(quanteda)
library(wordshoal)

# Tabulate how many documents each (TopID, AgentID) pair has
docvars(dfm15d, c("TopID", "AgentID")) %>% table %>% table

# Collapse to one document per author and debate, then recover the docvars
# from the row names (first 7 characters: AgentID; after the 8th: TopID)
dfm15d_group <- dfm_group(dfm15d, groups = c("AgentID", "TopID"))
docvars(dfm15d_group, "AgentID") <- substr(rownames(dfm15d_group), 1, 7)
docvars(dfm15d_group, "TopID") <- stri_replace_first_regex(rownames(dfm15d_group), "^.{8}", "")

group_names <- levels(factor(docvars(dfm15d_group, "TopID")))

# The 19th debate is the one that triggers the error
group_names[19]
dfm_su <- dfm_subset(dfm15d_group, TopID == group_names[19])
textmodel_wordfish(dfm_su)

I don't think wordfish should behave this way. The failing debate is number "15000037".
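
The search for the failing debate is not shown in the code above. One way to do it is to fit every debate separately and collect the error messages. The following is a minimal base-R sketch, not code from wordshoal or quanteda: `find_failing_groups` and `toy_fit` are hypothetical names, and the toy fit function only stands in for a call like textmodel_wordfish() on each debate's subset.

```r
# Hypothetical helper: apply `fit_fun` to each group's data and keep the
# error message for every group where the fit fails.
find_failing_groups <- function(data_by_group, fit_fun) {
  failures <- list()
  for (g in names(data_by_group)) {
    msg <- tryCatch(
      {
        fit_fun(data_by_group[[g]])
        NULL                       # success: nothing to record
      },
      error = function(e) conditionMessage(e)
    )
    if (!is.null(msg)) failures[[g]] <- msg
  }
  failures
}

# Toy stand-in for a model fit (an assumption for illustration only):
# it fails whenever a column of the matrix is entirely zero.
toy_fit <- function(m) {
  if (any(colSums(m) == 0)) stop("empty feature")
  invisible(TRUE)
}

debates <- list(
  d1 = matrix(c(1, 2, 3, 4), nrow = 2),
  d2 = matrix(c(1, 2, 0, 0), nrow = 2)   # second column is all zero
)
failing <- find_failing_groups(debates, toy_fit)
names(failing)
```

With real data, `data_by_group` could be a named list of per-debate dfm subsets and `fit_fun` could be textmodel_wordfish, so the names of the returned list would identify the debates that need a closer look.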

I suspected the rank of the matrix would be an issue but that was not the case.

> library(Matrix)
> dim(dfm_su)
[1]    8 4421
> dfm_su %>% convert('matrix') %>% rankMatrix()
[1] 8
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 9.816592e-13

@kbenoit, should I file an issue at quanteda?

kbenoit commented 6 years ago

It's a known issue with the sparse SVD we use for starting values. I'm beginning to think this was not such a great idea.

It works fine if you call it as

textmodel_wordfish(dfm_trim(dfm_su), sparse = FALSE)

which disables the warning messages about empty features (the dfm_trim part) and avoids the SVD failure (the sparse = FALSE part). We should probably set sparse = FALSE as the default for textmodel_wordfish(), and we should definitely also add an option to pass this through from textmodel_wordshoal().

cschwem2er commented 6 years ago

If you allow passing this parameter through wordshoal, that would also eliminate the non-deterministic behavior, right? What's the tradeoff? "Stupid" starting values?

I don't know a lot about latent variable models and finding starting values, but in the long run, is the deterministic spectral initialization that Brandon uses for stm an option for wordfish as well?
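
To make the question concrete, here is a rough sketch of what deterministic, SVD-based starting values for document positions could look like using base R's dense svd(). This is an illustration under my own assumptions, not quanteda's internal routine and not stm's spectral initialization: `svd_start` is a hypothetical name, and the smoothing constant 0.1 and the double-centering step are choices made for the example.

```r
# Hypothetical helper: starting values for document positions from a dense SVD
# of double-centered log counts. `dir` fixes the sign so that document dir[1]
# is placed to the left of document dir[2].
svd_start <- function(counts, dir = c(1, 2)) {
  lc <- log(counts + 0.1)                        # smoothed log counts (assumed constant)
  resid <- sweep(lc, 1, rowMeans(lc))            # remove document (row) effects
  resid <- sweep(resid, 2, colMeans(resid))      # remove feature (column) effects
  theta <- svd(resid, nu = 1, nv = 0)$u[, 1]     # first left singular vector
  if (theta[dir[1]] > theta[dir[2]]) theta <- -theta
  as.numeric(scale(theta))                       # mean 0, sd 1
}

# Toy document-feature counts: rows are documents, columns are features.
counts <- matrix(c(10, 8,  2, 1,
                    9, 7,  3, 2,
                    2, 3,  9, 8,
                    1, 2, 10, 9), nrow = 4, byrow = TRUE)

start1 <- svd_start(counts, dir = c(1, 4))
start2 <- svd_start(counts, dir = c(1, 4))
identical(start1, start2)   # repeated calls give the same starting values
```

There is no random component here, so no seed is needed; the only remaining ambiguity is the sign of the singular vector, which the `dir` argument resolves.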

kbenoit commented 6 years ago

Yes, that's exactly the "known issue" to which I referred. I'm going to change the default in quanteda shortly so that it is textmodel_wordfish(x, sparse = FALSE), which will also solve the problem. But I agree we should allow passing parameters (probably through ...) to wordfish through textmodel_wordshoal().
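
As a sketch of the pass-through idea, the base-R pattern looks roughly like this. The two functions are hypothetical stand-ins (`outer_fit` for textmodel_wordshoal(), `inner_fit` for textmodel_wordfish()), not the actual package code; the point is only how `...` forwards extra arguments to the per-group fit.

```r
# Hypothetical inner fitter: it just reports the number of rows it received
# and which `sparse` setting was passed to it.
inner_fit <- function(x, sparse = TRUE) {
  list(n = nrow(x), sparse = sparse)
}

# Hypothetical outer function: split the rows by group and forward any extra
# arguments in `...` unchanged to the inner fitter.
outer_fit <- function(x, groups, ...) {
  idx <- split(seq_len(nrow(x)), groups)
  lapply(idx, function(i) inner_fit(x[i, , drop = FALSE], ...))
}

x <- matrix(1:12, nrow = 4)
fits <- outer_fit(x, groups = c("a", "a", "b", "b"), sparse = FALSE)
fits$a$sparse
```

With this pattern the outer function does not need to know about every option of the inner one, so a new wordfish argument would be usable from wordshoal without a further change to its signature.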

kbenoit commented 6 years ago

I just changed the default in quanteda.