quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

(untrimmed) textmodel_wordfish leads to abortion of R session, while trimmed model works #59

Closed lwarode closed 1 year ago

lwarode commented 1 year ago

Hi all!

When analysing speeches from the 19th German Bundestag, I am trying to use wordfish. It works fine on trimmed dfms (the model finishes in a few seconds with a minimum term frequency of 25 and in a few minutes with a minimum term frequency of 10), while the untrimmed dfm never finishes and aborts the R session.

The respective dfms do not vary considerably in size, so I don't know where this severe failure comes from. I applied the following steps (the relevant code parts are extracted from my scripts below).

dfm_raw <- corpus_et_19 %>%
  dfm()

dfm_party <- dfm_raw %>%
  dfm_group(party_short) %>%
  dfm_trim(min_termfreq = 25) %>%
  dfm_subset(!(party_short %in% c("Fraktionslos", "not found")))

These setups work (model_wf, model_wf_2):

model_wf <- textmodel_wordfish(dfm_party)
pred_model_wf <- predict(model_wf, se.fit = TRUE)
textplot_scale1d(pred_model_wf)

dfm_party_2 <- dfm_raw %>%
  dfm_group(party_short) %>%
  dfm_trim(min_termfreq = 10) %>%
  dfm_subset(!(party_short %in% c("Fraktionslos", "not found")))

model_wf_2 <- textmodel_wordfish(dfm_party_2)
pred_model_wf_2 <- predict(model_wf_2, se.fit = TRUE)
textplot_scale1d(pred_model_wf_2)

While this setup (model_wf_3; no trimming applied) immediately aborts the R session:

dfm_party_3 <- dfm_raw %>%
  dfm_group(party_short) %>%
  dfm_subset(!(party_short %in% c("Fraktionslos", "not found")))

model_wf_3 <- textmodel_wordfish(dfm_party_3)
pred_model_wf_3 <- predict(model_wf_3, se.fit = TRUE)
textplot_scale1d(pred_model_wf_3)


Thanks in advance.

kbenoit commented 1 year ago

Well, you have 14.8 billion cells, and you are trying to estimate two parameters for every feature and for every document. That alone is likely to crash your computer, but without any machine information we cannot really tell.

When your matrix is very sparse, you are also asking the estimation routine to perform a pretty impossible task, since most of it consists of zeroes. How sparse is dfm_raw? My guess is >= 95%. You can try setting sparse = TRUE, but the better solution is to trim the dfm as you have done.
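For reference, sparsity and dimensions can be checked in quanteda before fitting; a minimal sketch using the object names from this issue (assuming dfm_raw is already a dfm):

```r
library(quanteda)

sparsity(dfm_raw)  # proportion of zero cells, e.g. 0.9996
dim(dfm_raw)       # documents x features; their product is the total cell count

# Trimming rare features before fitting keeps the problem tractable:
dfm_small <- dfm_trim(dfm_raw, min_termfreq = 25)
sparsity(dfm_small)
```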

lwarode commented 1 year ago

Thanks for the answer. Please note that I'm not fitting the model on dfm_raw, which I would indeed expect to crash given its size. dfm_party_3 has a sparsity of 65.52% (it is grouped/aggregated by party), while dfm_raw is of course very sparse (99.96%).


model_wf_3 <- textmodel_wordfish(dfm_party_3) aborts the R session. dfm_party and dfm_party_2 run without any problem while not being considerably smaller.


I'm working on a 2020 MacBook Air with an M1 chip and 16 GB of RAM.


UPDATE: Apparently, setting sparse = TRUE works (the model finishes within 1-2 seconds). However, I'm not sure I understand the implications given the documentation: "While setting this to TRUE will make it possible to handle larger dfm objects (and make execution faster), it will generate slightly different results each time, because the sparse SVD routine has a stochastic element." Is there a computational workaround that avoids the sparse/stochastic estimation and does not abort my R session?

model_wf_3 <- textmodel_wordfish(dfm_party_3, sparse = TRUE)
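As a side note on reproducibility: if the stochastic element of the sparse SVD routine draws from R's random number generator (an assumption worth verifying on your setup, not something the documentation confirms), fixing the seed should make repeated fits identical:

```r
# Assumption: the sparse SVD's randomness uses R's RNG, so set.seed()
# pins it down; check on your own data.
set.seed(42)
fit_a <- textmodel_wordfish(dfm_party_3, sparse = TRUE)
set.seed(42)
fit_b <- textmodel_wordfish(dfm_party_3, sparse = TRUE)

# Compare the estimated document positions from the two runs
all.equal(fit_a$theta, fit_b$theta)
```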


I will just share the details of the estimated model. I also tried how setting sparse = TRUE affects the smaller models; the difference was really small for both dfm_party and dfm_party_2.

In addition, substantively speaking, the differences across the dfm/model setups (different levels of trimming) seem to be negligible.
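One way to quantify how small those differences are is to correlate the estimated document positions (theta) across the setups; a sketch using the model objects from above, assuming the grouped documents line up across the fits:

```r
# Correlation of party positions across trimming levels;
# values close to 1 would indicate negligible substantive differences.
cor(model_wf$theta, model_wf_2$theta)
cor(model_wf_2$theta, model_wf_3$theta)
```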

Thanks in advance!