Closed: lwarode closed this issue 1 year ago
Well, you have 14.8 billion cells, and you are trying to estimate two parameters for every feature and for every document. That alone will likely cause your computer to crash, but without any machine information we cannot really tell.
When your matrix is very sparse, you are also asking the estimation routine to perform a nearly impossible task, since most of the matrix consists of zeroes. How sparse is `dfm_raw`? My guess is >= 95%. You can try setting `sparse = TRUE`, but the better solution is to trim the dfm as you have done.
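To make the two points above concrete: sparsity is just the share of zero cells, and a matrix with 14.8 billion cells cannot be densified on an ordinary machine. A base-R sketch with toy numbers (not the actual Bundestag dfm):

```r
# Toy document-feature matrix: 4 "documents" x 6 "features"
m <- matrix(c(3, 0, 0, 0, 0, 1,
              0, 2, 0, 0, 0, 0,
              0, 0, 0, 5, 0, 0,
              1, 0, 0, 0, 0, 0), nrow = 4, byrow = TRUE)

# Proportion of zero cells; quanteda's sparsity() reports the same quantity
sparsity <- sum(m == 0) / length(m)
sparsity  # 0.7916667 -> ~79% of cells are zero

# Why the untrimmed dfm is hopeless in dense form:
cells <- 14.8e9
gib <- cells * 8 / 1024^3  # 8 bytes per double
round(gib)  # ~110 GiB for one dense copy, before any working memory
```

A single dense copy of the raw matrix already needs on the order of 110 GiB, which is why trimming (or sparse estimation) is not optional here.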
Thanks for the answer. Please note that I'm not using `dfm_raw` when fitting the model, which I would also expect to crash given its size. `dfm_party_3` has a sparsity of 65.52% (it is grouped/aggregated by party), while `dfm_raw` is of course very sparse (99.96%).

```r
model_wf_3 <- textmodel_wordfish(dfm_party_3)
```

This call leads to aborting the R session. `dfm_party` and `dfm_party_2` run without any problem while not being considerably smaller.
I'm working on a 16 GB RAM M1 MacBook Air (2020).
UPDATE: Apparently, setting `sparse = TRUE` works (the model finishes running within 1-2 seconds). However, I'm not sure whether I understand the implications given the documentation: "While setting this to `TRUE` will make it possible to handle larger dfm objects (and make execution faster), it will generate slightly different results each time, because the sparse SVD routine has a stochastic element." Is there a computational work-around for the sparse/stochastic estimation that does not lead to aborting the/my R session?

```r
model_wf_3 <- textmodel_wordfish(dfm_party_3, sparse = TRUE)
```
I will just share the details of the estimated model. I also tried how setting `sparse = TRUE` affects the smaller models, and the difference was very small for both `dfm_party` and `dfm_party_2`.
In addition, substantively speaking, the differences across the dfm/model setups (different levels of trimming) seem to be negligible.
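The "stochastic element" the documentation mentions is characteristic of randomized SVD routines, which project the matrix onto a random low-dimensional sketch before decomposing it. The following base-R illustration of that principle (not quanteda's actual implementation; `rand_svd` is an invented name for this sketch) shows why results vary between runs and why fixing the RNG seed makes them reproducible:

```r
# Minimal randomized SVD sketch: random projection, then exact SVD of the
# small sketched matrix. Illustrative only, not quanteda's routine.
rand_svd <- function(A, k, oversample = 5) {
  p <- k + oversample
  Omega <- matrix(rnorm(ncol(A) * p), ncol = p)  # random test matrix
  Q <- qr.Q(qr(A %*% Omega))                     # orthonormal basis for the sketch
  B <- crossprod(Q, A)                           # small p x ncol(A) matrix
  s <- svd(B, nu = k, nv = k)
  list(u = Q %*% s$u, d = s$d[1:k], v = s$v)
}

A <- matrix(rnorm(200 * 50), 200, 50)

set.seed(1); d1 <- rand_svd(A, k = 3)$d
set.seed(1); d2 <- rand_svd(A, k = 3)$d
all.equal(d1, d2)  # TRUE: same seed, identical results

d3 <- rand_svd(A, k = 3)$d  # no seed reset: a different random sketch
```

In the same spirit, calling `set.seed()` immediately before `textmodel_wordfish(..., sparse = TRUE)` should make repeated fits reproducible, assuming the sparse routine draws from R's RNG.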
Thanks in advance!
Hi all!
While trying to conduct an analysis of speeches of the 19th German Bundestag, I'm using wordfish. It works fine on trimmed dfms (the model finishes in a few seconds with a minimum term frequency of 25 and in a few minutes with a minimum term frequency of 10), while the untrimmed dfm won't finish and leads to aborting the R session.
The respective dfms do not vary considerably in size, so I don't really know where this hard failure comes from. I applied the following steps (the meaningful code parts extracted from my scripts below); please excuse that the formatting of the piped workflow is not perfectly copyable.
```r
dfm_raw <- corpus_et_19 %>%
  dfm()

dfm_party <- dfm_raw %>%
  dfm_group(party_short) %>%
  dfm_trim(min_termfreq = 25) %>%
  dfm_subset(!(party_short %in% c("Fraktionslos", "not found")))
```
These setups work (`model_wf`, `model_wf_2`):

```r
model_wf <- textmodel_wordfish(dfm_party)
pred_model_wf <- predict(model_wf, se.fit = TRUE)
textplot_scale1d(pred_model_wf)

dfm_party_2 <- dfm_raw %>%
  dfm_group(party_short) %>%
  dfm_trim(min_termfreq = 10) %>%
  dfm_subset(!(party_short %in% c("Fraktionslos", "not found")))

model_wf_2 <- textmodel_wordfish(dfm_party_2)
pred_model_wf_2 <- predict(model_wf_2, se.fit = TRUE)
textplot_scale1d(pred_model_wf_2)
```
While this model (`model_wf_3`, with no trimming applied) directly leads to aborting the R session:
```r
dfm_party_3 <- dfm_raw %>%
  dfm_group(party_short) %>%
  dfm_subset(!(party_short %in% c("Fraktionslos", "not found")))

model_wf_3 <- textmodel_wordfish(dfm_party_3)
pred_model_wf_3 <- predict(model_wf_3, se.fit = TRUE)
textplot_scale1d(pred_model_wf_3)
```
Thanks in advance.