Open kbenoit opened 6 years ago
Comment by kbenoit Friday Jan 20, 2017 at 14:02 GMT
That was a bug in how textmodel_wordfish reported its removal of documents with zero feature counts; it is now fixed.
But it creates a problem in textmodel_wordshoal for your data, because one "document" contains only a single feature count:
which(ntoken(dfm14)==1)
# text3885
# 3885
The problem here is that in iteration 312 of M in line 110 of textmodel-wordshoal.R, this feature is removed before sending to textmodel_wordfish. (Note to @lauderdale: this line does not implement what the comment states it does, which refers to document frequency and not term frequency as in the code.)
This results in the removal of the only feature, causing a document with all zero feature counts to be sent to textmodel_wordfish, which removes this document in lines 118-121. This results in a different length of theta being returned to textmodel_wordshoal and causes a bug/warning:
Warning message:
In psi[groups == levels(groups)[j]] <- wfresult@theta :
number of items to replace is not a multiple of replacement length
@lauderdale needs to fix this one.
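To make the mechanism concrete, here is a minimal base R sketch of the same kind of length mismatch. The objects psi, groups and theta_a only mimic the names in the warning above, and their values are invented for illustration; this is not the package code.

```r
# Toy stand-ins for the objects named in the warning (all values invented).
groups <- factor(c("a", "a", "a", "b", "b"))
psi <- rep(NA_real_, length(groups))

# Group "a" has three documents, but suppose the fit dropped one all-zero
# document and therefore returned only two theta values.
theta_a <- c(-0.4, 0.7)

idx <- groups == levels(groups)[1]

# The slot count and the replacement length disagree (3 vs 2).
mismatch <- sum(idx) != length(theta_a)

# Capture the warning raised by the mismatched assignment instead of printing it.
res <- tryCatch({
  psi[idx] <- theta_a
  "no warning"
}, warning = function(w) conditionMessage(w))

print(mismatch)
print(res)
```

A length check such as stopifnot(sum(idx) == length(theta_a)) before the assignment would turn the silent recycling into a hard error, which may be easier to debug.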
Comment by kbenoit Friday Jan 20, 2017 at 14:02 GMT
But a workaround for you @methodds is to get rid of that uninformative document!
Comment by methodds Friday Jan 20, 2017 at 14:21 GMT
Thanks for the hint. Is there an easy way to remove rows in dfms with, say, fewer than 5 non-zero entries and return the corresponding index? This could be used to remove such uninformative documents and the corresponding metadata in a dataframe/corpus.
Edit: your example from above works perfectly for this, thanks.
Comment by kbenoit Friday Jan 20, 2017 at 14:59 GMT
Use dfm_trim, or just use indexing to subset documents based on a condition.
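The indexing approach can be sketched as follows. To keep the example self-contained it uses a plain base R matrix as a stand-in for a dfm; the assumption is that a dfm accepts the same logical row indexing. The names counts, meta and min_nonzero are made up for illustration.

```r
# Toy document-feature counts (rows are documents, columns are features).
counts <- matrix(
  c(2, 1, 0, 3, 1, 4,   # doc1: 5 non-zero features
    0, 0, 1, 0, 0, 0,   # doc2: 1 non-zero feature
    1, 1, 1, 1, 1, 0),  # doc3: 5 non-zero features
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("feat", 1:6))
)

# Hypothetical document-level metadata, one row per document.
meta <- data.frame(author = c("A", "B", "C"), row.names = rownames(counts))

# Keep documents with at least min_nonzero non-zero entries.
min_nonzero <- 5
keep <- rowSums(counts > 0) >= min_nonzero

# Index (and name) of the documents that get dropped.
dropped <- which(!keep)

# Apply the same logical index to the counts and to the metadata.
trimmed <- counts[keep, , drop = FALSE]
meta_trimmed <- meta[keep, , drop = FALSE]

print(dropped)
print(rownames(trimmed))
```

Because the counts and the metadata are subset with the same logical vector, their rows stay aligned.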
Comment by methodds Friday Jan 20, 2017 at 15:06 GMT
Sorry to hijack this again, but I don't want to spam you with 240982340 issues: How can I get the group-level estimates out of a fitted wordshoal model? Summary only returns the author-specific estimates.
Comment by kbenoit Wednesday May 24, 2017 at 12:51 GMT
@lauderdale I'm removing this as a PR and leaving it as an open issue. If you want to pick up where I started, it's branch dev_wordshoal.
@methodds can you check if this bug still exists?
It still throws a looot of these warnings:
In textmodel_wordfish.dfm(groupdfm, tol = c(tol, 1e-08)) :
Warning: The algorithm did not converge.
But the example from above does not raise an error anymore :)
Regarding the warnings: Is this something I should be alarmed about? I have no idea how you came up with the default convergence tolerance, but maybe this parameter needs to be tuned for wordshoal? For the example above this might very well just be a data issue though.
The warning comes from the debate-level estimates. Since wordshoal runs a lot of wordfish models, some of them may not converge, especially the ones with a small number of speakers. Maybe it makes sense to remove these non-converged debates from the second stage, but I haven't decided whether this is the way to go.
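If dropping non-converged debates turns out to be the way to go, one possible shape for it is sketched below in base R. Everything here is hypothetical: fit_debate is an invented stand-in for a single debate-level fit, and the sketch treats any warning at all as a sign of non-convergence, which is a simplifying assumption.

```r
# Hypothetical stand-in for one debate-level fit: it warns when the debate
# has very few speakers, mimicking a convergence warning.
fit_debate <- function(n_speakers) {
  if (n_speakers < 3) warning("The algorithm did not converge.")
  list(theta = seq_len(n_speakers))
}

# Run one fit, muffle its warnings, and record whether any were raised.
fit_with_flag <- function(n_speakers) {
  warned <- FALSE
  result <- withCallingHandlers(
    fit_debate(n_speakers),
    warning = function(w) {
      warned <<- TRUE
      invokeRestart("muffleWarning")
    }
  )
  list(result = result, converged = !warned)
}

# Invented debate sizes; d2 is small enough to trigger the warning.
speakers_per_debate <- c(d1 = 10, d2 = 2, d3 = 6)
fits <- lapply(speakers_per_debate, fit_with_flag)

# Keep only the debates whose fit raised no warning.
converged <- vapply(fits, function(f) f$converged, logical(1))
kept <- fits[converged]

print(converged)
print(names(kept))
```

Flagging rather than silently discarding keeps the decision visible, so the flagged debates can still be inspected or refit with a different tolerance.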
A question for @lauderdale ...
Issue by methodds Friday Jan 20, 2017 at 13:10 GMT Originally opened as https://github.com/kbenoit/quanteda/issues/488
After finding single-row groups / authors thanks to #481, using textmodel_wordshoal raises another error:
A reproduction file is available here.