Index Error for wordshoal

kbenoit commented 6 years ago

Issue by methodds Friday Jan 20, 2017 at 13:10 GMT Originally opened as https://github.com/kbenoit/quanteda/issues/488

After finding single row groups / authors thanks to #481 , using textmodel_wordshoal raises another error:

shoal14 <- textmodel_wordshoal(dfm14, groups = d14p$topic_id, 
                        authors =  d14p$person_id )

Scaling 818 document groups 1 2 3 4 5 6 [...] 311

Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : index larger than maximal 10
14. stop(gettextf("index larger than maximal %d", n), domain = NA)
13. intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE)
12. subCsp_rows(x, i, drop = drop)
11. as(x, "Matrix")[i, , ..., drop = FALSE]
10. as(x, "Matrix")[i, , ..., drop = FALSE]
9. data[zeroLengthDocs, ]
8. data[zeroLengthDocs, ]
7. docnames(data[zeroLengthDocs, ])
6. paste(..., sep = sep)
5. message(paste(..., sep = sep), appendLF = appendLF)
4. catm("Note: removed the following zero-token documents:", docnames(data[zeroLengthDocs, ]), "\n")
3. textmodel_wordfish(groupdfm, dir = dir, tol = c(tol, 1e-08))
2. textmodel_wordshoal.dfm(dfm14, groups = d14p$topic_id, authors = d14p$person_id)
1. textmodel_wordshoal(dfm14, groups = d14p$topic_id, authors = d14p$person_id)

A reproduction file is available here.

kbenoit commented 6 years ago

Comment by kbenoit Friday Jan 20, 2017 at 14:02 GMT

That was a bug in how textmodel_wordfish reports the zero-feature-count documents it removes. It is now fixed.

But: It creates a problem in textmodel_wordshoal for your data because one "document" contains only a single feature count:

which(ntoken(dfm14)==1)
# text3885 
#     3885

The problem here is that in iteration 312 of M, line 110 of textmodel-wordshoal.R removes this feature before the data are sent to textmodel_wordfish. (Note to @lauderdale : this line does not do what its comment says. The comment refers to document frequency, but the code uses term frequency.)
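For readers unsure about the distinction drawn in the note above, here is a minimal base R sketch of term frequency versus document frequency. The toy matrix, the feature names and the `min_freq` threshold are made up for illustration; this is not the code in textmodel-wordshoal.R.

```r
# Toy document-feature counts: 3 documents (rows) x 3 features (columns).
counts <- matrix(
  c(5, 0, 1,
    0, 0, 1,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("rare_heavy", "unused", "common_light"))
)

# Term frequency: total count of each feature across all documents.
term_freq <- colSums(counts)

# Document frequency: number of documents in which each feature occurs.
doc_freq <- colSums(counts > 0)

min_freq <- 2  # hypothetical threshold, for illustration only

# The two rules keep different features: "rare_heavy" occurs 5 times
# but in only one document.
keep_by_term_freq <- names(term_freq)[term_freq >= min_freq]
keep_by_doc_freq  <- names(doc_freq)[doc_freq >= min_freq]
```

A feature used heavily by a single document passes a term-frequency filter but fails a document-frequency filter, so the two rules can trim a group's matrix quite differently.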

This results in the removal of the only feature, causing a document with all zero feature counts to be sent to textmodel_wordfish, which removes this document in lines 118-121. This results in a different length of theta being returned to textmodel_wordshoal and causes a bug/warning:

Warning message:
In psi[groups == levels(groups)[j]] <- wfresult@theta :
  number of items to replace is not a multiple of replacement length

@lauderdale needs to fix this one.
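The length mismatch described above can be illustrated with plain matrices. This is only a sketch under stated assumptions: the trimming rule, the toy counts and the `theta` values are invented, and writing results back by document name is one possible guard, not necessarily the fix that will be adopted.

```r
# Toy counts for one group: doc2 has a single token, on feature "c".
group_counts <- matrix(
  c(3, 2, 0,
    0, 0, 1,
    4, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("a", "b", "c"))
)

# Hypothetical trimming rule: drop features whose total count is below 2.
trimmed <- group_counts[, colSums(group_counts) >= 2, drop = FALSE]

# doc2 now has no tokens left, so a scaling step that drops empty documents
# would return estimates for only two of the three documents.
empty_docs <- rowSums(trimmed) == 0

# Made-up estimates for the two surviving documents.
theta <- c(doc1 = -0.5, doc3 = 0.7)

# Guard: write results back by document name and leave dropped documents
# as NA, instead of assigning a shorter vector into a longer slot.
psi <- setNames(rep(NA_real_, nrow(group_counts)), rownames(group_counts))
psi[names(theta)] <- theta
```

Assigning the two-element `theta` positionally into three slots is what produces the "number of items to replace" warning; matching by name avoids it and makes the dropped document visible as `NA`.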

kbenoit commented 6 years ago

Comment by kbenoit Friday Jan 20, 2017 at 14:02 GMT

But a workaround for you @methodds is to get rid of that uninformative document!

kbenoit commented 6 years ago

Comment by methodds Friday Jan 20, 2017 at 14:21 GMT

Thanks for the hint. Is there an easy way to remove rows in dfms with, say, fewer than 5 non-zero entries and return the corresponding index? This could be used to remove such uninformative documents and the corresponding metadata in a data frame/corpus.

Edit: your example from above works perfectly for this, thanks.

kbenoit commented 6 years ago

Comment by kbenoit Friday Jan 20, 2017 at 14:59 GMT

dfm_trim

or just use indexing to subset documents based on a condition
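A minimal sketch of the indexing approach, shown on a plain base R matrix so it runs without any packages. It assumes a dfm can be row-indexed the same way; the counts, the `meta` data frame and the `min_nonzero` threshold are made up for illustration.

```r
# Toy document-feature counts and matching per-document metadata.
counts <- matrix(
  c(1, 2, 1, 3, 1, 0,
    0, 0, 4, 0, 0, 0,
    2, 1, 1, 1, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("text1", "text2", "text3"), paste0("f", 1:6))
)
meta <- data.frame(
  person_id = c("p1", "p2", "p3"),
  topic_id  = c("t1", "t1", "t2"),
  stringsAsFactors = FALSE
)

min_nonzero <- 5  # hypothetical threshold

# Number of features with a non-zero count in each document.
nonzero_per_doc <- rowSums(counts > 0)

# Named index of the documents to drop, and a logical mask of those to keep.
drop_idx <- which(nonzero_per_doc < min_nonzero)
keep <- nonzero_per_doc >= min_nonzero

# Apply the same mask to the counts and the metadata so they stay aligned.
counts_kept <- counts[keep, , drop = FALSE]
meta_kept   <- meta[keep, , drop = FALSE]
```

Keeping `drop_idx` around means the same rows can later be removed from any other object that shares the document order, such as the vectors passed as `groups` and `authors`.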

kbenoit commented 6 years ago

Comment by methodds Friday Jan 20, 2017 at 15:06 GMT

Sorry to hijack this again, but I don't want to spam you with 240982340 issues: How can I get the group-level estimates out of a fitted wordshoal model? Summary only returns the author-specific estimates.

kbenoit commented 6 years ago

Comment by kbenoit Friday Jan 20, 2017 at 15:46 GMT

A question for @lauderdale

kbenoit commented 6 years ago

Comment by kbenoit Wednesday May 24, 2017 at 12:51 GMT

@lauderdale I'm removing this as a PR and leaving it as an open issue. If you want to pick up where I started, it's branch dev_wordshoal.

kbenoit commented 6 years ago

@methodds can you check if this bug still exists?

cschwem2er commented 6 years ago

It still throws a looot of these warnings:

In textmodel_wordfish.dfm(groupdfm, tol = c(tol, 1e-08)) :
  Warning: The algorithm did not converge.

But the example from above does not raise an error anymore :)

Regarding the warnings: Is this something I should be alarmed about? I have no idea how you came up with the default convergence tolerance, but maybe this parameter needs to be tuned for wordshoal? For the example above this might very well just be a data issue though.

amatsuo commented 6 years ago

The warning comes from the debate-level estimates. Since wordshoal runs a lot of wordfish models, some of them may not converge, especially those with a small number of speakers. Maybe it makes sense to remove these non-converged debates from the second stage, but I haven't decided whether this is the way to go.
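If dropping non-converged debates turns out to be the way to go, one way to identify them is to catch the warning per fit. The sketch below is base R only and rests on assumptions: `fit_group()` is a hypothetical stand-in that warns for small groups, not the real wordfish fit, and `fit_with_flag()` is an illustrative helper, not part of the package.

```r
# Stand-in for a per-debate fit. It warns for small groups, imitating the
# non-convergence warning quoted above. It is not the real wordfish fit.
fit_group <- function(n_speakers) {
  if (n_speakers < 3) warning("The algorithm did not converge.")
  list(theta = seq_len(n_speakers))
}

# Run one fit and record whether a non-convergence warning was raised.
fit_with_flag <- function(n_speakers) {
  converged <- TRUE
  fit <- withCallingHandlers(
    fit_group(n_speakers),
    warning = function(w) {
      if (grepl("did not converge", conditionMessage(w), fixed = TRUE)) {
        converged <<- FALSE
        invokeRestart("muffleWarning")  # silence only this warning
      }
    }
  )
  list(fit = fit, converged = converged)
}

# Made-up numbers of speakers per debate.
speakers_per_debate <- c(debate1 = 10, debate2 = 2, debate3 = 6)

results <- lapply(speakers_per_debate, fit_with_flag)
converged <- vapply(results, function(r) r$converged, logical(1))

# Only converged debates would be passed on to the second stage.
second_stage_input <- results[converged]
```

Matching on the warning text is fragile; if the fit exposed a convergence flag directly, checking that flag would be the cleaner choice.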

kbenoit commented 6 years ago

A question for @lauderdale ...
