trinker / qdap

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
http://cran.us.r-project.org/web/packages/qdap/index.html

converting dtm/tdm to Corpus and back errors #189

Closed trinker closed 10 years ago

trinker commented 10 years ago
library(qdap)  # provides as.Corpus() and as.dtm()
library(tm)
data("crude")
dtm <- DocumentTermMatrix(
    crude,
    control = list(
        weighting = function(x) weightTfIdf(x, normalize = FALSE),
        stopwords = TRUE
    )
)

## Round trip that triggers the error
as.dtm(as.Corpus(dtm))

I suspect this is due to not providing meta labels in as.Corpus, which must be fixed.

trinker commented 10 years ago

That is not the case, as this produces the same error:

z <- as.Corpus(dtm)
meta(z, "labels") <- names(meta(z, "labels"))
as.DocumentTermMatrix(z)

trinker commented 10 years ago

This does work, and it is what the idea is based on:

library(tm)
data("crude")
dtm <- DocumentTermMatrix(
    crude,
    control = list(
        weighting = function(x) weightTfIdf(x, normalize = FALSE),
        stopwords = TRUE
    )
)

## Convert the dtm to a character vector of text, one string per document
dtm2list <- apply(dtm, 1, function(x) {
    paste(rep(names(x), x), collapse=" ")
})

## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)

## Rebuild a document-term matrix from the corpus
DocumentTermMatrix(myCorp)
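
The rep(names(x), x) step above repeats each term as many times as its entry in the matrix row. A minimal base-R sketch of that step, using made-up vectors (counts and weights are illustrative assumptions, not values from crude). Note that rep() truncates non-integer times toward zero, so with tf-idf weights the rebuilt text only approximates the original matrix:

```r
## Hypothetical term counts for one document (illustrative values only)
counts <- c(oil = 2, price = 1, opec = 0)
doc_text <- paste(rep(names(counts), counts), collapse = " ")
doc_text

## Hypothetical fractional weights, as a tf-idf weighting would produce;
## rep() truncates them toward zero before repeating each name
weights <- c(oil = 2.7, price = 0.4)
weighted_text <- paste(rep(names(weights), weights), collapse = " ")
weighted_text
```

Terms whose weight is below 1 drop out of the rebuilt text entirely, which is worth keeping in mind when round-tripping a weighted matrix.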

trinker commented 10 years ago

The actual problem is that there is no Corpus method for as.DocumentTermMatrix/as.dtm or as.TermDocumentMatrix/as.tdm, so the following fails as well (using myCorp from the prior example):

as.DocumentTermMatrix(myCorp)

with the same error message:

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x),  : 
  dims [product 20] do not match the length of object [3]

However, calling DocumentTermMatrix(myCorp) directly worked. Because there was no Corpus method for converting to either of the two term-matrix forms, as.dtm was falling through to as.dtm.default:

as.dtm.default <- 
function(text.var, grouping.var = NULL, vowel.check = TRUE, ...) {
    tm::as.DocumentTermMatrix(x = text.var, ...)
}

And since tm provides no as.DocumentTermMatrix method for Corpus objects, the call ended up in tm's default method and the error occurred:

> methods(as.DocumentTermMatrix)
[1] as.DocumentTermMatrix.default*           
[2] as.DocumentTermMatrix.DocumentTermMatrix*
[3] as.DocumentTermMatrix.term_frequency*    
[4] as.DocumentTermMatrix.TermDocumentMatrix*
[5] as.DocumentTermMatrix.textcnt*           

   Non-visible functions are asterisked
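
The fall-through described here is ordinary S3 dispatch: with no method for a class, the generic lands on the default. A small base-R sketch of that behaviour, using a toy generic and a fake corpus object (as_dtm_demo and fake_corpus are hypothetical stand-ins, not qdap or tm objects):

```r
## Toy generic standing in for as.dtm (hypothetical name)
as_dtm_demo <- function(text.var, ...) UseMethod("as_dtm_demo")

## Only a default method exists at first
as_dtm_demo.default <- function(text.var, ...) "default method"

## Fake object carrying the same class vector a tm VCorpus has
fake_corpus <- structure(list("doc one", "doc two"),
                         class = c("VCorpus", "Corpus"))

## No Corpus method yet, so dispatch falls through to the default
before <- as_dtm_demo(fake_corpus)

## Adding a Corpus method changes which function is picked
as_dtm_demo.Corpus <- function(text.var, ...) "Corpus method"
after <- as_dtm_demo(fake_corpus)

c(before = before, after = after)
```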

So the fix is to add as.tdm.Corpus and as.dtm.Corpus methods as follows:

as.tdm.Corpus <- 
function(text.var, grouping.var = NULL, vowel.check = TRUE, ...) {
    tm::TermDocumentMatrix(x = text.var, ...)
}

as.dtm.Corpus <- 
function(text.var, grouping.var = NULL, vowel.check = TRUE, ...) {
    tm::DocumentTermMatrix(x = text.var, ...)
}
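
A quick check of the two methods, assuming tm is installed. The definitions are repeated so the block runs on its own, and the methods are called directly rather than through qdap's as.dtm/as.tdm generics, so dispatch itself is not exercised here (dtm_check and tdm_check are illustrative names):

```r
library(tm)
data("crude")

## Methods as proposed above, repeated so this block is self-contained
as.tdm.Corpus <-
function(text.var, grouping.var = NULL, vowel.check = TRUE, ...) {
    tm::TermDocumentMatrix(x = text.var, ...)
}

as.dtm.Corpus <-
function(text.var, grouping.var = NULL, vowel.check = TRUE, ...) {
    tm::DocumentTermMatrix(x = text.var, ...)
}

## Call the methods directly on the crude corpus
dtm_check <- as.dtm.Corpus(crude)
tdm_check <- as.tdm.Corpus(crude)

## Both should carry the expected classes, with transposed dimensions
inherits(dtm_check, "DocumentTermMatrix")
inherits(tdm_check, "TermDocumentMatrix")
all(dim(dtm_check) == rev(dim(tdm_check)))
```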