Correlation among all words in a term document matrix

FinScience commented 7 years ago

Hi, I have a term-document matrix which, after filtering, has 815026 terms and 191431 docs. I want to get the correlation between one term and all other terms. I tried using

     findAssocs(filt_doc_term_mat, term_con, pearson_thres)

However, this takes a huge amount of time even to find the correlation of one term with all others. Is there a better way to approach this through qdap?

trinker commented 7 years ago

No, but I'd argue correlation is the wrong measure. I'd opt for cosine similarity instead: https://github.com/trinker/clustext/blob/master/R/cosine_distance.R
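For readers unfamiliar with the measure, here is a minimal sketch of cosine similarity between one term and every term in a small term-document matrix. The matrix `tdm_toy`, its counts, and the helper `cosine_to_all` are made up for illustration and are not part of clustext or qdap. Cosine distance is taken here to be 1 minus this similarity, which is the usual convention.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The counts are invented for illustration only.
tdm_toy <- matrix(
  c(2, 0, 1, 3,
    1, 0, 1, 2,
    0, 4, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("oil", "opec", "wheat"), paste0("doc", 1:4))
)

# Hypothetical helper: cosine similarity of one term (row) with every row.
cosine_to_all <- function(m, term) {
  v     <- m[term, ]
  dots  <- as.vector(m %*% v)      # dot product of v with each row
  norms <- sqrt(rowSums(m^2))      # Euclidean norm of each row
  setNames(dots / (norms * sqrt(sum(v^2))), rownames(m))
}

sims <- cosine_to_all(tdm_toy, "oil")
print(round(sims, 3))
```

Terms that never share a document with "oil" get a similarity of 0, and the term itself gets 1.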

FinScience commented 7 years ago

@trinker - Thanks a lot for the input. I am running into a few issues, though. My tdm is huge, so a memory allocation error pops up. I tried to do it through clusters, but that failed. Since I am pretty new to parallel R, a bit of assistance would be great. The code I am using is shared below (the tdm is different, though).

 library(tm)
 library(parallel)
 if (!require("pacman")) install.packages("pacman")
 pacman::p_load(clustext, dplyr, textshape, ggplot2, tidyr)

 data("crude")
 tdm <- TermDocumentMatrix(crude)

 no_cores <- detectCores() - 1

 cl <- makeCluster(no_cores)
 clusterExport(cl,'tdm')
 x = parLapply(cl,tdm, fun = cosine_distance.DocumentTermMatrix(tdm))

The output that I am looking for is something like this:

  Word   Related_Word   cosine_distance
  oil    opec           0.5
  oil    spill          0.3
  ...    ...            ...

I am pretty sure that I am not doing it the right way. Any assistance with this would be great.

trinker commented 7 years ago

Please provide the error message. Also, why are you running it in parallel? Have you tried it without parallelism?

FinScience commented 7 years ago

I did try it normally. The dataset I have is huge; the one shown here is just for reference. The error with this code is as follows:

    Error in checkForRemoteErrors(val) : 
    3 nodes produced errors; first error: c("'quote(structure(c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,     
    0.183503419072274, ' is not a function, character or symbol", "'1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,        
    1, 1, 1, 1, 1, ' is not a function, character or symbol", "'1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,   
     1, ' is not a function, character or symbol", "'1, 1, 1, 0, 1, 1, 0.552786404500042, 1, 1, 1, 1, 1, 1, 1, 1, ' 
      is not a function, character or symbol", "'1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ' is not a  
      function, character or symbol", 
     "'1, 1, 1, 1, 1, 1, 1, 1, 1, 0.776393202250021, 1, 1, 1, 1, 1, ' is not a function, character or symbol", "'1,   
      1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ' is not a function, character or symbol", "'0, 0, 1, 1,   
      1, 1, 1, 1, 1, 1, 0.422649730810374, 1, 1, 1, 1, ' is not a function, character or symbol", "'1, 1, 
     0.666666666666667, 1, 1, 1, 1, 1, 1, 1, 1, 0.51243357210427, ' is n
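A note on this message: "is not a function, character or symbol" is what R reports when the `fun` argument of `parLapply` is not a function. In the code above, `fun` was given the result of calling `cosine_distance.DocumentTermMatrix(tdm)`, not the function itself, which would explain why distance values appear in the error. The sketch below shows the expected calling pattern on a toy matrix; `row_norm` and `m` are hypothetical stand-ins, not the real workload.

```r
library(parallel)

# Hypothetical stand-in for the per-item work done on each worker.
row_norm <- function(i, m) sqrt(sum(m[i, ]^2))

m <- matrix(1:12, nrow = 3)

cl <- makeCluster(2)
# Pass the function itself as `fun`; extra arguments (here m) follow it
# and are forwarded to every call on the workers.
norms <- parLapply(cl, seq_len(nrow(m)), row_norm, m = m)
stopCluster(cl)

print(unlist(norms))
```

Running in parallel this way spreads the computation across cores, but each worker still needs the data it operates on, so it does not by itself remove a memory limit.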

Running it on my own dataset without parallel processing gives the following error:

       cosine_distance.DocumentTermMatrix(doc_term_mat)
        Error: cannot allocate vector of size 1162.4 Gb
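The reported size can be checked with a little arithmetic. Assuming `doc_term_mat` has the dimensions quoted in the first comment (815026 terms and 191431 docs), a dense matrix of doubles with that shape needs exactly the amount in the error, which suggests that a dense copy of the matrix is being requested somewhere along the way. This is an inference from the numbers, not something confirmed from the clustext source.

```r
n_terms <- 815026
n_docs  <- 191431

# A double takes 8 bytes; R reports allocation sizes in units of 1024^3 bytes.
dense_gb <- n_terms * n_docs * 8 / 1024^3
print(round(dense_gb, 1))  # 1162.4, the figure in the error message

# A full term-by-term similarity matrix would be several times larger still.
term_by_term_gb <- n_terms^2 * 8 / 1024^3
print(round(term_by_term_gb, 1))
```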

trinker commented 7 years ago

Gotcha. Sorry, but I don't think I have any tools that handle things that large. I would suggest posting a question with a reproducible example to Stack Overflow asking how to compute cosine similarity between words for a large DocumentTermMatrix.
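One possible direction for such a question is to keep the matrix sparse and compute one term against all others, so that no dense copy is ever created. The following is a rough sketch under stated assumptions, using the Matrix package on a made-up toy matrix. The helper `cosine_distance_to_all` is hypothetical, it has not been tried on data of the size discussed here, and it assumes no term has an all-zero row (which would produce NaN).

```r
library(Matrix)

# Toy sparse term-document matrix (terms in rows); counts are invented.
# Assumption: for a tm TermDocumentMatrix `tdm`, a sparse equivalent could be
# built from its triplet fields with
#   sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
#                dims = dim(tdm), dimnames = dimnames(tdm))
m <- sparseMatrix(
  i = c(1, 1, 1, 2, 2, 2, 3),
  j = c(1, 3, 4, 1, 3, 4, 2),
  x = c(2, 1, 3, 1, 1, 2, 4),
  dims = c(3, 4),
  dimnames = list(c("oil", "opec", "wheat"), paste0("doc", 1:4))
)

# Hypothetical helper: cosine distance from one term to all other terms.
cosine_distance_to_all <- function(m, term) {
  v     <- m[term, ]                      # one dense row, length = number of docs
  norms <- as.numeric(sqrt(rowSums(m^2))) # one norm per term, computed sparsely
  dots  <- as.vector(m %*% v)             # sparse matrix-vector product
  sim   <- dots / (norms * sqrt(sum(v^2)))
  out <- data.frame(
    Word = term,
    Related_Word = rownames(m),
    cosine_distance = 1 - sim
  )
  out <- out[out$Related_Word != term, ]  # drop the term's match with itself
  out[order(out$cosine_distance), ]       # closest terms first
}

res <- cosine_distance_to_all(m, "oil")
print(res)
```

The result has the three columns asked for earlier in the thread. Working on one query term at a time keeps memory use roughly proportional to the number of terms, not to its square.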

trinker / qdap

Correlation among all words in a term document matrix #228