FinScience closed this issue 7 years ago
No, but I'd argue correlation is the wrong measure. I'd opt for cosine similarity instead: https://github.com/trinker/clustext/blob/master/R/cosine_distance.R
@trinker - Thanks a lot for the input. I am running into a few issues, though. My tdm is huge, so a memory allocation error pops up. I tried to run it across a cluster, but that failed. Since I am pretty new to parallel R, a bit of assistance would be great. The code I am using is shared below (the tdm is different, though).
library(tm)
library(parallel)
if (!require("pacman")) install.packages("pacman")
pacman::p_load(clustext, dplyr, textshape, ggplot2, tidyr)
data("crude")
tdm <- TermDocumentMatrix(crude)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
clusterExport(cl,'tdm')
x = parLapply(cl,tdm, fun = cosine_distance.DocumentTermMatrix(tdm))
The output I am looking for is something like this:
Word    Related_Word    cosine_distance
oil     opec            0.5
oil     spill           0.3
...     ...             ...
I am pretty sure that I am not doing it the right way. Any assistance with this would be great.
Please provide the error message. Also, why are you running it in parallel? Have you tried it without parallel processing?
I did try it without parallel processing. The dataset I have is huge; the one used here is just for reference. The error from this code is as follows:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: c("'quote(structure(c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
0.183503419072274, ' is not a function, character or symbol", "'1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, ' is not a function, character or symbol", "'1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, ' is not a function, character or symbol", "'1, 1, 1, 0, 1, 1, 0.552786404500042, 1, 1, 1, 1, 1, 1, 1, 1, '
is not a function, character or symbol", "'1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ' is not a
function, character or symbol",
"'1, 1, 1, 1, 1, 1, 1, 1, 1, 0.776393202250021, 1, 1, 1, 1, 1, ' is not a function, character or symbol", "'1,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ' is not a function, character or symbol", "'0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 0.422649730810374, 1, 1, 1, 1, ' is not a function, character or symbol", "'1, 1,
0.666666666666667, 1, 1, 1, 1, 1, 1, 1, 1, 0.51243357210427, ' is n
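One possible reading of the error above: in `parLapply(cl, tdm, fun = cosine_distance.DocumentTermMatrix(tdm))`, the `fun` argument receives the result of calling the function rather than the function itself, so the workers are handed a numeric structure where a function is expected. The following is only a minimal sketch of the calling convention, using a toy matrix and a hypothetical helper `cosine_to_all` (neither comes from this thread or from clustext):

```r
library(parallel)

# Toy term-by-document count matrix (rows = terms, columns = documents).
m <- matrix(c(1, 0, 2,
              0, 1, 1,
              3, 0, 0,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("oil", "opec", "spill", "price"), NULL))

# Hypothetical helper: cosine similarity between row i and every row of mat.
cosine_to_all <- function(i, mat) {
  norms <- sqrt(rowSums(mat^2))
  as.vector(mat %*% mat[i, ]) / (norms * norms[i])
}

cl <- makeCluster(2)
# Pass the function itself as `fun`; extra arguments go in by name.
res <- parLapply(cl, seq_len(nrow(m)), fun = cosine_to_all, mat = m)
stopCluster(cl)

# One row of similarities per term.
sim <- do.call(rbind, res)
dimnames(sim) <- list(rownames(m), rownames(m))
```

Note that running in parallel spreads the computation across workers but does not by itself reduce the total memory the result needs.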
When I run it on my own dataset without parallel processing, I get the following error:
cosine_distance.DocumentTermMatrix(doc_term_mat)
Error: cannot allocate vector of size 1162.4 Gb
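As a quick sanity check on that number, and assuming `doc_term_mat` has the dimensions given in the original question in this thread (815026 terms by 191431 docs), the requested size is consistent with a single dense matrix of 8-byte doubles of that shape:

```r
# Dimensions as stated in the original question; assumed to apply to doc_term_mat.
n_terms <- 815026
n_docs  <- 191431
bytes_per_double <- 8

# Size of one dense numeric matrix of that shape, in units of 1024^3 bytes.
dense_gib <- n_terms * n_docs * bytes_per_double / 2^30
round(dense_gib, 1)
# 1162.4
```

If that reading is right, the allocation fails when the sparse matrix is converted to a dense one, before any distances are computed.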
Gotcha. Sorry, but I don't think I have any tools that handle things that large. I would suggest posting a question with a reproducible example to Stack Overflow asking how to compute cosine similarity between words for a large DocumentTermMatrix.
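For what it's worth, one way such a question is often approached is to keep the matrix sparse and compute the similarity of a single term against all others, so the full term-by-term matrix is never built. This is a rough sketch only, not clustext's method: it uses the Matrix package on a toy matrix, and `cosine_one_term` is a hypothetical helper. A tm TermDocumentMatrix would first need converting to a sparse Matrix object, which is not shown here.

```r
library(Matrix)

# Toy sparse term-by-document matrix (rows = terms, columns = documents).
tdm_sparse <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 4, 4, 4),
  j = c(1, 3, 2, 3, 1, 1, 2, 3),
  x = c(1, 2, 1, 1, 3, 1, 1, 1),
  dims = c(4, 3),
  dimnames = list(c("oil", "opec", "spill", "price"), paste0("doc", 1:3))
)

# Hypothetical helper: cosine similarity of one term against every term.
# Terms with an all-zero row would produce NaN here.
cosine_one_term <- function(mat, term) {
  idx   <- match(term, rownames(mat))
  norms <- sqrt(Matrix::rowSums(mat^2))      # norm of each term vector
  dots  <- as.vector(mat %*% mat[idx, ])     # dot products with the target term
  sim   <- dots / (norms * norms[idx])
  names(sim) <- rownames(mat)
  sim
}

sims <- cosine_one_term(tdm_sparse, "oil")

# Long format similar to the requested output, most similar terms first.
out <- data.frame(Word = "oil",
                  Related_Word = names(sims),
                  cosine_similarity = unname(sims))
out <- out[out$Related_Word != "oil", ]
out <- out[order(-out$cosine_similarity), ]
```

A cosine distance, if that is what is wanted, is commonly taken as 1 minus this similarity. Whether this scales to 815026 terms in practice depends on how sparse the real matrix is.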
Hi, I have a term-document matrix that, after filtering, has 815026 terms and 191431 docs. My idea is to get the correlation between one term and all other terms. I tried using
However, this takes a huge amount of time even to find the correlation of one term with all others. Is there a better way to approach this through qdap?