Open trinker opened 7 years ago
Dump everything out to temp RDS files and read them back in on the cluster workers... add a library arg
Initial attempts lead to an error on Windows. parallel seems to be using an old version of R and throws an error about Rcpp being the wrong version. I fixed this by putting a newer version of R on the path, but now there is an error related to sentimentr, which suggests an old version is still being picked up. Maybe I need to remove all other R versions from the path?
if (!require("pacman")) install.packages("pacman")
pacman::p_load(sentimentr, parallel, textshape, dplyr)

chunk_size <- 1e5
dir.create('data', showWarnings = FALSE)

## Replicate the combined data 100 times, shuffle the rows, and split into chunks
dat <- combine_data() %>%
    {.[rep(seq_len(nrow(.)), 100),]} %>%
    sample_n(nrow(.)) %>%
    split_index({inds <- chunk_size * 1:round(nrow(.)/chunk_size, 0); inds[inds < nrow(.)]})

tic <- Sys.time()

cl <- makeCluster(mc <- getOption("cl.cores", detectCores() - 2))

## Load the packages on every worker
clusterEvalQ(cl, {
    library(sentimentr)
    library(lexicon)
})

parLapply(cl, dat, function(x){
    gc()
    senti_dat <- sentimentr::get_sentences(x)
    senti_dat <- sentimentr::sentiment_by(senti_dat)
    ## Draw a single random id; sample(1:100000) alone returns all 100000 values
    outfile <- sprintf('data/file_%s.rds', sample(1:100000, 1))
    saveRDS(senti_dat, outfile)
}) %>%
    invisible()

stopCluster(cl)

Sys.time() - tic
Results in:
Error in checkForRemoteErrors(val) :
6 nodes produced errors; first error: 'get_sentences' is not an exported object from 'namespace:sentimentr'
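Since get_sentences is exported by current sentimentr, the error above is consistent with the workers loading an older install. A minimal diagnostic sketch, assuming a default PSOCK cluster from parallel: it asks each worker which R it runs, where it looks for packages, and which sentimentr version it sees (NA if none), then pushes the main session's library paths to the workers. Whether this resolves the Windows problem is untested here.

```r
library(parallel)

# Small PSOCK cluster (the default type, and the one available on Windows)
cl <- makeCluster(2)

# Ask each worker about its R version, library paths, and sentimentr version
worker_info <- clusterEvalQ(cl, list(
    r_version = R.version.string,
    lib_paths = .libPaths(),
    sentimentr = tryCatch(
        as.character(utils::packageVersion("sentimentr")),
        error = function(e) NA_character_  # package not found on this worker
    )
))

# Compare each worker's R version with the main session's
same_r <- vapply(
    worker_info,
    function(w) identical(w$r_version, R.version.string),
    logical(1)
)
print(same_r)

# Make every worker search the same library paths as the main session
invisible(clusterCall(cl, function(paths) .libPaths(paths), .libPaths()))

stopCluster(cl)
```

If any element of same_r is FALSE, or the sentimentr version differs from the main session's, the workers are not using the same installation as the main session.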
http://appliedpredictivemodeling.com/blog/2018/1/17/parallel-processing
Is either of the following a better way to run parallel code:
https://github.com/r-lib/callr https://github.com/r-lib/processx
An OS-independent solution is needed. Reinvestigate the available solutions and reach out to the R community for current best practices.
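One possible shape for the idea from the top of this issue (dump chunks to temp RDS files, let the workers read them back, and add a library arg), sketched with base parallel only. PSOCK clusters run on every OS, which is why they are used here. The names par_by_rds, process_file, chunk_fun, and lib are hypothetical and not part of sentimentr; the toy call at the end uses sum() in place of the sentiment functions.

```r
library(parallel)

# Hypothetical worker function: read one chunk, process it, write the result.
# Defined at top level so no large enclosing environment is shipped to workers.
process_file <- function(path, chunk_fun) {
    out <- file.path(dirname(path), paste0("out_", basename(path)))
    saveRDS(chunk_fun(readRDS(path)), out)
    out  # return only the file path, not the data
}

# Hypothetical helper: run chunk_fun over chunks, passing data through RDS files
par_by_rds <- function(chunks, chunk_fun, cores = 2, lib = .libPaths()) {
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl), add = TRUE)

    tmp <- tempfile("chunks_")
    dir.create(tmp)
    on.exit(unlink(tmp, recursive = TRUE), add = TRUE)

    # Dump each chunk to its own RDS file
    in_files <- file.path(tmp, sprintf("in_%05d.rds", seq_along(chunks)))
    invisible(Map(saveRDS, chunks, in_files))

    # The "library arg": point every worker at the requested library paths
    clusterCall(cl, function(paths) .libPaths(paths), lib)

    # Workers receive file paths only and read the data themselves
    out_files <- parLapply(cl, in_files, process_file, chunk_fun = chunk_fun)

    lapply(out_files, readRDS)
}

# Toy usage: two chunks of integers, summed on the workers
res <- par_by_rds(split(1:10, rep(1:2, each = 5)), function(x) sum(x))
```

For the sentiment case, chunk_fun would wrap the get_sentences and sentiment_by calls, and lib would point at the library that holds the current sentimentr.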
Here's where I ask the R community: https://twitter.com/tylerrinker/status/1044364197797265408
Some other packages:
A parallel option that runs sentiment and sentiment_by on multiple cores