cluelessgumshoe closed this issue 5 years ago
I think it is working now, but advice on the method would still be awesome. If I replace the gather() calls in my functions with do(g.fun1(.)) %>%, where g.fun1 is a function loaded onto the cluster that returns the result of the gather() call, it seems to work. I also had to move the collect() to above the ungroup().
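For anyone reading later, here is a minimal local sketch of the wrapper idea described above, using only dplyr and tidyr. The toy data and its two question columns are made up for illustration, and the body of g.fun1 is an assumption about what such a helper might look like, not the code actually used in this issue. It runs on a single core; with multidplyr the helper would additionally have to be copied to every worker, which this sketch does not do.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for dataset1 (not from the thread).
toy <- tibble(
  fake_id = c(1, 1, 2),
  Q1A1_ImportantTo = c("1", "2", "3"),
  Q1A2_ImportantTo = c("4", "5", "6")
)

# Assumed shape of the helper: take one group's rows and return them
# reshaped from wide to long with gather().
g.fun1 <- function(df) {
  gather(df, item, import_to, Q1A1_ImportantTo:Q1A2_ImportantTo)
}

# do() calls the helper once per group, so the reshape happens inside
# each group instead of on the whole table at once.
long <- toy %>%
  group_by(fake_id) %>%
  do(g.fun1(.)) %>%
  ungroup()
```

Because do() only needs a function that takes a data frame and returns a data frame, the reshape can be handed to each group separately, which is presumably why the same trick gets around the missing gather() support on partitioned data.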
gather() comes from tidyr, which is not currently supported by multidplyr.
We are having trouble getting a complicated dplyr transformation to work, and I was hoping that multidplyr might be the answer. I think I have initialized the cluster correctly, because the pipeline runs until it reaches the gather() call. Thinking about it, I can see why spread() might not work conceptually, but gather() seems like it could work across cores. Any tips would be appreciated! There could be typos (I changed variable names before posting).
# setup parallel
library(parallel)
library(multidplyr)

cl <- create_cluster(detectCores())
set_default_cluster(cl)

# tried initializing the cluster here first, then below later to try on the gather()
dataset1_cl <- dataset1 %>%
  partition(fake_id, cluster = cl)
cluster_library(dataset1_cl, "tidyverse")
# multidplyr:
q1_to <- dataset1 %>%
  # Remove empty randomized IDs
  filter(
    is.na(fake_id) == FALSE & is.na(fake_a_id) == FALSE
  ) %>%
  select(fake_a_id, fake_id, company, agency, interviewer, a_date, ends_with("ImportantTo")) %>%
  select(fake_a_id, fake_id, company, agency, interviewer, a_date, starts_with("Q1")) %>%
  mutate_at(
    .funs = funs(as.character),
    .vars = vars(starts_with("Q1"))
  ) %>%
  collect() %>%
  partition(fake_id, cluster = cl) %>%
  gather(item, import_to, Q1A1_ImportantTo:Q1B13_ImportantTo) %>%
  mutate(
    import_to_n = as.numeric(import_to),
    item = gsub("ImportantTo", "", item),
    id = as.factor(paste0(fake_a_id, item))
  ) %>%
  group_by(id, fake_a_id, fake_id, company, agency, interviewer, a_date, item, import_to) %>%
  summarize(
    n = n_distinct(id),
    import_to_n = sum(import_to_n)
  ) %>%
  ungroup()