cluelessgumshoe commented 5 years ago

We are having trouble getting a complicated dplyr transformation to work, and I was hoping that multidplyr might be the answer. I think I have initialized the cluster correctly, because the pipeline runs as far as the gather() call. Thinking about it, I can see why spread() might not work conceptually, but gather() seems like it could work across cores. Any tips would be appreciated! There could be typos (I changed variable names before posting).

# setup parallel
library(parallel)
library(multidplyr)

cl <- create_cluster(detectCores())
set_default_cluster(cl)

# tried initializing the cluster here first, then below later to try on the gather()
dataset1_cl <- dataset1 %>%
  partition(fake_id, cluster = cl)

cluster_library(dataset1_cl, "tidyverse")

# multidplyr:
q1_to <- dataset1 %>%
  # Remove empty randomized IDs
  filter(
    is.na(fake_id) == FALSE & is.na(fake_a_id) == FALSE
  ) %>%
  select(fake_a_id, fake_id, company, agency, interviewer, a_date, ends_with("ImportantTo")) %>%
  select(fake_a_id, fake_id, company, agency, interviewer, a_date, starts_with("Q1")) %>%
  mutate_at(
    .funs = funs(as.character),
    .vars = vars(starts_with("Q1"))
  ) %>%
  collect() %>%
  partition(fake_id, cluster = cl) %>%
  gather(item, import_to, Q1A1_ImportantTo:Q1B13_ImportantTo) %>%
  mutate(
    import_to_n = as.numeric(import_to),
    item = gsub("ImportantTo", "", item),
    id = as.factor(paste0(fake_a_id, item))
  ) %>%
  group_by(id, fake_a_id, fake_id, company, agency, interviewer, a_date, item, import_to) %>%
  summarize(
    n = n_distinct(id),
    import_to_n = sum(import_to_n)
  ) %>%
  ungroup()

cluelessgumshoe commented 5 years ago

ungroup() %>%
collect()

Error in UseMethod("gather") : no applicable method for 'gather' applied to an object of class "party_df"

cluelessgumshoe commented 5 years ago

I think it is working now, but advice on the method would still be awesome. If I replace the gather() calls in my functions with do(g.fun1(.)) %>%, where g.fun1 is a function loaded onto the cluster that returns the gather() call, it seems to work. I also had to move the collect() above the ungroup().
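For anyone trying the same workaround, here is a rough sketch of what it might look like. The toy data, the shortened column range, and the body of g.fun1 are made up for illustration. The cluster part assumes the development-era multidplyr functions used in this thread (create_cluster, cluster_library) plus cluster_assign_value for loading the helper, and it is skipped if that API is not installed.

```r
library(dplyr)
library(tidyr)

# Helper that wraps the gather() call; the column range is shortened for illustration.
g.fun1 <- function(df) {
  gather(df, item, import_to, Q1A1_ImportantTo:Q1A2_ImportantTo)
}

# Toy stand-in for dataset1 (made-up values).
toy <- tibble(
  fake_id = c(1, 1, 2),
  fake_a_id = c("a1", "a2", "a3"),
  Q1A1_ImportantTo = c("1", "2", "3"),
  Q1A2_ImportantTo = c("4", "5", "6")
)

# Local check: do() applies g.fun1 to each fake_id group,
# which mirrors what each worker would do with its groups.
local_long <- toy %>%
  group_by(fake_id) %>%
  do(g.fun1(.)) %>%
  ungroup()

# Cluster version. Assumes the older multidplyr API from this thread;
# the block is skipped when that API is not available.
if (requireNamespace("multidplyr", quietly = TRUE) &&
    exists("create_cluster", envir = asNamespace("multidplyr"))) {
  library(multidplyr)
  cl <- create_cluster(2)
  cluster_library(cl, "dplyr")
  cluster_library(cl, "tidyr")
  cluster_assign_value(cl, "g.fun1", g.fun1)  # load the helper onto every worker

  cluster_long <- toy %>%
    partition(fake_id, cluster = cl) %>%
    do(g.fun1(.)) %>%
    collect() %>%   # collect() first, then ungroup()
    ungroup()
}
```

The collect() comes before ungroup() here, matching the ordering that was needed above.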

hadley commented 5 years ago

gather() comes from tidyr, which is not currently supported by multidplyr.
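One option that follows from this is to do the reshape on the local data frame, where gather() has a method, and only partition afterwards. A minimal sketch with made-up toy data and a shortened column range (both are assumptions for illustration, not the real dataset):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for dataset1 (made-up values).
toy <- tibble(
  fake_id = c(1, 1, 2),
  fake_a_id = c("a1", "a2", "a3"),
  Q1A1_ImportantTo = c("1", "2", "3"),
  Q1A2_ImportantTo = c("4", "5", "6")
)

# Reshape locally, before any partition() call, so gather() sees a plain data frame.
long <- toy %>%
  gather(item, import_to, Q1A1_ImportantTo:Q1A2_ImportantTo) %>%
  mutate(
    import_to_n = as.numeric(import_to),
    item = gsub("ImportantTo", "", item)
  )
```

From here, long could be handed to partition(fake_id, cluster = cl) as in the original post, so that only the dplyr verbs run on the cluster.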

tidyverse / multidplyr

unable to get gather to work with multidplyr #71
