tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

unable to get gather to work with multidplyr #71

Closed cluelessgumshoe closed 5 years ago

cluelessgumshoe commented 5 years ago

We are having trouble getting a complicated dplyr transformation to work, and I was hoping that multidplyr might be the answer. I think I have initialized the cluster correctly, because the pipeline runs all the way up to the gather() call. Thinking about it, I can see why spread() might not work conceptually, but gather() seems like it could work across cores. Any tips would be appreciated! There could be typos (I changed variable names before posting).

# setup parallel
library(parallel)
library(multidplyr)

cl <- create_cluster(detectCores())
set_default_cluster(cl)

# tried initializing the cluster here first, then below later to try on the gather()
dataset1_cl <- dataset1 %>%
  partition(fake_id, cluster = cl)

cluster_library(dataset1_cl, "tidyverse")

# multidplyr:
q1_to <- dataset1 %>%
  # Remove empty randomized IDs
  filter(
    is.na(fake_id) == FALSE & is.na(fake_a_id) == FALSE
  ) %>%
  select(fake_a_id, fake_id, company, agency, interviewer, a_date, ends_with("ImportantTo")) %>%
  select(fake_a_id, fake_id, company, agency, interviewer, a_date, starts_with("Q1")) %>%
  mutate_at(
    .funs = funs(as.character),
    .vars = vars(starts_with("Q1"))
  ) %>%
  collect() %>%
  partition(fake_id, cluster = cl) %>%
  gather(item, import_to, Q1A1_ImportantTo:Q1B13_ImportantTo) %>%
  mutate(
    import_to_n = as.numeric(import_to),
    item = gsub("ImportantTo", "", item),
    id = as.factor(paste0(fake_a_id, item))
  ) %>%
  group_by(id, fake_a_id, fake_id, company, agency, interviewer, a_date, item, import_to) %>%
  summarize(
    n = n_distinct(id),
    import_to_n = sum(import_to_n)
  ) %>%
  ungroup()

cluelessgumshoe commented 5 years ago

I think it is working now, but advice on the method would still be awesome. If I replace each gather() in my functions with do(g.fun1(.)) %>%, where g.fun1 is a function loaded onto the cluster that wraps the gather() call, it seems to work. I also had to move the collect() to just above the ungroup().
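
For anyone reading later, the workaround above could look roughly like the sketch below. It runs on a plain local data frame with dplyr and tidyr only, so no cluster is needed to try it. The `toy` data and its values are made up for illustration, and `g.fun1` here is a guess at the shape of the helper, not the poster's actual code.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: wraps the gather() call so that do() can apply it
# to one group (or, on a cluster, one partition) at a time.
g.fun1 <- function(df) {
  gather(df, item, import_to, Q1A1_ImportantTo:Q1A2_ImportantTo)
}

# Made-up toy data standing in for dataset1 (assumption, not from the thread).
toy <- tibble(
  fake_id = c(1, 1, 2),
  fake_a_id = c("a", "b", "c"),
  Q1A1_ImportantTo = c("1", "2", "3"),
  Q1A2_ImportantTo = c("4", "5", "6")
)

# Apply the helper per group, mirroring do(g.fun1(.)) from the comment above.
long <- toy %>%
  group_by(fake_id) %>%
  do(g.fun1(.)) %>%
  ungroup()

print(long)
```

On a partitioned data frame the same `do(g.fun1(.))` step would run on each worker, so `g.fun1` (and tidyr) would first have to be made available there. The exact call for copying a function to the workers depends on the multidplyr version installed, so check its documentation rather than relying on this sketch.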

hadley commented 5 years ago

gather() comes from tidyr which is not currently supported by multidplyr.