tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

Implement `distinct` #117

Open PaulinaUrban opened 3 years ago

PaulinaUrban commented 3 years ago

I have a problem running calculations on several cores with multidplyr in R. I assign each row of my data a group number (the data is grouped by this number, so rows with number 1 are sent to worker 1, and so on), as in the code below:


group <- rep(1:cores, length.out = nrow(dane))

dane <- bind_cols(tibble(group), dane)

cluster <- multidplyr::new_cluster(cores)

dane <-
  dane %>%
  group_by(group) %>%
  partition(cluster) 

I also send the libraries, values, and functions needed for the calculations to each worker.

After the data is split and sent to the cluster, I want to run the calculations and collect the results:

dane %>% select() %>% distinct() %>% ...

Unfortunately, I get the error below and don't know how to solve it. (If I use unique() instead of distinct(), a different error appears.)

Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "multidplyr_party_df"

hadley commented 3 years ago

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page. Please make sure your reprex is created with the reprex package as it gives nicely formatted output and avoids a number of common pitfalls.

PaulinaUrban commented 3 years ago

Dear Hadley, unfortunately reprex() gives strange errors when I try to create the example code:

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)
numCores <- detectCores()
#> Error in detectCores(): could not find function "detectCores"
cores <- numCores - 4
#> Error in eval(expr, envir, enclos): object 'numCores' not found
group <- rep(1:cores, length.out = nrow(flights))
#> Error in eval(expr, envir, enclos): object 'cores' not found
flights <- bind_cols(tibble(group), flights)
#> Error in eval_tidy(xs[[j]], mask): object 'group' not found
cluster <- multidplyr::new_cluster(cores)
#> Error in integer(n): object 'cores' not found
View(flights)
flights <-
+     flights %>%
+     group_by(group) %>%
+     partition(cluster) 
#> Error in FUN(left): invalid argument to unary operator
cluster_library(cluster,"tidyverse")
#> Error in is_cluster(cluster): object 'cluster' not found
cluster_library(cluster,"tidytext")
#> Error in is_cluster(cluster): object 'cluster' not found
cluster_library(cluster,"dplyr")
#> Error in is_cluster(cluster): object 'cluster' not found
cluster_copy(cluster, 'flights')
#> Error in is_cluster(cluster): object 'cluster' not found
flights <-
+     flights %>%
+     select(contains("dest"), everything()) %>%
+     select(`ID`=1, group = 2, abstract=3) %>%
+     distinct() 
#> Error in FUN(left): invalid argument to unary operator

So I am pasting plain code that uses publicly available data (from the nycflights13 package) and gives the same error as in my situation:


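library(multidplyr)  # provides partition(), cluster_library(), cluster_copy()
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)
library(parallel)    # provides detectCores()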

numCores <- detectCores()
cores <- numCores - 4
group <- rep(1:cores, length.out = nrow(flights))
flights <- bind_cols(tibble(group), flights)
cluster <- multidplyr::new_cluster(cores)

flights <-
  flights %>%
  group_by(group) %>%
  partition(cluster) 

cluster_library(cluster,"tidyverse")
cluster_library(cluster,"tidytext")
cluster_library(cluster,"dplyr")
cluster_copy(cluster, 'flights')

flights <-
  flights %>%
  select(contains("dest"), everything()) %>%
  select(`dest`=1, group = 2, origin=3) %>%
  distinct() %>%
  collect()

When you paste this code into the RStudio console and run it, you get this error: Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "multidplyr_party_df"

hadley commented 3 years ago

Here is a minimal reprex:

library(multidplyr)
library(dplyr, warn.conflicts = FALSE)

cluster <- multidplyr::new_cluster(2)

mtcars2 <- partition(mtcars, cluster)
mtcars2 %>% distinct()
#> Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "multidplyr_party_df"

Created on 2021-05-21 by the reprex package (v2.0.0)

Looks like I forgot to provide a distinct method.
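
For readers unfamiliar with S3 dispatch, the error can be illustrated with a toy generic in base R. This is only a sketch: `my_distinct` and `toy_party_df` are hypothetical stand-ins for `dplyr::distinct()` and multidplyr's `multidplyr_party_df` class, not the package's actual code.

```r
# Hypothetical generic standing in for dplyr::distinct()
my_distinct <- function(x, ...) UseMethod("my_distinct")
my_distinct.data.frame <- function(x, ...) unique(x)

# Hypothetical wrapper class standing in for multidplyr_party_df
toy <- structure(
  list(data = data.frame(a = c(1, 1, 2))),
  class = "toy_party_df"
)

# No method exists for "toy_party_df" and there is no default method,
# so dispatch fails with a "no applicable method" error.
msg <- tryCatch(my_distinct(toy), error = function(e) conditionMessage(e))

# Defining a method for the class makes the same call succeed.
my_distinct.toy_party_df <- function(x, ...) {
  x$data <- my_distinct(x$data, ...)
  x
}
fixed <- my_distinct(toy)
```

The same reasoning applies to the real error: dplyr's generic has nothing to dispatch to for the partitioned data frame class until a method is defined for it.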

PaulinaUrban commented 3 years ago

Dear Hadley, now I understand the error. The question now is: will you add a distinct() method to multidplyr in the near future, or how can I add this method in my own code?

hadley commented 3 years ago

I will add it next time I work on multidplyr.
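
Until such a method exists, one possible workaround (a sketch, not an official recommendation) is to run the supported verbs on the workers, bring the result back with `collect()`, and call `distinct()` on the ordinary tibble. The two-worker cluster and the `cyl`/`gear` columns are illustrative assumptions, and this assumes the selected columns fit in memory on the main process.

```r
library(multidplyr)
library(dplyr, warn.conflicts = FALSE)

cluster <- new_cluster(2)             # illustrative: two workers
mtcars2 <- partition(mtcars, cluster)

result <- mtcars2 %>%
  select(cyl, gear) %>%               # runs on the workers
  collect() %>%                       # back to a regular tibble
  distinct()                          # de-duplicate locally
```

De-duplicating after `collect()` also avoids a subtlety: identical rows can land on different workers, so a per-worker `distinct()` alone would not guarantee globally unique rows.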

pwwang commented 3 years ago

The group_map() family has the same issue.

Tkastylevsky commented 2 years ago

Any chance one of you came up with a fix for this?

JohannesFriedrich commented 2 months ago

> I will add it next time I work on multidplyr.

This comment is now 3 years old. Is there any maintenance planned to fix the mentioned issue(s)?

I would be very interested in further development of the package.

hadley commented 2 months ago

@JohannesFriedrich I don't have time to work on it right now, but I'd be happy to review PRs. (And I don't think this would be that hard to fix following the template of the other methods.)