tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org
Other
641 stars 75 forks source link

implementation of group_modify() for multidplyr_party_df #102

Closed huisaddison closed 3 years ago

huisaddison commented 3 years ago

In a previous comment for issue #51, plans for implementation of group_modify() and group_map() were mentioned. However, the current version 0.0.0.9 does not appear to have it implemented.

(If this is an inappropriate venue to discuss planned feature development, please close this issue, though I would be interested to know where in the roadmap group_modify falls).

Reprex:

> library(tidyverse)
> library(multidplyr)
> 
> cluster = new_cluster(2)
> 
> df = tibble(x=rnorm(20),
+             y=rnorm(20),
+             k=rep(1:5, 4))
> 
> df %>% group_by(k) %>%
+   group_modify(~broom::tidy(lm(y ~ x, data=.x))) %>%
+   ungroup ()
# A tibble: 10 x 6
       k term        estimate std.error statistic p.value
   <int> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1     1 (Intercept)   0.634     0.0982     6.45   0.0232
 2     1 x            -1.48      0.172     -8.59   0.0133
 3     2 (Intercept)   0.212     0.180      1.18   0.360 
 4     2 x            -0.584     0.234     -2.49   0.130 
 5     3 (Intercept)   0.0886    0.307      0.288  0.800 
 6     3 x             0.483     0.822      0.588  0.616 
 7     4 (Intercept)  -1.26      0.809     -1.56   0.258 
 8     4 x             1.28      0.652      1.97   0.188 
 9     5 (Intercept)  -0.128     0.465     -0.276  0.808 
10     5 x            -0.0355    0.324     -0.110  0.923 
> 
> df %>% group_by(k) %>%
+   partition(cluster) %>%
+   group_modify(~broom::tidy(lm(y ~ x, data=.x))) %>%
+   ungroup () %>%
+   collect()
Error in UseMethod("group_modify") : 
  no applicable method for 'group_modify' applied to an object of class "multidplyr_party_df"
hadley commented 3 years ago

It's not currently on the roadmap because I don't think group_modify() is needed any more (see https://www.tidyverse.org/blog/2020/03/dplyr-1-0-0-summarise/ for details)

rxhu commented 2 years ago

@hadley, I disagree. goup_modify is still needed because the following is working with group_modify but not working with summarise no matter whether the tilde ~ is included or not:

iris %>% group_by(Species) %>% summarize(~broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x))) Error: Problem with summarise() input ..1. i ..1 = ~broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)). x ..1 must be a vector, not a formula object. i The error occurred in group 1: Species = setosa. Run rlang::last_error() to see where the error occurred.

domq commented 1 year ago

@rxhu, it works for me without the tilde (the first one that is), if you also remove data = .x:

library(dplyr)
iris %>% group_by(Species) %>% summarize(broom::tidy(lm(Petal.Length ~ Sepal.Length)))

... of course, the warning you get is not exactly encouraging:

Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.

I haven't been able to get reframe to work yet, like, at all — multidplyr or no. 🤷 Update: after doing my homework,

library(dplyr)
iris %>% reframe(.by = Species, broom::tidy(lm(Petal.Length ~ Sepal.Length)))

Getting back to the issue at hand, so to speak. Out of the group_modify(), summarise() and reframe() approaches, only the one with summarise() appears to currently be compatible with multidplyr (obsolescence warning notwithstanding):

library(dplyr)
library(multidplyr)
cluster <- new_cluster(2)
iris %>% group_by(Species) %>% partition(cluster) %>% summarize(broom::tidy(lm(Petal.Length ~ Sepal.Length)))

The other two approaches give off an error message like

Error in UseMethod("reframe") :
  no applicable method for 'reframe' applied to an object of class "multidplyr_party_df"

As far as the latter is concerned, I took the liberty to open #143 afresh.