tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

`party_df` class doesn't support an `arrange` method #49

Closed ahoho closed 5 years ago

ahoho commented 7 years ago

Is arrange just not implemented yet for party_df, or is there a reason that this verb doesn't make sense on partitioned data? Is the order of the data retained after partitioning?

hadley commented 7 years ago

You can't reorder across partitions easily, so I'm not sure what arrange should do.

ahoho commented 7 years ago

I was thinking it would reorder within the chosen partitions (how arrange used to work on grouped_df objects in pre-v0.5 dplyr).

For example, in the flight data, if we were to partition on flight number, we might want to arrange on departure time in order to calculate the time between flights.
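A minimal sketch of that use case, using plain dplyr on a made-up table rather than multidplyr or the real flight data. The `flights` tibble, its column names (`flight`, `dep_min`), and its values are hypothetical; the point is only that ordering within each group is enough to compute the gap to the previous departure.

```r
library(dplyr)

# Toy stand-in for the flight data; every value here is invented.
# dep_min is the departure time in minutes since midnight.
flights <- tibble(
  flight  = c(101, 202, 101, 202, 101),
  dep_min = c(900, 1300, 600, 700, 1200)
)

gaps <- flights %>%
  group_by(flight) %>%
  arrange(dep_min, .by_group = TRUE) %>%     # order rows inside each flight
  mutate(gap = dep_min - lag(dep_min)) %>%   # minutes since the previous departure
  ungroup()

gaps
```

The first row of each flight has no previous departure, so its `gap` is `NA`. No ordering across flights is needed for this calculation, only ordering within each one.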

kyp0717 commented 7 years ago

Thanks, Ax3man, for the referral to this thread. Hi ahoho, did you find a solution to this? I want to do the same thing: order rows inside each partition (not across several partitions).

Thank You ALL!!!

Ax3man commented 7 years ago

It's not particularly hard to do:

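library(dplyr)       # provides %>%, arrange() and collect()
library(multidplyr)

# Unofficial arrange method for party_df objects.
# It relies on shard_call(), an internal (unexported) multidplyr function.
arrange_.party_df <- function(.data, ..., .dots = list()) {
  # Run dplyr::arrange on every shard, passing all grouping
  # variables except the last one as the groups of the result
  multidplyr:::shard_call(.data, quote(dplyr::arrange), ..., .dots = .dots,
                          groups = .data$groups[-length(.data$groups)])
}

mtcars %>%
  partition(cyl) %>%       # spread the cyl groups over the workers
  arrange(cyl, disp) %>%   # sort inside each partition
  collect() %>%            # bring the shards back into one data frame
  arrange(cyl)             # restore a consistent order of the groups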

You need to arrange on both cyl and disp within each partition because, when the number of groups is larger than the number of partitions, a single partition holds several groups. You could adjust the function above to capture .data$partitions and add the partition variable automatically.

To get consistent results after collection, though, you'll need to arrange on the partition variable again after collecting, since the groups get assigned to cores randomly. That extra sort could cancel out any performance advantage you gained by doing the arranging on a cluster.

Whether it's a sensible thing to do is another matter.
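The two points above can be illustrated without a cluster. The sketch below simulates shards as a plain list of data frames; the `shards` list and its worker names are hypothetical, and the assignment of groups to workers is fixed by hand here, whereas a real cluster would choose its own. It uses only dplyr, not multidplyr.

```r
library(dplyr)

# Simulate two workers, each holding whole cyl groups.
# worker_a holds two groups, as happens when groups outnumber partitions.
shards <- list(
  worker_a = filter(mtcars, cyl %in% c(4, 8)),
  worker_b = filter(mtcars, cyl == 6)
)

# Sort inside each shard. Including cyl keeps the two groups on
# worker_a from being interleaved by disp.
sorted_shards <- lapply(shards, arrange, cyl, disp)

# "Collect" the shards. Their order is arbitrary on a real cluster,
# so reverse the list here to mimic that.
collected <- bind_rows(rev(sorted_shards))

# A final sort on the partition variable makes the result reproducible.
result <- arrange(collected, cyl)
```

In this sketch `collected` starts with the 6-cylinder rows, and the final `arrange(cyl)` is what restores a predictable group order. The within-group `disp` order survives that step as long as `arrange()` leaves tied rows in their existing order, which is the behaviour the sketch assumes.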

snassimr commented 6 years ago

Hi, thanks Ax3man. I still don't see this issue solved inside the package. Is a fix planned for the near future?

hadley commented 5 years ago

Fixed in dev version — arrange() will now order within partition.