ahoho closed this issue 5 years ago
You can't reorder across partitions easily, so I'm not sure what arrange should do.
I was thinking it would reorder within the chosen partitions (how arrange used to work on grouped_df objects in pre-v0.5 dplyr).
For example, in the flight data, if we were to partition on flight number, we might want to arrange on departure time in order to calculate the time between flights.
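To make the flight example concrete, here is a rough sketch in plain dplyr. It assumes group_by() as a stand-in for partition() (so it runs without a cluster), and flights_toy with its columns is made-up data, not the real flights table.

```r
library(dplyr)

# Hypothetical toy data: departure times in minutes after midnight
flights_toy <- tibble(
  flight   = c(1, 2, 1, 2, 1),
  dep_time = c(600, 615, 480, 900, 720)
)

gaps <- flights_toy %>%
  group_by(flight) %>%                        # stand-in for partition(flight)
  arrange(dep_time, .by_group = TRUE) %>%     # order rows within each flight number
  mutate(gap = dep_time - lag(dep_time)) %>%  # minutes since the previous departure
  ungroup()
```

The lag() step only gives meaningful gaps if the rows are already ordered within each flight number, which is why a within-partition arrange would be useful here.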
Thanks Ax3man for the referral to this thread. Hi ahoho, did you find a solution to this? I want to do the same thing: order the rows inside each partition (not across several partitions).
Thank You ALL!!!
It's not particularly hard to do:
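    library(dplyr)
    library(multidplyr)

    # arrange_ method for party_df: run dplyr::arrange on each shard
    arrange_.party_df <- function(.data, ..., .dots = list()) {
      multidplyr:::shard_call(
        .data, quote(dplyr::arrange), ..., .dots = .dots,
        groups = .data$groups[-length(.data$groups)]
      )
    }

    mtcars %>%
      partition(cyl) %>%
      arrange(cyl, disp) %>%  # order within each partition
      collect() %>%
      arrange(cyl)            # fix the order across partitions after collecting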
You need to arrange on both cyl and disp within each partition because, when the number of groups is larger than the number of partitions, a single partition holds several groups. You could adjust the function above to capture .data$partitions and add that automatically.
To get consistent results after collection though, you'll need to arrange on the partition variable after collecting, since the groups get assigned to cores randomly. This would hurt any performance advantage you gained by doing the arranging on a cluster.
Whether it's a sensible thing to do is another matter.
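As a rough illustration of why the final arrange is needed, the sketch below mimics collection without a cluster. The shards list and the sample() shuffle are assumptions standing in for workers returning in arbitrary order; none of this is multidplyr code.

```r
library(dplyr)

set.seed(1)  # only so the shuffled shard order is reproducible

# Stand-in for per-worker shards: one data frame per cyl value, each sorted by disp
shards <- lapply(split(mtcars, mtcars$cyl), function(d) arrange(d, disp))

# Mimic workers returning their shards in no particular order
collected <- bind_rows(shards[sample(length(shards))])

# Sorting on the partition variable puts the partitions back in a fixed order
result <- arrange(collected, cyl)
```

This last step runs on the full collected data in a single process, which is the cost mentioned above.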
Hi, thanks Ax3man. I don't see that this has been fixed inside the package yet. Is a fix expected in the near future?
Fixed in dev version: arrange() will now order within partition.
Is arrange just not implemented yet for party_df, or is there a reason that this verb doesn't make sense on partitioned data? Is the order of the data retained after partitioning?