tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Poor performance for group_by & do for larger n and incorrect progress bar #1921

Closed ghaarsma closed 8 years ago

ghaarsma commented 8 years ago

In the current release, dplyr 0.4.3, the time to finish the code below grows as O(n^3). Also, for larger n (loops), the progress bar is incorrect: it shows that it is done, but then seems to hang at the end until the call completes (which can be much later).

## Synthetic data example
library(dplyr)

f_do <- function(df) {
  data.frame(x2 = sum(df$x2), stringsAsFactors = FALSE)
}

loops <- seq(1e4, 10e4, by = 1e4)
time  <- vector(mode = 'numeric', length(loops))
for (i in seq_along(loops)) {
  # one group per row, so the number of groups equals loops[i]
  dat <- data.frame(x1 = 1:loops[i], x2 = floor(runif(loops[i], 1, 10)))
  elapsed <- system.time(r <- group_by(dat, x1) %>% do(f_do(.)))
  time[i] <- elapsed['elapsed']
}
ggplot2::qplot(loops, time)
# roughly linear if the run time is cubic in the number of groups
ggplot2::qplot(loops, time^(1/3))

The progress bar indication seems to grow linearly with the number of loops (as it should), but the actual time to finish grows as O(n^3). As the number of loops goes up, the time it appears to hang after the progress bar is complete also grows.
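The cube-root plot only checks one candidate exponent. As a rough alternative, the exponent can be estimated from a log-log fit. This is a minimal sketch: `estimate_exponent` is a hypothetical helper, not part of dplyr, and the demonstration values are synthetic, not measured timings.

```r
# Hypothetical helper: estimate k in elapsed ~ c * n^k
# by fitting a straight line to log(elapsed) against log(n).
estimate_exponent <- function(n, elapsed) {
  fit <- lm(log(elapsed) ~ log(n))
  unname(coef(fit)[2])  # the slope is the estimated exponent
}

# Synthetic illustration only: a perfectly cubic curve, not real timings.
n <- seq(1e4, 10e4, by = 1e4)
synthetic_elapsed <- 2e-12 * n^3
k <- estimate_exponent(n, synthetic_elapsed)
```

With the vectors from the example above, `estimate_exponent(loops, time)` should come out near 3 if the growth really is cubic, and near 1 if it is linear. Real timings are noisy, so the estimate is only indicative.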

I can't seem to reproduce the problem in the current Development/Master version, so perhaps the problem is already addressed, but I could not find a matching issue.

hadley commented 8 years ago

Also note that the progress bar only reflects computation time: there's often a little time at the end needed to join all the pieces together.

ghaarsma commented 8 years ago

Did a little bit more testing. Indeed the computation time in both dplyr 0.4.3 and current master is O(n) and the progress bar is only measuring this.

For dplyr 0.4.3 the joining (done by rbind_all) is O(n^3). This seems to have been addressed in the current master, where the joining is done by bind_rows and is O(n). Perhaps related to #1396?
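For anyone who wants to time the two phases separately, here is a minimal sketch. It assumes that `split()` plus `lapply()` is an acceptable stand-in for the per-group computation. That is an approximation for illustration, not what `do()` does internally, and `time_phases` is a hypothetical helper. It uses `bind_rows` only, so it does not reproduce the old `rbind_all` behaviour.

```r
library(dplyr)

f_do <- function(df) {
  data.frame(x2 = sum(df$x2), stringsAsFactors = FALSE)
}

# Hypothetical helper: time the per-group computation and the final
# binding step separately for n single-row groups.
time_phases <- function(n) {
  dat <- data.frame(x1 = seq_len(n), x2 = floor(runif(n, 1, 10)))

  # Phase 1: compute one small data frame per group
  # (stand-in for the work the progress bar tracks).
  pieces <- NULL
  compute <- system.time(
    pieces <- lapply(split(dat, dat$x1), f_do)
  )[["elapsed"]]

  # Phase 2: join all the pieces together.
  combined <- NULL
  combine <- system.time(
    combined <- bind_rows(pieces)
  )[["elapsed"]]

  c(n = n, compute = compute, combine = combine, rows = nrow(combined))
}

res <- time_phases(200)
```

Calling `time_phases` over a range of `n` and plotting `compute` and `combine` against `n` should show which phase dominates as the number of groups grows.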