nathaneastwood / poorman

A poor man's dependency free grammar of data manipulation
https://nathaneastwood.github.io/poorman/
Other
339 stars 15 forks source link

Improve speed of `group_by()` #121

Open etiennebacher opened 1 year ago

etiennebacher commented 1 year ago

Related to #119.

Benchmark setup:

d <- data.frame(
  g1 = sample(LETTERS, 8000, TRUE),
  g2 = sample(LETTERS, 8000, TRUE),
  g3 = sample(LETTERS, 8000, TRUE),
  x1 = runif(8000),
  x2 = runif(8000),
  x3 = runif(8000)
)

# return a list of results so that both functions return the same output (without
# all the class problem, tibble vs data.frame)
poor = function() {
  foo <- d %>% 
    group_by(g1, g2, g3) |> 
    summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))

  list(foo$x1, foo$x2, foo$x3)
}

dpl = function() {
  foo <- d %>% 
    dplyr::group_by(g1, g2, g3) |> 
    dplyr::summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))

  list(foo$x1, foo$x2, foo$x3)
}

bench::mark(
  poor(),
  dpl()
)

Before:

#> # A tibble: 2 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory     time           gc      
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>     <list>         <list>  
#> 1 poor()        11.6s    11.6s    0.0863    1.83GB    11.0      1   127      11.6s <list [3]> <Rprofmem> <bench_tm [1]> <tibble>
#> 2 dpl()       157.2ms  178.3ms    5.77      7.34MB     5.77     3     3    520.2ms <list [3]> <Rprofmem> <bench_tm [3]> <tibble>

After:

#> # A tibble: 2 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory     time           gc      
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>     <list>         <list>  
#> 1 poor()        7.77s    7.77s     0.129    1.83GB     9.78     1    76      7.77s <list [3]> <Rprofmem> <bench_tm [1]> <tibble>
#> 2 dpl()      112.42ms 152.17ms     6.66     3.39MB     3.33     4     2   601.02ms <list [3]> <Rprofmem> <bench_tm [4]> <tibble>
codecov-commenter commented 1 year ago

Codecov Report

Base: 93.59% // Head: 93.60% // Increases project coverage by +0.01% :tada:

Coverage data is based on head (e6e28e8) compared to base (3cc0a99). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #121 +/- ## ========================================== + Coverage 93.59% 93.60% +0.01% ========================================== Files 59 59 Lines 1483 1486 +3 ========================================== + Hits 1388 1391 +3 Misses 95 95 ``` | [Impacted Files](https://codecov.io/gh/nathaneastwood/poorman/pull/121?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nathan+Eastwood) | Coverage Δ | | |---|---|---| | [R/group\_utils.R](https://codecov.io/gh/nathaneastwood/poorman/pull/121?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nathan+Eastwood#diff-Ui9ncm91cF91dGlscy5S) | `95.12% <100.00%> (+0.38%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nathan+Eastwood). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nathan+Eastwood)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

etiennebacher commented 1 year ago

TODO: the following code doesn't produce the right output (note that the output is also wrong without this PR):

suppressPackageStartupMessages(library(poorman))

d <- data.frame(
  orig = rep(c("France", "UK"), each = 4),
  dest = rep(c("Spain", "Germany"), times = 4),
  year = rep(rep(c(2010, 2011), each = 2), 2),
  value = 1:8
)
d[2, 1] <- NA
d[7, 2] <- NA

d
#>     orig    dest year value
#> 1 France   Spain 2010     1
#> 2   <NA> Germany 2010     2
#> 3 France   Spain 2011     3
#> 4 France Germany 2011     4
#> 5     UK   Spain 2010     5
#> 6     UK Germany 2010     6
#> 7     UK    <NA> 2011     7
#> 8     UK Germany 2011     8

d %>%
  group_by(orig, dest) %>%
  group_data()
#>     orig    dest .rows
#> 1 France Germany     4
#> 2 France   Spain  1, 3
#> 3     UK Germany  6, 8
#> 4     UK   Spain     5
#> 5     UK    <NA>      
#> 6   <NA> Germany  2, 7

The groups UK-NA and NA-Germany should have 1 row each.

nathaneastwood commented 1 year ago

I've been away. I had every intention of reviewing this this evening but I spent it catching up on other things. I'll try and look tomorrow after work.