r-lib / rray

Simple Arrays
https://rray.r-lib.org
GNU General Public License v3.0

rray_summarise() #231

Open juangomezduaso opened 5 years ago

juangomezduaso commented 5 years ago

Function tapply() is the obvious way to produce arrays from data frames. But dplyr users have other aggregation functionality that keeps them in the realm of the tidy dataset format. Perhaps it would be useful to ease their occasional jumps to array computing by offering a kind of tapply() tailored to their conventions. If you dare to infringe Hadley Wickham's copyright on function names, a simple example of this could be:

library(rray)
library(dplyr)
library(gapminder)

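# Summarise a grouped tibble into an rray with one dimension per grouping variable.
rray_summarise <- function(grtib, exp = 1, FUN = sum, ...) {
  e <- rlang::enexpr(exp)                         # capture the expression unevaluated
  tib2 <- dplyr::transmute(grtib, `*var*` = !!e)  # grouping columns are kept
  groups <- dplyr::group_vars(tib2)               # names of the grouping variables
  as_rray(
    tapply(tib2$`*var*`, INDEX = as.list(tib2[groups]), FUN = FUN, ...)
  )
}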
gapminder %>% group_by(continent, year) %>% rray_summarise(pop/1000)
#> <rray<dbl>[,12][60]>
#>           year
#> continent        1952       1957       1962       1967      1972      1977
#>   Africa    237640.50  264837.74  296516.86  335289.49  379879.5  433061.0
#>   Americas  345152.45  386953.92  433270.25  480746.62  529384.2  578067.7
#>   Asia     1395357.35 1562780.60 1696357.18 1905662.90 2150972.2 2384513.6
#>   Europe    418120.85  437890.35  460355.15  481178.96  500635.1  517164.5
#>   Oceania    10686.01   11941.98   13283.52   14600.41   16106.1   17239.0
#>           year
#> continent        1982       1987       1992       1997       2002
#>   Africa    499348.59  574834.11  659081.52  743832.98  833723.92
#>   Americas  630290.92  682753.97  739274.10  796900.41  849772.76
#>   Asia     2610135.58 2871220.76 3133292.19 3383285.50 3601802.20
#>   Europe    531266.90  543094.16  558142.80  568944.15  578223.87
#>   Oceania    18394.85   19574.42   20919.65   22241.43   23454.83
#>           year
#> continent        2007
#>   Africa    929539.69
#>   Americas  898871.18
#>   Asia     3811953.83
#>   Europe    586098.53
#>   Oceania    24549.95

Created on 2019-06-20 by the reprex package (v0.2.1)

juangomezduaso commented 5 years ago

The main difference between dplyr's group_by + summarise and tapply (and thus a refined rray_summarise) is which groups are considered. In the first case, groups are formed from the data, and so only the combinations actually present in the data are returned. In the second, "a priori" classifications are prescribed in the form of factor variables, and the result is an exhaustive crossing of them, no matter what the data set actually contains. Not only individual cells but even entire rows with no data will be in the result, as long as their factor level was prescribed. The order of the levels is kept as well. This predictable result seems preferable when automating the production of aggregates.
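To make the contrast concrete, here is a minimal base R sketch. The `sales` data frame and its columns are made up for illustration, and aggregate() stands in for dplyr's data-driven grouping, so this is an assumption-laden toy rather than the proposed implementation.

```r
# Toy data: the factor `region` prescribes a level ("South") that never occurs.
sales <- data.frame(
  region = factor(c("North", "North", "East"),
                  levels = c("North", "South", "East")),
  year   = factor(c(2001, 2002, 2001), levels = c(2001, 2002)),
  amount = c(10, 20, 5)
)

# tapply() crosses all prescribed levels, in level order; empty cells are NA.
full <- tapply(sales$amount, sales[c("region", "year")], sum)

# Data-driven grouping keeps only the combinations present in the data.
present <- aggregate(amount ~ region + year, data = sales, FUN = sum)

dim(full)      # 3 regions x 2 years, including the empty "South" row
nrow(present)  # one row per combination that actually occurs
```

The array keeps the same shape and level order whatever rows happen to be in `sales`, which is the predictability argued for above.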

This is an obvious clarification, but I think it is important here as another justification (besides the ability to operate on aggregates of different granularities thanks to rray broadcasting, of course) of why functionality like this complements what dplyr offers now. My rray_summarise() function, based on the current dplyr::group_by(), doesn't address this completely. (In terms of dplyr's issue #4392, I am solving the "expand" part.) But in view of https://github.com/tidyverse/dplyr/issues/4392#issuecomment-497309434 it could change to something completely different.
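As a rough illustration of operating on aggregates of different granularities, the base R sketch below divides a region-by-year total by a per-year total. The `sales` data is hypothetical. Base R needs sweep() to align the two arrays explicitly; the point made above is that rray broadcasting is meant to handle that alignment for you, which is not shown here.

```r
# Toy data, made up for illustration.
sales <- data.frame(
  region = factor(c("North", "North", "East", "East")),
  year   = factor(c(2001, 2002, 2001, 2002)),
  amount = c(10, 30, 5, 15)
)

# Aggregates at two granularities from the same data.
by_region_year <- tapply(sales$amount, sales[c("region", "year")], sum)  # 2 x 2
by_year        <- tapply(sales$amount, sales["year"], sum)               # length 2

# Divide each column (year) of the finer aggregate by the matching yearly total.
share <- sweep(by_region_year, 2, by_year, "/")

colSums(share)  # each year's shares add up to 1
```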