Open juangomezduaso opened 5 years ago
The main difference between dplyr's group_by() + summarise() and tapply() (and thus a refined rray_summarise()) is the set of groups being considered. In the first case, groups are formed from the data, and so only the combinations actually present in the data are returned. In the second, "a priori" classifications are prescribed in the form of factor variables, and the result is an exhaustive crossing of their levels, no matter what the data set actually contains. Not only individual cells but even entire rows with no data will appear in the result, as long as their factor level was prescribed. The order of the levels is kept as well. This predictable result seems preferable when the production of aggregates is automated.
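To make the contrast concrete, here is a minimal sketch on a made-up data frame (the names `df`, `g`, `h` and `v` are hypothetical, chosen only for illustration). It assumes dplyr >= 0.8.0, where group_by() accepts a `.drop` argument.

```r
library(dplyr)

# Hypothetical data: level "c" of g never occurs, and neither does (b, y)
df <- data.frame(
  g = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  h = factor(c("x", "y", "x"), levels = c("x", "y")),
  v = c(1, 2, 3)
)

# tapply(): full crossing of the prescribed levels, a 3 x 2 array in level
# order; cells with no data are filled with NA
arr <- tapply(df$v, list(g = df$g, h = df$h), sum)

# group_by() + summarise(): only the combinations present in the data
observed <- summarise(group_by(df, g, h), total = sum(v))

# With .drop = FALSE the empty factor combinations are kept as well
crossed <- summarise(group_by(df, g, h, .drop = FALSE), total = sum(v))

dim(arr)       # 3 2
nrow(observed) # 3
nrow(crossed)  # 6
```

Note that the empty cells differ in content: tapply() leaves them as NA, while the `.drop = FALSE` version evaluates sum() on an empty vector and so returns 0.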
This is an obvious clarification, but I think it is important here as another justification (besides the ability to operate on aggregates of different granularities thanks to rray broadcasting, of course) of why functionality like this complements what dplyr offers now. My rray_summarise() function, based on the current dplyr::group_by(), doesn't address this completely. (In terms of dplyr's issue #4392, I am solving the "expand" part.) But in view of https://github.com/tidyverse/dplyr/issues/4392#issuecomment-497309434 it could change to something completely different.
The tapply() function is the obvious way to produce arrays from data frames. But users of dplyr have other aggregation functionality that keeps them in the realm of the tidy dataset format. Perhaps it would be useful to ease their occasional jumps to array computing by offering a kind of tapply() tailored to their conventions. If you dare to infringe Hadley Wickham's copyright on function names, a simple example of this could be:
Created on 2019-06-20 by the reprex package (v0.2.1)