Closed moodymudskipper closed 4 years ago
We could support, for by and along alike, the same possible forms as data.table proposes, removing s() instead here.
data.table also supports:
a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., DT[, sum(a), by="x,y,z"]
but I see no interest in this one
Advanced: When i is a list (or data.frame or data.table), DT[i, j, by=.EACHI] evaluates j for the groups in 'DT' that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i. See this StackOverflow answer for a more detailed explanation until we roll out vignettes.
I think maybe we should forbid these joins: our semi joins should only filter, and when a call would do more than filter, it should fail and suggest that the user call a proper join function.
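To make the distinction concrete, here is a minimal base R sketch of a semi join that only filters, next to a proper join that adds columns and can multiply rows. The data frames x and y are made up for illustration, and this is not how tb implements either operation.

```r
# Illustrative data, not from the issue.
x <- data.frame(id = c(1, 2, 3), val = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 3), extra = c("p", "q", "r"))

# Semi join: keep the rows of x that have a match in y.
# No column is added and no row of x is duplicated.
semi <- x[x$id %in% y$id, , drop = FALSE]

# Proper join: columns of y are added, and id 3 appears twice
# because it matches two rows of y.
full <- merge(x, y, by = "id")
```

The semi join keeps the shape of x, which is the behavior proposed above; anything that changes the columns or repeats rows would be the job of a join function.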
To validate the features of join and along, the following tests should be implemented :
.by
.by
s() is supported in .by
col1:col2 is supported in .by
s() is supported in formula notation
col1:col2 is supported in formula notation
done
dplyr and data.table both use the same grouping for "mutate by" and "summarize by".
data.table uses .() to summarize by and := to mutate by (by reference), with the same by.
dplyr uses a different verb but the same group_by().
We'll use the same arguments to summarize or mutate (https://github.com/moodymudskipper/tb/issues/7, https://github.com/moodymudskipper/tb/issues/8), but won't use the same by:
by, when it reduces the number of rows, keeps its name;
by as in "mutate by" is to be pronounced "along" and uses formulas.
iris_dt[, maxSL := max(Sepal.Length), by = "Species"]
iris %>% group_by(Species) %>% mutate(maxSL = max(Sepal.Length))
iris_tb[maxSL = max(Sepal.Length) ~ Species]
I believe a conceptual difference is useful, but practically, we often need to "mutate by" and mutate at the same time, or mutate several columns by different groups.
The caveat is that you need to specify the "along" variable for each mutate operation (though in my experience it's never more than one, rarely two, at a time). If this is problematic, we can have an "along" parameter to override the behavior in all mutate calls.
The notation can be max(Sepal.Length) ~ Species + foo or max(Sepal.Length) ~ c("Species", "foo"), or it can use s() as in by (https://github.com/moodymudskipper/tb/issues/6).
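As a rough sketch of how the first two formula forms could be taken apart, the helper below extracts the "along" columns from the right-hand side of a formula using only base R. The function along_vars() is hypothetical and not part of tb, and the s() form is left out because its semantics are defined in the linked issue.

```r
# Hypothetical helper, not part of tb: get the "along" columns of a formula.
along_vars <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 3L)
  rhs <- f[[3L]]
  # Form 1: Species + foo, where the columns are bare symbols.
  vars <- all.vars(rhs)
  if (length(vars) > 0L) return(vars)
  # Form 2: c("Species", "foo"), a character vector that is evaluated as is.
  eval(rhs, environment(f))
}

v1 <- along_vars(max(Sepal.Length) ~ Species + foo)
v2 <- along_vars(max(Sepal.Length) ~ c("Species", "foo"))

# The left-hand side is the expression to compute within each group.
lhs <- (max(Sepal.Length) ~ Species)[[2L]]
```

Both calls yield the same two column names, so the two notations could be treated as interchangeable downstream.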