moodymudskipper / tb

IN ~PROGRESS my own take on `[.data.frame`

Formula notation to mean "along" #9

Closed moodymudskipper closed 4 years ago

moodymudskipper commented 5 years ago

dplyr and data.table both use the same grouping for "mutate by" and "summarize by".

I believe distinguishing the two conceptually is useful. In practice, we often need to "mutate by" and mutate in the same call, or mutate several columns by different groups.

The caveat is that you need to specify the "along" variable for each mutate operation (though in my experience it's usually just one, rarely two, at a time). If this is problematic, we can have an `along` parameter to override the behavior in all mutate calls.

The notation can be `max(Sepal.Length) ~ Species + foo` or `max(Sepal.Length) ~ c("Species", "foo")`, or it can use `s()` as for `by`, see https://github.com/moodymudskipper/tb/issues/6
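To make the intended semantics concrete, here is a minimal base R sketch of what "along" could compute: the left-hand side of the formula is evaluated within each group given by the right-hand side, and the result is recycled to the rows of that group. `mutate_along()` is a hypothetical helper written for illustration, not a tb function, and it only handles the `~ Species + foo` form, not the `c("Species", "foo")` or `s()` forms.

```r
# Hypothetical helper, NOT part of tb: evaluates the formula's left-hand side
# per group and returns one value per row of `df`, in the original row order.
mutate_along <- function(df, f) {
  stopifnot(inherits(f, "formula"), length(f) == 3L)
  expr  <- f[[2]]             # e.g. max(Sepal.Length)
  along <- all.vars(f[[3]])   # e.g. "Species" (symbol form only)

  # Row indices of each group; unused level combinations are dropped.
  idx <- split(seq_len(nrow(df)), df[along], drop = TRUE)

  # Evaluate the expression on each group's rows and recycle to group size.
  vals <- lapply(idx, function(i) {
    v <- eval(expr, df[i, , drop = FALSE], environment(f))
    rep_len(v, length(i))
  })

  # Put the values back at their original row positions.
  flat <- unlist(vals, use.names = FALSE)
  out <- flat
  out[unlist(idx, use.names = FALSE)] <- flat
  out
}

iris2 <- iris
iris2$max_sl <- mutate_along(iris, max(Sepal.Length) ~ Species)
```

For this single-variable case the result matches `ave(iris$Sepal.Length, iris$Species, FUN = max)`; the formula simply carries the "along" variable next to the expression it applies to.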

moodymudskipper commented 4 years ago

We could support, for by and along alike, the same possible forms as data.table proposes, removing

data.table also supports:

a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., DT[, sum(a), by="x,y,z"]

but I see no value in this one.

Advanced: When i is a list (or data.frame or data.table), DT[i, j, by=.EACHI] evaluates j for the groups in 'DT' that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i. See this StackOverflow answer for a more detailed explanation until we roll out vignettes.

I think we should probably forbid these joins. Our semi joins should only filter; when they would do anything else, they should fail and suggest that the user call a proper join function instead.
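One possible reading of "only filter" can be sketched in base R as follows. `semi_filter()` is a hypothetical name, and the two failure conditions (extra columns in `i`, duplicated keys in `i`) are assumptions about what "not only filtering" would mean, not behavior confirmed anywhere in this thread.

```r
# Hypothetical sketch, NOT tb's implementation: keep the rows of `df` whose
# key values appear in `i`, and fail whenever a real join would do more.
semi_filter <- function(df, i) {
  keys <- intersect(names(df), names(i))
  if (length(keys) == 0L) {
    stop("`i` shares no column with `df`.")
  }
  extra <- setdiff(names(i), names(df))
  # Extra columns would be added by a join; duplicated keys would repeat rows.
  if (length(extra) > 0L || anyDuplicated(i[keys]) > 0L) {
    stop("`i` would add columns or duplicate rows; ",
         "use a proper join function such as merge().")
  }
  # One comparable key string per row on each side.
  key_of <- function(d) do.call(paste, c(unname(as.list(d[keys])), sep = "\r"))
  df[key_of(df) %in% key_of(i), , drop = FALSE]
}

kept <- semi_filter(iris, data.frame(Species = c("setosa", "virginica")))
```

With these rules the result is always a subset of the rows of `df`, in their original order, which is what separates a filter from a join.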

To validate the features of join and along, the following tests should be implemented:

moodymudskipper commented 4 years ago

done