Closed statzhero closed 1 year ago
The issue stems from the difference between the S3 classes returned by each method. The solution could be as easy as adding grouped_df to the class of the returned tidytable (or it may require more work).
Update:
This won't be a quick fix, as there is a major difference between how dplyr and data.table handle grouping. In data.table you only pass the grouping variable names to the by argument of [i, j, by]. dplyr, by contrast, computes the groups as soon as group_by() is called and saves them as an attribute on the data frame. Making the two compatible would require implementing something like grouped_df in tidytable.
> mtcars |> dplyr::group_by(cyl) %>% str
gropd_df [32 × 11] (S3: grouped_df/tbl_df/tbl/data.frame)
$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num [1:32] 160 160 108 258 360 ...
$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
- attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
..$ cyl : num [1:3] 4 6 8
..$ .rows: list<int> [1:3]
.. ..$ : int [1:11] 3 8 9 18 19 20 21 26 27 28 ...
.. ..$ : int [1:7] 1 2 4 6 10 11 30
.. ..$ : int [1:14] 5 7 12 13 14 15 16 17 22 23 ...
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
> mtcars |> tidytable::group_by(cyl) %>% str
Classes ‘grouped_tt’, ‘tidytable’, ‘data.table’ and 'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "groups")= chr "cyl"
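The difference in the two str() outputs above can also be checked directly. This is a minimal sketch, assuming both dplyr and tidytable are installed; the variable names are only illustrative.

```r
# Group the same data with each package (namespaced calls, nothing attached)
dplyr_grouped <- dplyr::group_by(mtcars, cyl)
tt_grouped <- tidytable::group_by(mtcars, cyl)

# dplyr stores a data frame of group keys plus row indices
is.data.frame(attr(dplyr_grouped, "groups"))

# tidytable stores only the names of the grouping columns
is.character(attr(tt_grouped, "groups"))

# Neither object carries the class that the other package's verbs look for
inherits(tt_grouped, "grouped_df")
inherits(dplyr_grouped, "grouped_tt")
```

Since each package's grouped verbs rely on their own class and their own form of the "groups" attribute, a grouped object from one is presumably not treated as grouped by the other.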
Thanks! The development version of dplyr contains (so it says) a major speed improvement for the dplyr::group_by() function, so hopefully the gap will matter less.
It would still be good to highlight this issue for novices; that is, one should not mix the two packages, especially now that they have the same function names throughout.
@moutikabdessabour more or less covered the reasoning but I'll leave my answer as well.
tidytable and dplyr take different approaches to grouping. tidytable adds a character vector that records which columns are used in the grouping. These variables are then passed to the by arg of data.table. At its most basic, a "tidytable" is a subclass of a "data.table" that leaves all heavy computational work to the data.table library. This includes group calculation.
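library(tidytable, warn.conflicts = FALSE)

df <- tidytable(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3) %>%
  group_by(x, y)

# The grouped object is still a data.table underneath
class(df)
#> [1] "grouped_tt" "tidytable" "data.table" "data.frame"

# Only the grouping column names are stored
attr(df, "groups")
#> [1] "x" "y"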
The group_by() in tidytable is a simple column selection. The actual computation of groups by data.table doesn't occur until you use mutate() or summarize() on a grouped tidytable.
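To make the deferred computation concrete, here is a rough sketch of how a grouped summarize() lines up with a direct data.table call. It assumes data.table and tidytable are installed; the data.table expression is an approximation for illustration, not tidytable's literal internals, and the names dat, group_cols, and total are made up for this example.

```r
library(data.table)

# Illustrative data, mirroring the reprex above
dat <- data.frame(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3)

# tidytable: group_by() only records the column names ...
tt <- tidytable::group_by(tidytable::as_tidytable(dat), x, y)
group_cols <- attr(tt, "groups")

# ... and the grouped verb is where the groups are actually computed
tt_result <- tidytable::summarize(tt, total = sum(z))

# Approximately equivalent data.table call: the stored names go to `by`
dt <- as.data.table(dat)
dt_result <- dt[, list(total = sum(z)), by = group_cols]

tt_result
dt_result
```

Both results should contain one row per (x, y) combination with the same totals.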
dplyr, on the other hand, is in charge of its own grouping calculation. A grouped_df is a subclass of a tibble, and dplyr attaches a data frame with grouping information to its own version of the "groups" attribute. The calculation of group locations occurs when you call group_by().
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3) %>%
  group_by(x, y)

class(df)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"

# The group keys and row locations are already computed here
attr(df, "groups")
#> # A tibble: 2 × 3
#>   x     y           .rows
#>   <chr> <chr> <list<int>>
#> 1 a     a             [2]
#> 2 b     b             [1]
tidytable doesn't generate this same data frame with grouping information because data.table doesn't.
@statzhero is there a reason you need to mix them?
Thanks! Re your question: only legacy code. The problem is that eventually a dplyr function will always pop up, e.g. dplyr::rows_patch(), that is not supported (I think).
Ah gotcha. You're right, the rows_*() functions aren't currently implemented in tidytable. I'll open an issue for those.
In the future, when you do run into a function that isn't implemented, can you open an issue to request it? It would help me make sure I have everything implemented in tidytable.
I suppose one cannot mix dplyr and tidytable after grouping? I don't know why I thought this would work. Perhaps because something like tidytable::arrange() could be mixed in without any bad side effects I can think of.
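For legacy code that has to hand a grouped tidytable to dplyr, one possible workaround is to rebuild the grouping explicitly on the dplyr side. The sketch below assumes both packages are installed; as_dplyr_grouped() is a hypothetical helper written for this example, not a function from either package, and it assumes its input is a grouped tidytable.

```r
# Hypothetical helper: turn a grouped tidytable into a dplyr grouped_df
as_dplyr_grouped <- function(tt) {
  # tidytable keeps the grouping column names as a character vector
  group_cols <- attr(tt, "groups")
  # Drop the grouping and the data.table classes before regrouping
  plain <- as.data.frame(tidytable::ungroup(tt))
  # Let dplyr compute its own "groups" attribute from the column names
  dplyr::group_by(plain, dplyr::across(dplyr::all_of(group_cols)))
}

tt <- tidytable::group_by(mtcars, cyl)
gd <- as_dplyr_grouped(tt)

class(gd)
dplyr::group_vars(gd)
dplyr::summarise(gd, mean_mpg = mean(mpg))
```

The conversion copies the data and makes dplyr recompute the groups, so it is probably best kept at the boundary between the two code paths rather than applied inside a hot loop.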