markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/

grouping with tidytable does not transfer to dplyr #691

Closed statzhero closed 1 year ago

statzhero commented 1 year ago

I suppose one cannot mix dplyr and tidytable after grouping? I don't know why I thought this would work. Perhaps because something like tidytable::arrange() can be mixed into a dplyr pipeline without any bad side effects that I can think of.

> mtcars |> tidytable::group_by(cyl) |> dplyr::mutate(vs = vs * n()) |> count(vs)
# A grouped tidytable: 2 × 2
     vs     n
  <dbl> <int>
1     0    18
2    32    14
> mtcars |> tidytable::group_by(cyl) |> tidytable::mutate(vs = vs * n()) |> count(vs)
# A grouped tidytable: 3 × 2
     vs     n
  <dbl> <int>
1     0    18
2     7     4
3    11    10

moutikabdessabour commented 1 year ago

The issue stems from the difference between the S3 classes returned by each method. The solution could be as easy as adding grouped_df to the class of the returned tidytable (or it may require more work).

Update: This won't be a quick fix, as there is a major difference between how dplyr and data.table handle grouping. In data.table you only have to pass the grouping variable names to the by argument of DT[i, j, by]. dplyr, by contrast, computes the groups as soon as group_by() is called and saves them as an attribute on the data frame. Making them compatible would require implementing something like grouped_df in tidytable.

> mtcars |> dplyr::group_by(cyl) %>% str
gropd_df [32 × 11] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num [1:32] 160 160 108 258 360 ...
 $ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
 $ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
 - attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ cyl  : num [1:3] 4 6 8
  ..$ .rows: list<int> [1:3] 
  .. ..$ : int [1:11] 3 8 9 18 19 20 21 26 27 28 ...
  .. ..$ : int [1:7] 1 2 4 6 10 11 30
  .. ..$ : int [1:14] 5 7 12 13 14 15 16 17 22 23 ...
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE
> mtcars |> tidytable::group_by(cyl) %>% str
Classes ‘grouped_tt’, ‘tidytable’, ‘data.table’ and 'data.frame':       32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "groups")= chr "cyl"

statzhero commented 1 year ago

Thanks! The development version of dplyr contains (so it says) a major speed improvement for the dplyr::group_by() function, so hopefully the gap will matter less.

It would still be good to highlight this issue for novices; that is, one should not mix the two packages, especially now that they have the same function names throughout.

markfairbanks commented 1 year ago

@moutikabdessabour more or less covered the reasoning but I'll leave my answer as well.

tidytable and dplyr take different approaches to grouping. tidytable stores a character vector that records which columns are used for grouping. These variables are then passed to the by arg of data.table. At its most basic, a "tidytable" is a subclass of "data.table" that leaves all heavy computational work, including group calculation, to the data.table library.

library(tidytable, warn.conflicts = FALSE)

df <- tidytable(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3) %>%
  group_by(x, y)

class(df)
#> [1] "grouped_tt" "tidytable"  "data.table" "data.frame"

attr(df, "groups")
#> [1] "x" "y"

The group_by() in tidytable is a simple column selection. The actual computation of groups by data.table doesn't occur until you use mutate() or summarize() on a grouped tidytable.
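To make the "simple column selection" idea concrete, here is a minimal base-R sketch of lazy grouping. It is an illustration under stated assumptions, not tidytable's actual code: the helpers lazy_group_by() and lazy_add_count() are hypothetical names, and ave() stands in for the work that data.table's by argument would do.

```r
# Base-R sketch of lazy grouping. These helpers are hypothetical and are
# not part of tidytable or data.table.

# "Grouping" only records the column names; no group indices are computed.
lazy_group_by <- function(df, ...) {
  cols <- c(...)
  stopifnot(all(cols %in% names(df)))
  attr(df, "groups") <- cols
  df
}

# The groups are resolved only when a verb needs them. This verb adds a
# per-group row count, similar in spirit to mutate(n = n()).
lazy_add_count <- function(df) {
  by <- attr(df, "groups")
  # interaction() builds one factor from the grouping columns; ave() then
  # applies length() within each level and returns a full-length vector.
  g <- interaction(df[by], drop = TRUE)
  df$n <- ave(seq_len(nrow(df)), g, FUN = length)
  df
}

df <- data.frame(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3)
grouped <- lazy_group_by(df, "x", "y")

attr(grouped, "groups")    # only the column names are stored
lazy_add_count(grouped)$n  # group sizes are computed at this point
```

The grouped object is the same size as the ungrouped one plus a short character vector, which is why this style of group_by() is cheap.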

dplyr, on the other hand, is in charge of its own grouping calculation. A grouped data frame in dplyr is a subclass of a tibble, and dplyr attaches a data frame with grouping information to its own version of the "groups" attribute. The calculation of group locations occurs when you call group_by().

library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3) %>%
  group_by(x, y)

class(df)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

attr(df, "groups")
#> # A tibble: 2 × 3
#>   x     y           .rows
#>   <chr> <chr> <list<int>>
#> 1 a     a             [2]
#> 2 b     b             [1]

tidytable doesn't generate this same data frame with grouping information because data.table doesn't.
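The practical consequence can be sketched in base R with S3 dispatch. This is a simplified model and not dplyr's real implementation: eager_group_by(), group_size() and the class names eager_grouped and lazy_grouped are all hypothetical. It shows one plausible reading of the original report: a verb written for an eagerly grouped class does not recognize the other package's class, falls through to the plain data.frame method, and treats the whole table as a single group.

```r
# Base-R sketch of eager grouping plus S3 dispatch. Every name here is
# hypothetical; this is not how dplyr is actually implemented.

# Eager grouping: row positions are computed immediately and stored on the
# object, and the class is extended so that verbs can dispatch on it.
eager_group_by <- function(df, col) {
  rows <- split(seq_len(nrow(df)), df[[col]])
  attr(df, "groups") <- list(keys = names(rows), rows = unname(rows))
  class(df) <- c("eager_grouped", class(df))
  df
}

# A verb that reports the size of the group each row belongs to.
group_size <- function(df) UseMethod("group_size")

# Grouped method: reads the precomputed row positions.
group_size.eager_grouped <- function(df) {
  out <- integer(nrow(df))
  for (idx in attr(df, "groups")$rows) out[idx] <- length(idx)
  out
}

# Fallback for plain data frames: the whole table is one group.
group_size.data.frame <- function(df) rep(nrow(df), nrow(df))

df <- data.frame(cyl = c(4, 4, 6), vs = c(1, 0, 1))

eager <- eager_group_by(df, "cyl")

# A lazily grouped object carries only column names and a class the verb
# does not know, so dispatch falls through to the data.frame method.
lazy <- df
attr(lazy, "groups") <- "cyl"
class(lazy) <- c("lazy_grouped", class(lazy))

group_size(eager)  # sizes per cyl group
group_size(lazy)   # grouping is silently ignored; every row sees nrow(df)
```

No error or warning is raised in the second call, which matches the symptom at the top of this issue, where n() evaluated to 32 for every row.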

@statzhero is there a reason you need to mix them?

statzhero commented 1 year ago

Thanks! Re your question: only legacy code. The problem is that eventually a dplyr function that is not supported (I think) will pop up, e.g. dplyr::rows_patch().

markfairbanks commented 1 year ago

Ah gotcha. You're right, the rows_ functions aren't currently implemented in tidytable. I'll open an issue for those.

markfairbanks commented 1 year ago

In the future, when you do run into a function that isn't implemented, can you open an issue to request it? It would help me be sure I have everything implemented in tidytable.