tidyverts / tsibble

Tidy Temporal Data Frames and Tools
https://tsibble.tidyverts.org
GNU General Public License v3.0
528 stars 50 forks source link

`count` and `tally` methods for grouped tsibbles would be helpful #232

Closed davidtedfordholt closed 2 years ago

davidtedfordholt commented 3 years ago

This is especially true for logging and testing input and output data in production systems. I think count and tally would return tibble objects with an n column, while add_count and add_tally would return tsibbles with the same key/index structure as the input.

I'm working on a PR for it at the moment, but need to make sure that it works for double-indexed tsibbles, as well as writing tests.

davidtedfordholt commented 3 years ago

It seems worthwhile to let people group by something other than the full key structure, but that it's worth pointing out when the tsibble is grouped and the person might be expecting the behavior of group_by_key().

My thoughts on behavior (am willing to do otherwise):

  1. count(tbl_ts) >> tbl
    • with as many rows as unique key combinations
  2. add_count(tbl_ts) >> tbl_ts
    • tbl_ts equivalent to input, with column n
  3. count(grouped_ts) >> tbl
    • with as many rows as unique groups.
    • If groups(grouped_ts) != key_vars(grouped_ts), then use groups but message that they aren't the same.
  4. add_count(grouped_ts) >> grouped_ts
    • with as many rows as unique key combinations.
    • keeping input groups seems to make more sense than mirroring summarise() and dropping one group in this case.
    • If groups(grouped_ts) != key_vars(grouped_ts), then use groups but message that they aren't the same. This would lead to a tsbl with each row including the count based on the group, not unique key combination.
    • This message perhaps should be a warn.
earowang commented 3 years ago

Yes, these methods would be helpful. The proposed behaviours look good, but better if no message(). PRs are welcome.

FYI, {dplyr} hasn't made count() generic yet https://github.com/tidyverse/dplyr/issues/5538

davidtedfordholt commented 3 years ago

Gotcha! I will get on it. Would you prefer I create the generic for count() and default to dplyr::count(), or just write it as though count() is generic and just wait on the merge until it is?

davidtedfordholt commented 3 years ago

Generic count() and tally() are there, so I'm working on a PR.

davidtedfordholt commented 3 years ago

@earowang , would you expect a tibble that includes both State and Animal from this call, or just State?

tsibbledata::aus_livestock %>%
  group_by(Animal) %>%
  count(State)

I would expect one that only includes State and n, but I can see an argument both ways. In particular, the behavior in dplyr is to use both in the grouping.

earowang commented 3 years ago

To be consistent with {dplyr} behaviours, I think they include Animal and State in the output, but keep Animal as the grouping variable in the end.