jacob-long / panelr

Regression models and utilities for repeated measures and panel data

Replace `panel_data` class with `tsibble`? #15

Open jacob-long opened 5 years ago

jacob-long commented 5 years ago

@earowang mentioned that it would be a good idea for panelr to accept tsibble objects as input to its functions, which should involve the creation of an as_panel_data() method for tsibble objects.

This should be fairly straightforward, since id in panelr is comparable to key in tsibble and wave in panelr is comparable to index in tsibble. An important distinction is that tsibble objects can have multiple key columns. In any case, both panel_data and tsibble objects are basically modified grouped tibble objects, so it shouldn't be hard to reconcile the two. (Correction: Earo notified me that my understanding of tsibble as a grouped tibble is not correct.)
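As a rough sketch of the id/key and wave/index correspondence described above, the snippet below reads the key and index names off a tsibble and refuses objects with more than one key column. `panel_roles()` is a hypothetical helper written for illustration, not part of panelr or tsibble; only `is_tsibble()`, `key_vars()`, and `index_var()` are existing tsibble functions.

```r
library(tsibble, warn.conflicts = FALSE)

# Hypothetical helper (not in panelr or tsibble): report which tsibble
# columns would play the roles of panelr's `id` and `wave`.
panel_roles <- function(x) {
  stopifnot(is_tsibble(x))
  keys <- key_vars(x)
  # panel_data expects a single id column, so multiple keys are rejected here
  if (length(keys) != 1) {
    stop("Expected exactly one key column, found ", length(keys), ".")
  }
  list(id = keys, wave = index_var(x))
}

ts <- tsibble(
  id = rep(c("A", "B"), each = 3),
  wave = rep(1:3, times = 2),
  x = c(2, 3, 4, 1, 2, 3),
  key = id, index = wave
)

roles <- panel_roles(ts)
```

An actual as_panel_data() method would still need to decide what to do with multiple key columns rather than simply erroring.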

It might be a good idea for me to look at how tsibble handles multiple key values for some insight into how to deal with #13.

jacob-long commented 5 years ago

I've changed the title of this issue to reflect Earo's suggestion to me that panel_data be replaced by tsibble. Hopefully the new issue title will attract some attention from others because I'd want some outside feedback before going forward with it.

tsibble is a data class for temporal data that encompasses data like panels as well as several other formats that would not normally be considered panel data. Here is a paper Earo shared with me describing tsibble objects: https://pdf.earo.me/tsibble.pdf. I believe Earo and others have created many other resources documenting the tsibble package and object class.

Similar to the discussion in #9, there is an aspect of slightly redundant object classes here that could lead to potential confusion among users as well as areas in which packages/functions should be compatible but are not.

My first thought was that it might make sense to make panel_data inherit from tsibble; this would be consistent conceptually with the fact that panel data is a special class of temporal data. tsibble seeks to give a general way to store temporal data, of which panel data is one kind. Another way to put it, paraphrasing Earo's view, is that there is no real difference between time series and panel data until one starts modeling.

If panel_data inherited from tsibble, the drawback is that tidyverse methods do not preserve the panel_data class unless you do what I have done thus far and manually define S3 methods for those functions.

For example:

library(tibble)
library(tsibble)
new <- tsibble(
  qtr = rep(yearquarter("201001") + 0:9, 3),
  group = rep(c("x", "y", "z"), each = 10),
  value = rnorm(30),
  key = group
)
# Make a new class that inherits from tsibble
new <- new_tibble(new, nrow = nrow(new), class = c("new_tsibble", "tbl_ts"))
class(new)
# [1] "new_tsibble" "tbl_ts"      "tbl_df"      "tbl"         "data.frame" 

# Pass to tidyverse function
new %>% 
  dplyr::mutate(
    new_var = value
  ) %>% 
  class
# [1] "tbl_ts"     "tbl_df"     "tbl"        "data.frame"

Interestingly, this is no longer the case when inheriting only from tibble, although I'm almost certain it used to be.

new <- new_tibble(mtcars, nrow = nrow(mtcars), class = c("new_tibble"))
class(new)
# [1] "new_tibble" "tbl_df"     "tbl"        "data.frame"

new %>% 
  dplyr::mutate(
    new_var = mpg
  ) %>% 
  class
# [1] "new_tibble" "tbl_df"     "tbl"        "data.frame"
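The workaround of manually defining S3 methods could look roughly like the sketch below: the subclass method defers to the tsibble method and then puts the subclass back. `panel_sketch` is a made-up class name for illustration, and this is not necessarily how panelr implements its own methods; it also assumes the tsibble method accepts a subclassed input unchanged.

```r
library(dplyr, warn.conflicts = FALSE)
library(tsibble, warn.conflicts = FALSE)

ts <- tsibble(
  id = rep(c("A", "B"), each = 3),
  wave = rep(1:3, times = 2),
  x = c(2, 3, 4, 1, 2, 3),
  key = id, index = wave
)
# "panel_sketch" is a hypothetical subclass used only for this example
class(ts) <- c("panel_sketch", class(ts))

# Let the next method (here, tsibble's) do the work, then restore the subclass
mutate.panel_sketch <- function(.data, ...) {
  out <- NextMethod()
  class(out) <- union("panel_sketch", class(out))
  out
}

out <- mutate(ts, x2 = x * 2)
inherits(out, "panel_sketch")
```

The cost is that a method like this has to be written for every verb that should keep the class, which is the maintenance burden discussed in this thread.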

Anyway, one of the tidyverse principles is to reuse existing data structures and tsibble is clearly an existing data structure and not necessarily inconsistent with my understanding of what panel data are.

Of course, more than anything else my priority is another tidyverse principle, design for humans. On one hand, redundant data structures are human-unfriendly.

On the other hand, one of my goals is to lure people over from Stata and its xt suite of panel data commands, which exist somewhat separately from Stata's ts time series functions. That being said, if Stata did everything right we'd all be using that instead. At any rate, my top priority is for people who use this package to easily understand the correct structure of the data and how things work. Maybe I am wrong (part of the reason I ask for feedback), but my thought when developing this package is that there would be very little overlap between people wanting to use panelr and people wanting to do time series analysis, at least with the same data.

I want to make sure, as I said in #9, that panel data can be a first-class citizen in R. One of my concerns about deprecating the panel_data class is that it goes back to second-class citizen status and I will need to work on showing my students and others how to think of panel data as a special kind of time series data. Maybe I overstate this problem.

The other part that will have me dragging my feet is that this and at least one other package that depends on it will need major refactoring if I deprecate the panel_data class.

jacob-long commented 5 years ago

As I think and poke around further, to me the main problem is that subclasses created by new_tsibble() are not retained when used in tidyverse functions.

There are two other, related issues with how tsibble works compared to panel_data (although this is really about grouped tibbles). Two pieces of functionality motivated my creation of the panel_data class:

  1. I wanted the use of lag() to work right. If you lag in wave 1, then you get NA back. As it is, this only works inside mutate() and transmute(), but that was good enough for me, for now.
  2. I wanted it to be really easy to calculate individual-level means, which like lag() is easy and correct within mutate() and transmute().

By default, applying these to tsibble objects results in the usual wrong values. There are good reasons for that, I think, because tsibble is a more general approach to temporal data that allows and requires the user to make an explicit decision about the implicit grouping (or non-grouping) in the data. These are not insurmountable problems, but paired with the issue about tidyverse functions dropping subclasses this results in a constant need to group the data.
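To make the individual-level means point concrete, here is a small sketch using plain dplyr on an ordinary tibble (no panelr or tsibble involved). The data are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

panel <- tibble(
  id   = rep(c("A", "B", "C"), each = 3),
  wave = rep(1:3, times = 3),
  x    = c(2, 3, 4, 1, 2, 3, 5, 6, 7)
)

# Without grouping, mean(x) is the grand mean, repeated on every row
ungrouped <- mutate(panel, mean_x = mean(x))

# With explicit grouping, mean(x) is computed within each id
grouped <- panel %>%
  group_by(id) %>%
  mutate(mean_x = mean(x), within_x = x - mean_x) %>%
  ungroup()
```

The grouped version is what a panel_data object gives by default; on an ungrouped data frame the user has to ask for it.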

alexpghayes commented 5 years ago

Stopping by to say +100 to using tsibble.

earowang commented 5 years ago

Thanks for documenting the thinking process. I'll add my thoughts in this thread.

Subclassing would make sense if panel_data had additional attributes that tsibble doesn't have. As it is, we just phrase the same things differently depending on the context, without any actual difference. But subclassing is useful for specialist methods such as print(), summary() and plotting, because they can then appear more like panel data.

In both the paper and the package, tsibble is described as a temporal data frame rather than a time series. (I'm going to de-emphasize the time series aspect of tsibble in the paper revision.) Thus, the identifying variable is called key rather than series/panels. I think panel data isn't a special kind of time series, but both are equally important as temporal data. I refer to tsibble objects as time series most of the time, because the data that I deal with are long and the subsequent analyses are oriented towards time series forecasting.

Why isn't tsibble a grouped tibble?

  1. Operations applied to grouped data are slower than the whole data in terms of performance, although the results are equal.
library(dplyr, warn.conflicts = FALSE)
iris <- as_tibble(iris)
grped_iris <- group_by(iris, Species)
bench::mark(
  filter = filter(iris, Sepal.Width > 3.5),
  grouped_filter = filter(grped_iris, Sepal.Width > 3.5),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 filter           65.3µs   70.3µs    13732.    61.9KB    10.5 
#> 2 grouped_filter  108.4µs  122.2µs     7753.      14KB     8.80
  2. A grouped tibble/tsibble is a temporary state for the purpose of manipulation, but the key plays a permanent (and essential) role in tsibble.
  3. Tidyverse users would group_by() panels for grouped analysis and then ungroup(), if they have not gone through the panelr docs. But they need to learn a new verb, unpanel(), in order to perform overall analysis. For panel data users, individual-level calculations by default are neat, but when switching to a general context (i.e. working with a data frame), lag() doesn't work as they would expect unless they explicitly group_by(). Maybe grouping is a pain, but we can leave this to the tidyverse team: https://github.com/tidyverse/dplyr/issues/4166

Why does tsibble strip off subclasses?

Subclasses are assumed to have some attributes of their own, and tsibble has no idea how to handle those attributes. Similarly, dplyr and tibble don't know about the strict ordering of a tsibble, so the row-wise verbs have to be defined for tsibble in order to handle changes in ordering.


> I wanted the use of lag() to work right. If you lag in wave 1, then you get NA back. As it is, this only works inside mutate() and transmute(), but that was good enough for me, for now.

Doesn't tsibble give NA back in wave 1 if using dplyr::lag()?


Actually, replacing, subclassing, or coercion all sound good. I can do PRs on this, when you come to a decision.

jacob-long commented 5 years ago

Thanks for engaging on this, Earo.

I'm still thinking that subclassing is the likely route I would want to go. Although I was previously talking about some of the things that made me reluctant to even subclass, I want to be clear that tsibble is great and I want to encourage users to use those tools (and I don't want to duplicate them). Subclassing gives me some control over what I can consider to be valid input as well as avoiding breaking old code.

One issue with subclassing will be how to deal with multiple key columns, but that's a minor issue and will be something I can figure out.

As for grouping, I can see both pros and cons. I think it's right for tsibble not to be grouped, because it's not always appropriate (and for the performance reasons you raised); I just have not been thinking of panel_data in that way. I also have in mind the material I have out there suggesting panel_data objects are grouped. But I see your point about ungroup() vs. unpanel() and so on. Finding a way to get the lagging right is actually more important to me, and so far I have only been able to get it right by grouping.

Let me give an example of what I consider "right" behavior with regard to lagged variables.

tribble(
  ~id, ~time, ~x,
  "A",     1,  2,
  "A",     2,  3,
  "A",     3,  4,
  "B",     1,  1,
  "B",     2,  2,
  "B",     3,  3,
  "C",     1,  5, 
  "C",     2,  6,
  "C",     3,  7
) %>%
  as_tsibble(
    key = id, index = time 
  ) %>%
  mutate(
    lag_x = lag(x)
  )

returns

# A tsibble: 9 x 4 [1]
# Key:       id [3]
  id     time     x lag_x
  <chr> <dbl> <dbl> <dbl>
1 A         1     2    NA
2 A         2     3     2
3 A         3     4     3
4 B         1     1     4
5 B         2     2     1
6 B         3     3     2
7 C         1     5     3
8 C         2     6     5
9 C         3     7     6

But what I consider to be the "correct" behavior (at least in the context of panel data) would return

# A tsibble: 9 x 4 [1]
# Key:       id [3]
  id     time     x lag_x
  <chr> <dbl> <dbl> <dbl>
1 A         1     2    NA
2 A         2     3     2
3 A         3     4     3
4 B         1     1    NA
5 B         2     2     1
6 B         3     3     2
7 C         1     5    NA
8 C         2     6     5
9 C         3     7     6
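For reference, one way to get this result with existing tools, at the cost of explicit grouping, is tsibble's `group_by_key()`. This is only a sketch of the grouped workaround on the same toy data, not a proposal for how panelr should do it.

```r
library(dplyr, warn.conflicts = FALSE)
library(tsibble, warn.conflicts = FALSE)

panel_ts <- tibble(
  id   = rep(c("A", "B", "C"), each = 3),
  time = rep(1:3, times = 3),
  x    = c(2, 3, 4, 1, 2, 3, 5, 6, 7)
) %>%
  as_tsibble(key = id, index = time)

# Group by the key so that lag() restarts within each id
lagged <- panel_ts %>%
  group_by_key() %>%
  mutate(lag_x = lag(x)) %>%
  ungroup()

lagged$lag_x
# [1] NA  2  3 NA  1  2 NA  5  6
```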

Of course, as you noted, grouping doesn't fix the behavior of lag() outside the tidyverse. plm has an interesting solution to this, making any column extracted from its pdata.frame class into a pseries vector that works sensibly with lag() outside of tidy functions. That said, I doubt it's workable in the tidyverse and I think it tends to perform slowly, though I'm not sure about that.

DavisVaughan commented 5 years ago

@jacob-long since you have said you care a lot about ensuring that lag() works correctly, you might also be interested in this issue https://github.com/tidyverts/tsibble/issues/55

@earowang I have a slightly optimistic feeling that some kind of vctrs integration into dplyr will help with the subclass dropping problem. Perhaps it will look like calling vec_restore() at the end of a tsibble function with the signature vec_restore(out, input_tbl) which dispatches on input_tbl. If panelr defined a vec_restore.panel_data method, that would be triggered and out could be restored to a panel data object after the tsibble/tidyverse operation. I'm not quite sure how that's going to be rolled out through dplyr but I think something along those lines might be the idea.

This discussion is sort of related to this issue, which is about restoring to a grouped data frame vs normal tibble as the output of a tidyr function https://github.com/r-lib/vctrs/issues/211
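The restore-on-the-way-out idea can be illustrated with a toy base R version. To be clear, this is not the vctrs API: `restore()` and the `panel_sketch` class are hypothetical names invented for this sketch, and the only point is to show a generic that dispatches on the original input rather than on the output.

```r
# Hypothetical generic (NOT vctrs::vec_restore): dispatch on `to`, the original input
restore <- function(out, to) UseMethod("restore", to)
restore.default <- function(out, to) out

# Method a package could define to turn the output back into its own class
restore.panel_sketch <- function(out, to) {
  attr(out, "id") <- attr(to, "id")
  attr(out, "wave") <- attr(to, "wave")
  class(out) <- union("panel_sketch", class(out))
  out
}

df <- data.frame(id = c("A", "A", "B", "B"), wave = c(1, 2, 1, 2), x = c(2, 3, 1, 2))
panel <- structure(df, id = "id", wave = "wave", class = c("panel_sketch", "data.frame"))

# Stand-in for a verb that drops the subclass and its attributes
out <- transform(as.data.frame(panel), x2 = x * 2)

# Calling restore() at the end of the verb brings the subclass back
restored <- restore(out, panel)
class(restored)
```

Whether dplyr or tsibble would actually call something like this at the end of each verb is exactly the open question raised above.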

jacob-long commented 5 years ago

vctrs may hold the answer here. As @earowang mentioned in tidyverts/tsibble#55, if each column besides the identifier and time indicator(s) is a vec_ts object or some such, it can respect the gaps/missingness and presumably the grouped nature of the data.