Open jacob-long opened 5 years ago
I've changed the title of this issue to reflect Earo's suggestion to me that `panel_data` be replaced by `tsibble`. Hopefully the new issue title will attract some attention from others because I'd want some outside feedback before going forward with it.
`tsibble` is a data class for temporal data that encompasses data like panels as well as several other formats that would not normally be considered panel data. Here is a paper Earo shared with me describing `tsibble` objects: https://pdf.earo.me/tsibble.pdf. I believe Earo and others have created many other resources documenting the `tsibble` package and object class.
Similar to the discussion in #9, there is an aspect of slightly redundant object classes here that could lead to potential confusion among users as well as areas in which packages/functions should be compatible but are not.
My first thought was that it might make sense to make `panel_data` inherit from `tsibble`; this would be consistent conceptually with the fact that panel data is a special class of temporal data. `tsibble` seeks to give a general way to store temporal data, of which panel data is one kind. Another way to put it, paraphrasing Earo's view, is that there is no real difference between time series and panel data until one starts modeling.
If `panel_data` inherited from `tsibble`, the drawback is that tidyverse methods do not preserve the `panel_data` class unless you do what I have done thus far and manually define S3 methods for those functions.
For example:
library(tibble)
library(tsibble)
new <- tsibble(
  qtr = rep(yearquarter("201001") + 0:9, 3),
  group = rep(c("x", "y", "z"), each = 10),
  value = rnorm(30),
  key = group
)
# Make a new class that inherits from tsibble
new <- new_tibble(new, nrow = nrow(new), class = c("new_tsibble", "tbl_ts"))
class(new)
# [1] "new_tsibble" "tbl_ts" "tbl_df" "tbl" "data.frame"
# Pass to tidyverse function
new %>%
  dplyr::mutate(
    new_var = value
  ) %>%
  class
# [1] "tbl_ts" "tbl_df" "tbl" "data.frame"
Interestingly, this is no longer the case when inheriting only from `tibble`, although I'm almost certain it used to be.
new <- new_tibble(mtcars, nrow = nrow(mtcars), class = c("new_tibble"))
class(new)
# [1] "new_tibble" "tbl_df" "tbl" "data.frame"
new %>%
  dplyr::mutate(
    new_var = mpg
  ) %>%
  class
# [1] "new_tibble" "tbl_df" "tbl" "data.frame"
Anyway, one of the tidyverse principles is to reuse existing data structures, and `tsibble` is clearly an existing data structure and not necessarily inconsistent with my understanding of what panel data are.
Of course, more than anything else my priority is another tidyverse principle, design for humans. On one hand, redundant data structures are human-unfriendly.
On the other hand, one of my goals is to lure people over from Stata and its `xt` suite of panel data commands, which exist somewhat separately from Stata's `ts` time series functions. That being said, if Stata did everything right we'd all be using that instead. At any rate, my top priority is for people who use this package to easily understand the correct structure of the data and how things work. Maybe I am wrong (part of the reason I ask for feedback), but my thought when developing this package was that there would be very little overlap between people wanting to use `panelr` and people wanting to do time series analysis, at least with the same data.
I want to make sure, as I said in #9, that panel data can be a first-class citizen in R. One of my concerns about deprecating the `panel_data` class is that panel data goes back to second-class citizen status and I will need to work on showing my students and others how to think of panel data as a special kind of time series data. Maybe I overstate this problem.
The other part that will have me dragging my feet is that this package and at least one other package that depends on it will need major refactoring if I deprecate the `panel_data` class.
As I think and poke around further, to me the main problem is that subclasses created by `new_tsibble()` are not retained when used in tidyverse functions.
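The workaround mentioned earlier, manually defining S3 methods so that a subclass survives a tidyverse verb, can be sketched roughly as follows. This is a minimal illustration, not panelr's actual code: the class name `demo_panel` and the constructor `new_demo()` are hypothetical, and the example subclasses a plain tibble rather than a tsibble to keep it self-contained.

```r
library(dplyr)
library(tibble)

# Hypothetical constructor: a tibble subclass used only for illustration
new_demo <- function(x) {
  new_tibble(x, nrow = nrow(x), class = "demo_panel")
}

# S3 method for dplyr::mutate(): run the next method in the class chain,
# then put the subclass back in front in case it was dropped along the way
mutate.demo_panel <- function(.data, ...) {
  out <- NextMethod()
  class(out) <- union("demo_panel", class(out))
  out
}

d <- new_demo(data.frame(x = 1:3))
res <- mutate(d, y = x * 2)
inherits(res, "demo_panel")
```

The cost of this approach is that a method like this has to be written for every verb whose result should keep the subclass, which is the maintenance burden described above.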
There are two other, related issues relating to how `tsibble` works compared to `panel_data` (although this is really about grouped `tibble`). Two pieces of functionality motivated my creation of the `panel_data` class:
I wanted the use of `lag()` to work right. If you lag in wave 1, then you get NA back. As it is, this only works inside `mutate()` and `transmute()`, but that was good enough for me, for now.
`lag()` is easy and correct within `mutate()` and `transmute()`. By default, applying these to `tsibble` objects results in the usual wrong values. There are good reasons for that, I think, because `tsibble` is a more general approach to temporal data that allows and requires the user to make an explicit decision about the implicit grouping (or non-grouping) in the data. These are not insurmountable problems, but paired with the issue about tidyverse functions dropping subclasses this results in a constant need to group the data.
Stopping by to say +100 to using `tsibble`.
Thanks for documenting the thinking process. I'll add my thoughts in this thread.
Subclassing would make sense if `panel_data` has additional attributes that `tsibble` doesn't have. We just phrase them differently depending on the context, without actual difference. But subclassing is useful for specialist methods such as `print()`, `summary()` and plotting, because they appear more like panel data.
In both the paper and the package, `tsibble` is addressed as a temporal data frame rather than time series. (I'm going to de-emphasize the time series aspect of the `tsibble` in the paper for revision.) Thus, the identifying variable `key` is preferred over `series`/`panels`. I think panel data isn't a special kind of time series, but both are equally important as temporal data. I refer to `tsibble` as time series objects most of the time, because the data that I deal with are long and the subsequent analyses are oriented towards time series forecasting.
`tsibble` isn't a grouped tibble?
library(dplyr, warn.conflicts = FALSE)
iris <- as_tibble(iris)
grped_iris <- group_by(iris, Species)
bench::mark(
  filter = filter(iris, Sepal.Width > 3.5),
  grouped_filter = filter(grped_iris, Sepal.Width > 3.5),
  check = FALSE
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 filter 65.3µs 70.3µs 13732. 61.9KB 10.5
#> 2 grouped_filter 108.4µs 122.2µs 7753. 14KB 8.80
`key` plays a permanent (and essential) role in `tsibble`.
Users would `group_by()` panels for grouped analysis and then `ungroup()`, if not going through the panelr docs. But they need to learn a new verb `unpanel()` in order to perform overall analysis. For panel data users, individual-level calculations by default for panel data are neat, but when switching to a general context (i.e. working with a data frame), `lag()` isn't working as they would expect unless they explicitly `group_by()`. Maybe grouping is a pain, but we can leave this to the tidyverse team https://github.com/tidyverse/dplyr/issues/4166
`tsibble` strips off subclasses?
Subclasses are assumed to have some attributes. `tsibble` has no idea about how to handle those attributes. Similarly, `dplyr` and `tibble` don't know the strict ordering for a `tsibble`, so the row-wise verbs for `tsibble` have to be defined to account for the change in ordering.
> I wanted the use of `lag()` to work right. If you lag in wave 1, then you get NA back. As it is, this only works inside `mutate()` and `transmute()`, but that was good enough for me, for now.

Doesn't `tsibble` give `NA` back in wave 1, if using `dplyr::lag()`?
Actually, replacing, subclassing, or coercion all sound good. I can do PRs on this, when you come to a decision.
Thanks for engaging on this, Earo.
I'm still thinking that subclassing is the likely route I would want to go. Although I was previously talking about some of the things that made me reluctant to even subclass, I want to be clear that `tsibble` is great and I want to encourage users to use those tools (and I don't want to duplicate them). Subclassing gives me some control over what I can consider to be valid input as well as avoiding breaking old code.
One issue with subclassing will be how to deal with multiple `key` columns, but that's a minor issue and will be something I can figure out.
As for grouping, I can see both pros and cons. I think it's right for `tsibble` not to be grouped, because it's not always appropriate (and for the performance reasons you raised); I just have not been thinking of `panel_data` in that way. I also have in mind the material I have out there suggesting `panel_data` objects are grouped. But I see your point about `ungroup()` vs. `unpanel()` and so on. Finding a way to get the lagging right is actually more important to me, and I have only been able to get it right by grouping.
Let me give an example of what I consider "right" behavior with regard to lagged variables.
tribble(
  ~id, ~time, ~x,
  "A", 1, 2,
  "A", 2, 3,
  "A", 3, 4,
  "B", 1, 1,
  "B", 2, 2,
  "B", 3, 3,
  "C", 1, 5,
  "C", 2, 6,
  "C", 3, 7
) %>%
  as_tsibble(
    key = id, index = time
  ) %>%
  mutate(
    lag_x = lag(x)
  )
returns
# A tsibble: 9 x 4 [1]
# Key:       id [3]
  id     time     x lag_x
  <chr> <dbl> <dbl> <dbl>
1 A         1     2    NA
2 A         2     3     2
3 A         3     4     3
4 B         1     1     4
5 B         2     2     1
6 B         3     3     2
7 C         1     5     3
8 C         2     6     5
9 C         3     7     6
But what I consider to be the "correct" behavior (at least in the context of panel data) would return
# A tsibble: 9 x 4 [1]
# Key:       id [3]
  id     time     x lag_x
  <chr> <dbl> <dbl> <dbl>
1 A         1     2    NA
2 A         2     3     2
3 A         3     4     3
4 B         1     1    NA
5 B         2     2     1
6 B         3     3     2
7 C         1     5    NA
8 C         2     6     5
9 C         3     7     6
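For reference, the panel-style result above is what explicit grouping produces. The sketch below uses a plain tibble with the same values rather than a tsibble, so it only assumes that `dplyr` is installed; the object names `panel` and `lagged` are made up for the illustration.

```r
library(dplyr)

# Same toy data as above, kept as an ordinary tibble
panel <- tribble(
  ~id, ~time, ~x,
  "A", 1, 2,
  "A", 2, 3,
  "A", 3, 4,
  "B", 1, 1,
  "B", 2, 2,
  "B", 3, 3,
  "C", 1, 5,
  "C", 2, 6,
  "C", 3, 7
)

# Lag within each id so that the first wave of every unit is NA
lagged <- panel %>%
  group_by(id) %>%
  mutate(lag_x = lag(x, order_by = time)) %>%
  ungroup()

lagged$lag_x
```

Passing `order_by = time` makes the lag follow the wave variable even if the rows are not already sorted within each unit.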
Of course, as you noted, grouping doesn't fix the behavior of `lag()` outside the tidyverse. `plm` has an interesting solution to this, making any column extracted from its `pdata.frame` class into a `pseries` vector that works sensibly with `lag()` outside of tidy functions. That said, I doubt it's workable in the tidyverse and I think it tends to perform slowly, though I'm not sure about that.
@jacob-long since you have said you care a lot about ensuring that `lag()` works correctly, you might also be interested in this issue https://github.com/tidyverts/tsibble/issues/55
@earowang I have a slightly optimistic feeling that some kind of vctrs integration into dplyr will help with the subclass dropping problem. Perhaps it will look like calling `vec_restore()` at the end of a tsibble function with the signature `vec_restore(out, input_tbl)` which dispatches on `input_tbl`. If panelr defined a `vec_restore.panel_data` method, that would be triggered and `out` could be restored to a panel data object after the tsibble/tidyverse operation. I'm not quite sure how that's going to be rolled out through dplyr but I think something along those lines might be the idea.
This discussion is sort of related to this issue, which is about restoring to a grouped data frame vs normal tibble as the output of a tidyr function https://github.com/r-lib/vctrs/issues/211
`vctrs` may hold the answer here. As @earowang mentioned in tidyverts/tsibble#55, if each column besides the identifier and time indicator(s) is a `vec_ts` object or some such, it can respect the gaps/missingness and presumably the grouped nature of the data.
@earowang mentioned that it would be a good idea for `panelr` to accept `tsibble` objects as input to its functions, which should involve the creation of an `as_panel_data()` method for `tsibble` objects.
This should be fairly straightforward as `id` in `panelr` is comparable to `key` in `tsibble` and `wave` in `panelr` is comparable to `index` in `tsibble`. An important distinction is that `tsibble` objects can have multiple `key` columns. In any case, both `panel_data` and `tsibble` objects are basically modified grouped `tibble` objects, so it shouldn't be hard to reconcile the two. Earo notified me that my understanding of `tsibble` as a grouped `tibble` is not correct.
It might be a good idea for me to look at how `tsibble` handles multiple `key` values for some insight into how to deal with #13.
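As a rough sketch of the correspondence described above (`id` to `key`, `wave` to `index`), the helper below reads both roles off a tsibble using `key_vars()` and `index_var()` from the `tsibble` package. The function `panel_roles()` is hypothetical and belongs to neither package, and refusing multiple key columns is just one possible way to handle that case, not a decision made in this thread.

```r
library(tsibble)

# Hypothetical helper: report which columns of a tsibble would play the
# id and wave roles in a panel_data object
panel_roles <- function(x) {
  stopifnot(is_tsibble(x))
  keys <- key_vars(x)
  if (length(keys) != 1) {
    # Assumption for this sketch only: exactly one key column is required
    stop("Expected exactly one key column, found ", length(keys))
  }
  list(id = keys, wave = index_var(x))
}

ts <- tsibble(
  id = rep(c("A", "B"), each = 3),
  time = rep(1:3, times = 2),
  x = c(2, 3, 4, 1, 2, 3),
  key = id, index = time
)

roles <- panel_roles(ts)
```

An actual `as_panel_data()` method could pass these two names on to the `panel_data` constructor; how multiple key columns should be combined is the open question raised in #13.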