tidyverts / feasts

Feature Extraction And Statistics for Time Series
https://feasts.tidyverts.org/

Add functions that return data frames for specific plots and remove `gg_lag()`, `gg_subseries()`, etc. #65

Closed zenggyu closed 5 years ago

zenggyu commented 5 years ago

feasts provides gg_lag(), gg_subseries(), etc., which return plots directly. This is convenient, but it limits users' ability to customize the visualization. I think it would be nice if the package provided intermediate functions that return a data frame containing the processed data needed to make the plot. Users could then build the plot with their own ggplot() statements or with the generic function autoplot(), and gg_lag(), gg_subseries(), etc. could be removed.

Similar ideas have been implemented in yardstick::roc_curve(), yardstick::pr_curve(), and feasts::ACF(). Besides the main data, which can be stored in a data frame, additional information can be stored as attributes of the data frame (e.g., the data required to plot the dashed lines in an ACF plot, or additional class information that indicates how autoplot() should plot the data).

I believe this proposal would make the package more powerful and more consistent with the tidyverse. What do you think?
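
The attribute-based design described above can be sketched in base R. This is only an illustration under stated assumptions: `acf_data()` and the `"acf_data"` class are hypothetical names, not part of feasts or yardstick, and the dashed-line bound uses the usual white-noise approximation rather than whatever feasts::ACF() stores.

```r
# Hypothetical helper (not a feasts function): return ACF values as a data
# frame and keep the information needed for the dashed lines as an attribute.
acf_data <- function(x, lag_max = 10) {
  res <- stats::acf(x, lag.max = lag_max, plot = FALSE)
  out <- data.frame(
    lag = as.numeric(res$lag)[-1],  # drop lag 0
    acf = as.numeric(res$acf)[-1]
  )
  # Approximate 95% bound under white noise: qnorm(0.975) / sqrt(n)
  attr(out, "ci") <- stats::qnorm(0.975) / sqrt(res$n.used)
  # Extra class that a (hypothetical) autoplot() method could dispatch on
  class(out) <- c("acf_data", class(out))
  out
}

acf_tbl <- acf_data(as.numeric(datasets::AirPassengers))

# A user-defined plot built from the returned data (only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(acf_tbl, ggplot2::aes(lag, acf)) +
    ggplot2::geom_col(width = 0.1) +
    ggplot2::geom_hline(
      yintercept = c(-1, 1) * attr(acf_tbl, "ci"),
      linetype = "dashed"
    )
}
```

Because the result is still a data frame, it can be passed to any ggplot() call, while the attribute and class carry the extra plotting information.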

mitchelloharawild commented 5 years ago

While it is useful to be able to easily extract the data used in gg_*(), I think that it should return a tsibble/tibble, rather than something with a new class that can be passed into autoplot(). I wasn't aware of this usage in yardstick, though, so I'll definitely look into it more. One thing I definitely want to avoid is functions that produce both data and a plot (such as stats::acf()).

As you mention for the ACF, it would be nice for the data to also contain the dashed lines. I completely agree with you here and think that the ACF object should contain a distribution column. I've mentioned this briefly in issue #1. Note that because of this added complexity/structure in the ACF object, I'm using a new class here to support the autoplot() method. I'm still questioning whether this is the correct approach.

For the case of gg_subseries() and gg_season() this manipulation should be fairly simple (as shown below). gg_lag() is a bit more difficult, but it would be analogous to stats::embed() for a tsibble. This would be something that I think is better suited for the tsibble package by @earowang.


gg_subseries()

library(feasts)
library(dplyr)
# The plot itself
tsibbledata::aus_production %>% 
  gg_subseries(Beer)

# The data behind the plot: one facet per quarter
tsibbledata::aus_production %>% 
  transmute(Beer, facet = quarters(Quarter))
#> # A tsibble: 218 x 3 [1Q]
#>    Quarter  Beer facet
#>      <qtr> <dbl> <chr>
#>  1 1956 Q1   284 Q1   
#>  2 1956 Q2   213 Q2   
#>  3 1956 Q3   227 Q3   
#>  4 1956 Q4   308 Q4   
#>  5 1957 Q1   262 Q1   
#>  6 1957 Q2   228 Q2   
#>  7 1957 Q3   236 Q3   
#>  8 1957 Q4   320 Q4   
#>  9 1958 Q1   272 Q1   
#> 10 1958 Q2   233 Q2   
#> # … with 208 more rows

Created on 2019-07-22 by the reprex package (v0.3.0)

gg_season()

library(feasts)
library(dplyr)
# The plot itself
tsibbledata::aus_production %>% 
  gg_season(Beer)

# The data behind the plot: one colour per year
tsibbledata::aus_production %>% 
  transmute(Beer, colour = lubridate::year(Quarter))
#> # A tsibble: 218 x 3 [1Q]
#>    Quarter  Beer colour
#>      <qtr> <dbl>  <dbl>
#>  1 1956 Q1   284   1956
#>  2 1956 Q2   213   1956
#>  3 1956 Q3   227   1956
#>  4 1956 Q4   308   1956
#>  5 1957 Q1   262   1957
#>  6 1957 Q2   228   1957
#>  7 1957 Q3   236   1957
#>  8 1957 Q4   320   1957
#>  9 1958 Q1   272   1958
#> 10 1958 Q2   233   1958
#> # … with 208 more rows

Created on 2019-07-22 by the reprex package (v0.3.0)
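
For gg_lag(), the stats::embed() analogy mentioned above can be sketched on a plain numeric vector. This is a minimal base R illustration, not a tsibble method: the `beer` vector holds the first eight Beer values from the output above, and `max_lag` and the column names `value`, `lagged`, and `lag` are chosen here for illustration.

```r
# First eight Beer values, copied from the printed output above
beer <- c(284, 213, 227, 308, 262, 228, 236, 320)
max_lag <- 2

# embed() returns one row per usable time point:
# column 1 is y[t], column k + 1 is y[t - k]
wide <- stats::embed(beer, max_lag + 1)

# Reshape to long format: one row per (observation, lag) pair,
# which is the shape a faceted lag plot would need
lag_tbl <- data.frame(
  value  = rep(wide[, 1], times = max_lag),
  lagged = as.vector(wide[, -1]),
  lag    = rep(seq_len(max_lag), each = nrow(wide))
)
lag_tbl
```

Plotting `value` against `lagged`, faceted by `lag`, would then give a lag plot under the user's own ggplot() code.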

mitchelloharawild commented 5 years ago

You can of course also access the plot data from a ggplot object using ggplot2::layer_data() :smile:
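
As a small sketch of that approach, the example below builds an ordinary ggplot object and pulls the computed data for its first layer. It assumes ggplot2 is installed; the data frame `beer_df` is constructed here from the first eight Beer values shown earlier, and the plot is a hand-made stand-in for a subseries plot rather than the output of gg_subseries().

```r
library(ggplot2)

# Small illustrative data set: first eight Beer values from the output above
beer_df <- data.frame(
  year    = rep(c(1956, 1957), each = 4),
  quarter = rep(c("Q1", "Q2", "Q3", "Q4"), times = 2),
  beer    = c(284, 213, 227, 308, 262, 228, 236, 320)
)

# A hand-made subseries-style plot: one panel per quarter
p <- ggplot(beer_df, aes(year, beer)) +
  geom_line() +
  facet_wrap(~ quarter)

# Extract the data ggplot2 computed for the first layer
plot_data <- layer_data(p, 1)
head(plot_data)
```

The returned data frame uses ggplot2's internal aesthetic names (such as `x`, `y`, and `PANEL`) rather than the original column names, so it is better suited to inspection than to reuse as tidy input.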