tidyverts / fabletools

General fable features useful for extension packages
http://fabletools.tidyverts.org/
89 stars 31 forks source link

can reconcile() be used for distributional forecasts #334

Open alexhallam opened 2 years ago

alexhallam commented 2 years ago

I would like to know if reconcile() can be used with distributional forecasts.

As a reproducible example I took the m5 data and generated naive forecasts for total, ca, wi, and tx. The total is the sum of the three states. I generated 8000 samples over a 10 day forecast horizon. The data set can be grabbed with the following.

wget https://storage.googleapis.com/data_xvzf/m5_hierarchical_forecast.csv

This is the head

      ds         rep y_sim_total y_sim_ca y_sim_tx y_sim_wi 
1     2016-05-23 0   42171.      15851    13205.   13970.   
2     2016-05-23 1   30535.      13646.   13284.   13211.   
3     2016-05-23 2   44897.      16215.   13132.   12586.   
4     2016-05-23 3   39906.      16860.   11361.   13291.   
5     2016-05-23 4   47499.      18238.   13715.   13108.   
6     2016-05-23 5   46119.      18963.   9978.    11286.   
7     2016-05-23 6   46307.      18532.   13197.   12959.   
8     2016-05-23 7   35262.      18930.   11556.   9308.    
9     2016-05-23 8   43877.      18618.   11755.   14966.   

Can I use the reconcile() function to reconcile these forecasts to the total?

robjhyndman commented 2 years ago

Almost. You need to remove the total column and set up the data as a tsibble with state as the key. Then generate the total using aggregate_key(). There is an example at https://otexts.com/fpp3/prison.html.

The distributions are reconciled as described here: https://otexts.com/fpp3/rec-prob.html

alexhallam commented 2 years ago

Thanks for the quick response!

I think I see where I am getting hung up. I am starting with the output from probabilistic forecasts (downstream from where the example starts). It looks like the bottom_up() function expects a mable model. Do you recommend that users make their own mable object in this case? Maybe I can make a simple model based on lm(y_sim ~ 1).

fit <- prison_gts %>%
  filter(year(Quarter) <= 2014) %>%
  model(base = ETS(Count)) %>%
  reconcile(
    bottom_up = bottom_up(base),  # <--- needs a mable object
    MinT = min_trace(base, method = "mint_shrink")
  )

This is as far as I can get

df <- readr::read_csv("https://storage.googleapis.com/data_xvzf/m5_hierarchical_forecast.csv") %>% 
  select(-y_sim_total) %>%
  pivot_longer(cols = y_sim_ca:y_sim_wi, names_to = "state", values_to = "y_sim") %>% 
  as_tsibble(index = ds, key = c(rep, state)) %>% 
  aggregate_key(total = mean(y_sim)) %>% 
  reconcile(
    bottom_up = bottom_up(),  # <-- missing a mable object
    MinT = min_trace(method = "mint_shrink")
  )
mitchelloharawild commented 2 years ago

Currently reconcile() only works in conjunction with the forecast() function, but this is something I'm looking at changing. The complexity here is that you'll need to provide your own weighting matrix, as you no longer have access to the model's training residuals. Some weighting matrices (such as OLS and structural scaling) do not use the model's residuals, and will be easier to use.

At minimum, you'll need to convert your samples into a distribution using dist_sample(). This should be done by summarising over the replicates (grouping by ds and state).

From there you can achieve bottom up reconciliation by summing the distribution column in aggregate_key(). Reconciling optimally currently would require writing your own matrix algebra, but again I'd like to improve this.

alexhallam commented 2 years ago

I see what you are saying. Thanks for the info!

alexhallam commented 2 years ago

Thanks again for the help.

I just wanted to codify what you said above. This is where things are at the moment.

df <- readr::read_csv("https://storage.googleapis.com/data_xvzf/m5_hierarchical_forecast.csv") %>% 
  dplyr::select(-y_sim_total) %>%
  tidyr::pivot_longer(cols = y_sim_ca:y_sim_wi, names_to = "state", values_to = "y_sim") %>% 
  dplyr::group_by(ds, state) %>% 
  dplyr::summarise(dist = distributional::dist_sample(list(y_sim))) %>% 
  as_tsibble(index=ds, key = state) %>% 
  aggregate_key(total = sum(dist)) # <- error (I am not sure what to put here when my column is a dist object)