openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Time series forecasting improvements: metrics, dataset preprocessing & AutoGluon wrapper #564

Closed · shchur closed this 1 year ago

shchur commented 1 year ago

This PR contains several improvements to time series forecasting functionality in AMLB and supersedes #507.

PGijsbers commented 1 year ago

Thanks for updating the PR, and sorry for our unresponsiveness on the old one. We currently have an open FEDOT integration PR, and according to their README they also handle "time series prediction problems". Would you be willing to talk with @nicl-nno and see if the two of you can add time series support to the FEDOT integration as well? That would be a good test of whether the abstraction levels/supported features are sufficient to generalize this beyond just AutoGluonTS.

I am planning to have a closer look at the FEDOT PR tomorrow, but based on a first impression I expect to be able to merge it this week.

shchur commented 1 year ago

Thanks! Sure, I will be happy to help with integrating the time series support for FEDOT.

The current PR does not really introduce new functionality and mostly refactors / fixes the existing code (to make it runnable), so I think it would still be useful, unless we plan to completely refactor all time series code in AMLB.

In my opinion, it would make sense to work on the FEDOT time series PR after #563 and this PR are merged. The current abstractions are readily compatible with other libraries like StatsForecast, GluonTS, and AutoPyTorch, so I don't expect there to be a lot of work needed for FEDOT. But if any changes to the core TS logic turn out to be necessary for FEDOT, I'm happy to take care of them. Does that sound reasonable to you?

nicl-nno commented 1 year ago

I agree with @shchur that it is better to separate the TS support for FEDOT into a new PR.

In case it's helpful: we have a small self-developed benchmark for time series forecasting named pytsbe. It already contains integrations with several AutoML solutions (FEDOT, LAMA, AutoTS, AutoGluon) and baselines. Maybe some code snippets from it will be useful for AMLB development.

shchur commented 1 year ago

@PGijsbers I have just double-checked, the code works locally, on Docker, and on AWS, so the PR is ready for review.

PGijsbers commented 1 year ago

I agree with @shchur that it is better to separate the TS support for FEDOT into a new PR.

Sure, that sounds fine. I'll have a look at both this week (hopefully starting tomorrow). Thanks!

shchur commented 1 year ago

Thank you for the review and sorry for missing the relevant discussion on the previous PR!

This PR only contains the minimal changes necessary to make the code runnable and to correct the metric definitions. I have not made any changes to the dataset schema, but I think that the current schema is not ideal and could be substantially improved. I would like to incorporate these changes before updating the documentation. Below is a short overview of the current time series dataset schema and a possible modification.

If these changes look good to you, I will incorporate them into the PR & document them in HOWTO.md.


Current dataset schema (master branch / this PR)

At the bare minimum, each time series task must be defined by a train.csv file, a test.csv file, and a forecast_horizon_in_steps value (in this example the dataset contains 2 time series ['A', 'B'] and forecast_horizon_in_steps = 1).

The train.csv and test.csv files must satisfy the following conditions: both files contain the same set of time series, and each series in test.csv extends the corresponding series in train.csv by exactly forecast_horizon_in_steps values.

This dataset schema has a major limitation: the forecast_horizon_in_steps is "baked into" the train.csv/test.csv files. If we wanted to evaluate on the same dataset with a different value for forecast_horizon_in_steps, we would need to re-generate the train.csv file.

To address this limitation, I propose the following modification to the dataset schema.


Proposed changes (to be implemented in this PR or another PR, if you approve the design)

  • We only store a single file test.csv that is identical to the test file in the current schema.
  • The train set (called train.csv before) is generated automatically by AMLB by removing the last forecast_horizon_in_steps values from each time series in the test set.
  • If the specified forecast_horizon_in_steps is >= the length of the shortest time series in test.csv, an exception is raised.
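
To make this concrete, here is a minimal sketch of how the train set could be derived from test.csv. It assumes pandas, hypothetical column names (item_id, timestamp, target), and that rows within each series are sorted by timestamp; the actual AMLB implementation may differ.

```python
import pandas as pd

def derive_train_set(test_df: pd.DataFrame, horizon: int,
                     id_column: str = "item_id") -> pd.DataFrame:
    """Drop the last `horizon` rows of each time series in the test set."""
    shortest = test_df.groupby(id_column).size().min()
    if horizon >= shortest:
        raise ValueError(
            f"forecast_horizon_in_steps ({horizon}) must be smaller than "
            f"the shortest time series in test.csv ({shortest} steps)"
        )
    # Assumes rows within each series are already sorted by timestamp.
    return test_df.groupby(id_column, group_keys=False).apply(
        lambda series: series.iloc[:-horizon]
    )

# Toy example with 2 time series ['A', 'B'] and forecast_horizon_in_steps = 1
test_df = pd.DataFrame({
    "item_id": ["A"] * 3 + ["B"] * 3,
    "timestamp": list(pd.date_range("2023-01-01", periods=3)) * 2,
    "target": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
train_df = derive_train_set(test_df, horizon=1)  # drops the last row of A and B
```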

This also makes it easy to evaluate over multiple folds for time series data: for each fold i in [0, 1, 2, 3, ...], we define

  • the test set by removing i * forecast_horizon_in_steps from each original time series in test.csv
  • the train set by removing (i+1) * forecast_horizon_in_steps from each original time series in test.csv
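
Continuing the sketch above, a hypothetical fold_split helper implementing this definition (same assumed column names; the length check from derive_train_set is omitted here for brevity):

```python
import pandas as pd

def fold_split(full_df: pd.DataFrame, horizon: int, fold: int,
               id_column: str = "item_id"):
    """Return (train_df, test_df) for the given fold.

    Fold i: the test set drops the last i * horizon values of each series,
    and the train set drops the last (i + 1) * horizon values.
    """
    def trim(df: pd.DataFrame, n: int) -> pd.DataFrame:
        if n == 0:
            return df
        return df.groupby(id_column, group_keys=False).apply(
            lambda series: series.iloc[:-n]
        )

    return trim(full_df, (fold + 1) * horizon), trim(full_df, fold * horizon)
```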

Please let me know if this design looks reasonable to you.

PGijsbers commented 1 year ago

This also makes it easy to evaluate over multiple folds for time series data: for each fold i in [0, 1, 2, 3, ...], we define

  • the test set by removing i * forecast_horizon_in_steps from each original time series in test.csv
  • the train set by removing (i+1) * forecast_horizon_in_steps from each original time series in test.csv

I think that all makes sense, but the description seems off to me. Given forecast_horizon_in_steps=K, you want to have:

| rows | fold 0 | fold 1 | fold 2 | ... | fold n |
|---|---|---|---|---|---|
| 1..K | train | train | train | ... | train |
| K..2K | test | train | train | ... | train |
| 2K..3K | | test | train | ... | train |
| 3K..4K | | | test | ... | train |
| ... | | | | | |
| (n-1)K..nK | | | | | test |

Right? The way I read your description, the test set actually grows by forecast_horizon_in_steps each fold. And I imagine it is useful to specify a minimum train set size greater than forecast_horizon_in_steps (e.g., the training data is one year and the folds consist of subsequent days, with a forecast horizon of 3 days).

shchur commented 1 year ago

It seems that we are describing the same thing, but I was numbering the folds in reverse order šŸ˜…

| rows | fold 0 | fold 1 | ... | fold (n-1) | fold n |
|---|---|---|---|---|---|
| 1..K | train | train | ... | train | train |
| K..2K | train | train | ... | train | test |
| 2K..3K | train | train | ... | test | |
| ... | | | | | |
| (n-2)K..(n-1)K | train | test | | | |
| (n-1)K..nK | test | | | | |

This means that fold 0 always corresponds to evaluation on the data with the most recent K timestamps for each time series (which recovers the setting used in most forecasting competitions).

This way, if the user decides to evaluate on folds [0, 1, 2, ..., n-1], then the n folds with the largest amount of training data will be used.

PGijsbers commented 1 year ago

the test set by removing i * forecast_horizon_in_steps from each original time series in test.csv

That threw me off, since if i=0 then nothing is removed? But you always want at least forecast_horizon_in_steps removed. I guess it's just an off-by-one error :) Starting with the full(est) data at fold 0 (as per your table) makes sense to me šŸ‘

shchur commented 1 year ago

@PGijsbers I have moved all time series-related logic to a new class TimeSeriesDataset that inherits from FileDataset, and added instructions on how to add new time series tasks to HOWTO.md. Please let me know if this looks good to you.
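
For context, a rough illustration of the shape such a subclass might take; apart from the names TimeSeriesDataset and FileDataset, the import path, constructor signature, and attributes below are assumptions for illustration, not the actual AMLB API:

```python
from amlb.datasets.file import FileDataset  # hypothetical import path

class TimeSeriesDataset(FileDataset):
    """A FileDataset carrying the extra metadata a forecasting task needs."""

    def __init__(self, *args, forecast_horizon_in_steps: int,
                 id_column: str, timestamp_column: str, **kwargs):
        super().__init__(*args, **kwargs)  # tabular setup stays in FileDataset
        # Forecasting-specific metadata that tabular datasets do not need.
        self.forecast_horizon_in_steps = forecast_horizon_in_steps
        self.id_column = id_column                 # identifies each series
        self.timestamp_column = timestamp_column   # orders rows within a series
```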

PGijsbers commented 1 year ago

We are currently experiencing some issues with the OpenML server. However, I can confirm that the tests pass locally (except for three which explicitly don't use a cache). I'll wait a little longer to see if the OpenML issues resolve; if they do not, I'll likely merge tomorrow (and if they do, I'll run the tests and, assuming all pass, merge it too :)).

shchur commented 1 year ago

Thank you @PGijsbers, I appreciate your help and support with this PR!

PGijsbers commented 1 year ago

Merged! Thank you @shchur (and also the authors of the original PR) for being patient with us :)

PGijsbers commented 1 year ago

Just a heads up, I will be out of office in August so I may not respond in that period :)

Innixma commented 1 year ago

Thanks @PGijsbers for all the help provided in the reviews!! Excited to have this merged in.