Time series forecasting improvements: metrics, dataset preprocessing & AutoGluon wrapper

shchur commented 1 year ago

This PR contains several improvements to time series forecasting functionality in AMLB and supersedes #507.

Refactor the wrapper for AutoGluon-TimeSeries
- The code now works with AutoGluon v0.8.* - checked local, Docker and AWS.
- Both AutoGluon-Tabular and AutoGluon-TimeSeries now correspond to a single framework autogluon with the same installation procedure. The correct predictor object is used depending on the DatasetType.
Task definition
- Now quantile_levels can be defined as a part of task definition, instead of always setting this argument to [0.1, 0.2, ..., 0.8, 0.9].
- More checks added to make sure that time series datasets are constructed correctly.
Refactor metric definitions to make them consistent with the textbook definitions.
- Always use the predictions column when computing point forecast metrics
- Add new metric: scaled pinball loss
Store auxiliary information needed to compute metrics (in-sample seasonal error & item IDs) as dataset attributes instead of delegating their computation to individual frameworks (as was done before).
- Some scaled metrics (e.g., mean absolute scaled error, scaled pinball loss) require information based on the historic time series values (before the forecast horizon begins). The Result object in AMLB does not have access to historic information, so we store this extra information as optional_columns in the predictions dataframe.
Explain how to define a time series forecasting task in HOWTO.md.

Work left for future PRs
Add more AutoML frameworks for forecasting
Add more time series datasets

PGijsbers commented 1 year ago

Thanks for updating the PR, sorry for our unresponsiveness on the old one. We currently have an open fedot integration PR and according to their readme they also do "time series prediction problems". Would you be willing to talk with @nicl-nno and see if the two of you can add time series support for the FEDOT integration as well? That would be a good test as to whether the abstraction levels/supported features are sufficient to generalize this beyond just AutoGluonTS.

I am planning to have a closer look at the FEDOT PR tomorrow, but based on a first impression I expect to be able to merge it this week.

shchur commented 1 year ago

Thanks! Sure, I will be happy to help with integrating the time series support for FEDOT.

The current PR does not really introduce new functionality and mostly refactors / fixes the existing code (to make it runnable), so I think it would still be useful, unless we plan to completely refactor all time series code in AMLB.

In my opinion, it would make sense to work on the FEDOT time series PR after #563 and this PR are merged. Current abstractions are readily compatible with other libraries like StatsForecast, GluonTS, AutoPyTorch, so I don't expect there to be a lot of work needed for FEDOT. But if any changes to the core TS logic will be necessary for FEDOT, I'm happy to take care of them. Does that sound reasonable to you?

nicl-nno commented 1 year ago

I agree with @shchur that is is better to separate the TS support for FEDOT to the new PR.

In case it's helpful: we have small self-developed benchmark for time series forecasting named pytsbe. It already contains integration with several AutoML solutions (FEDOT, LAMA, AutoTS, AutoGluon) and baselines. Maybe some code snippets will be useful for AMLB development.

shchur commented 1 year ago

@PGijsbers I have just double-checked, the code works locally, on Docker, and on AWS, so the PR is ready for review.

PGijsbers commented 1 year ago

I agree with @shchur that is is better to separate the TS support for FEDOT to the new PR.

Sure, that sounds fine. I'll have a look at both this week (hopefully starting tomorrow). Thanks!

shchur commented 1 year ago

Thank you for the review and sorry for missing the relevant discussion on the previous PR!

This PR only contains minimal changes necessary to make the code runnable & correct the metric definitions. I have not made any changes to the dataset schema - but I think that the current schema is not ideal and could be substantially improved. I would like to incorporate these changes before updating the documentation. Below is a short overview of the current time series dataset schema and a possible modification.

If these changes look good to you, I will incorporate them to the PR & document them in HOWTO.md.

Current dataset schema (master branch / this PR)

At the bare minimum, each time series task must be defined by

forecast_horizon_in_steps - a positive integer that specifies how many steps into the future need to be forecast for each time series (= number of future time series values that need to be predicted)
two CSV files
- train.csv with 3 columns (item_id - unique ID of each time series, timestamp of the observation, target time series value), with contents such as
item_id timestamp target

A 2020-01-01 2.0

A 2020-01-02 4.0

B 2022-05-01 1.0

B 2020-05-02 3.0

B 2020-05-03 1.0
- test.csv with contents such as
item_id timestamp target

A 2020-01-01 2.0

A 2020-01-02 4.0

A 2020-01-03 5.0

B 2022-05-01 1.0

B 2020-05-02 3.0

B 2020-05-03 1.0

B 2020-05-04 0.0

item_id	timestamp	target
A	2020-01-01	2.0
A	2020-01-02	4.0
B	2022-05-01	1.0
B	2020-05-02	3.0
B	2020-05-03	1.0

item_id	timestamp	target
A	2020-01-01	2.0
A	2020-01-02	4.0
A	2020-01-03	5.0
B	2022-05-01	1.0
B	2020-05-02	3.0
B	2020-05-03	1.0
B	2020-05-04	0.0

(in this example the dataset contains 2 time series ['A', 'B'] and forecast_horizon_in_steps = 1)

The train.csv and test.csv files must satisfy the following conditions

The timestamps for each individual time series are regularly sampled (e.g., daily, monthly, 30 min, frequency).
For each time series in train.csv, test.csv contains all past values + exactly forecast_horizon_in_steps future values corresponding to the forecast horizon.

This dataset schema has a major limitation: the forecast_horizon_in_steps is "baked in" into the train.csv/test.csv files. If we wanted to evaluate on the same dataset with a different value for forecast_horizon_in_steps, we would need to re-generate the train.csv file.

To address this limitation, I propose the following modification to the dataset schema.

Proposed changes (to be implemented in this PR or another PR, if you approve the design)

We only store a single file test.csv that is identical to the test file in the current schema.

The train set (called train.csv before) is generated automatically by AMLB by removing the last forecast_horizon_in_steps values from each time series in the test set.

If the specified forecast_horizon_in_steps is >= the length of the shortest time series in test.csv, an exception is raised.

This also makes it easy to evaluate over multiple folds for time series data: for each fold i in [0, 1, 2, 3, ...], we define

the test set by removing i * forecast_horizon_in_steps from each original time series in test.csv
the train set by removing (i+1) * forecast_horizon_in_steps from each original time series in test.csv

Please let me know if this design looks reasonable to you.

PGijsbers commented 1 year ago

This also makes it easy to evaluate over multiple folds for time series data: for each fold i in [0, 1, 2, 3, ...], we define

the test set by removing i * forecast_horizon_in_steps from each original time series in test.csv

the train set by removing (i+1) * forecast_horizon_in_steps from each original time series in test.csv

I think that all makes sense, but it seems the description is off to me. Given forecast_horizon_in_steps=K then you want to have:

rows	fold 0	fold 1	fold 2	...	fold n
1..K	train	train	train	...	train
K..2K	test	train	train	...	train
2..3K		test	train	...	train
3..4K			test	...	train
...
(n-1)K...nK				...	test

Right? The way I read your description the test set actually grows by forecast_horizon_in_steps each fold. And I image it is useful to specify a minimum train set greater than forecast_horizon_in_steps (e.g., training data is one year, and the folds consist of subsequent days (e.g., a forecast horizon of 3 days).

shchur commented 1 year ago

It seems that we are describing the same thing, but I was numbering the folds in reverse order 😅

rows	fold 0	fold 1	...	fold (n-1)	fold n
1..K	train	train	...	train	train
K..2K	train	train	...	train	test
2..3K	train	test	...	test
...
(n-2)K...(n-1)K	train	test
(n-1)K...nK	test

This means, fold 0 aways corresponds to evaluation on the data with the most recent K timestamps for each time series (which recovers the setting used in most forecasting competitions).

This way if the user decides to evaluate on folds [0, 1, 2, ..., n-1], then the n folds with the largest amount of training data will be used.

PGijsbers commented 1 year ago

the test set by removing i * forecast_horizon_in_steps from each original time series in test.csv

Threw me off since if i=0 then there is nothing removed? But you always want at least forecast_horizon_in_steps removed. But I guess it's just off-by-one error :) Starting with the full(est) data at fold 0 (as per your table) makes sense to me 👍

shchur commented 1 year ago

@PGijsbers I have moved all time series related logic to a new class TimeSeriesDataset that inherits from FileDataset and added instructions on how to add new time series tasks to HOWTO.md. Please let me know if this looks good you.

PGijsbers commented 1 year ago

We are currently experiencing some issues with the OpenML server. However, I can confirm that tests pass locally (except for three which explicitly don't use a cache). I'll wait a little longer to see if OpenML issues resolve, but if they do not I'll likely merge tomorrow (and if they do, then I'll run the tests and assuming all passes, merge it too :)).

shchur commented 1 year ago

Thank you @PGijsbers, I appreciate your help and support with this PR!

PGijsbers commented 1 year ago

Merged! Thank you @shchur (and also the people of the original PR) for being patient with us :)

PGijsbers commented 1 year ago

Just a heads up, I will be out of office in August so I may not respond in that period :)

Innixma commented 1 year ago

Thanks @PGijsbers for all the help provided in the reviews!! Excited to have this merged in.

openml / automlbenchmark