openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

[PoC] AutoGluon TimeSeries Prototype #494

Closed Innixma closed 1 year ago

Innixma commented 2 years ago

[Don't merge this PR]

This PR is a proof of concept of time series data and framework support in AutoMLBenchmark.

To run, follow the instructions in the newly added frameworks/AutoGluonTS/README.md.

Innixma commented 2 years ago

@sebhrusen @PGijsbers

Some questions I have:

  1. [Solved in Innixma/automlbenchmark/pull/6] Is there a way to specify information such as prediction_length=5 on a per dataset basis? prediction_length is the look-ahead requirement for prediction and dictates the difficulty of the task. I'm wondering if I can specify it as part of the yaml file definition of the dataset in ts.yaml. Ditto for a couple other things like timestamp_column="Date" and item_id="name".

  2. [Solved in Innixma/automlbenchmark/pull/6] How can I update and specify the logic that does the final scoring based on predictions and truth? It needs to be altered to work with TimeSeries. Additionally, it may take a different form, for example when a metric requires quantile predictions to compute.
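
For illustration, the per-dataset settings from the first question could be carried in a small validated structure once a dataset entry has been loaded. This is only a sketch: `TimeSeriesTaskSettings`, `parse_task_settings`, and the shape of `example_entry` are hypothetical and are not taken from AutoMLBenchmark or from the linked PR; only the three setting names and their example values come from the question above.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TimeSeriesTaskSettings:
    """Per-dataset forecasting settings (hypothetical container, not AMLB's API)."""
    prediction_length: int   # look-ahead horizon; larger values make the task harder
    timestamp_column: str    # column holding the time index
    item_id: str             # column identifying each individual series


def parse_task_settings(entry: Mapping[str, Any]) -> TimeSeriesTaskSettings:
    """Validate one dataset entry, e.g. a mapping loaded from a YAML definition."""
    required = ("prediction_length", "timestamp_column", "item_id")
    missing = [key for key in required if key not in entry]
    if missing:
        raise KeyError(f"dataset entry {entry.get('name')!r} is missing: {missing}")
    horizon = int(entry["prediction_length"])
    if horizon < 1:
        raise ValueError("prediction_length must be a positive integer")
    return TimeSeriesTaskSettings(
        prediction_length=horizon,
        timestamp_column=str(entry["timestamp_column"]),
        item_id=str(entry["item_id"]),
    )


# Assumed shape of one dataset entry; the setting values are the ones quoted above.
example_entry = {
    "name": "covid",
    "prediction_length": 5,
    "timestamp_column": "Date",
    "item_id": "name",
}
settings = parse_task_settings(example_entry)
print(settings)
```

Failing early on a missing or non-positive `prediction_length` keeps a misconfigured dataset from silently running with a default horizon, which would change the difficulty of the task.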

PGijsbers commented 2 years ago

@sebhrusen I'd appreciate it if you can have a look, I have very limited availability due to a paper deadline.

sebhrusen commented 2 years ago

@PGijsbers sunny holidays right now: will look at it when I'm back next week.

Innixma commented 2 years ago

Update: Several of the TODO / FIXME comments have been addressed by @limpbot in https://github.com/Innixma/automlbenchmark/pull/6

Innixma commented 2 years ago

Code example:

python3 runbenchmark.py autogluonts ts test

Log output:

```
Running benchmark `autogluonts` on `ts` framework in `local` mode.
Loading frameworks definitions from ['/Users/neerick/workspace/code/automlbenchmark/resources/frameworks.yaml'].
Loading benchmark constraint definitions from ['/Users/neerick/workspace/code/automlbenchmark/resources/constraints.yaml'].
Loading benchmark definitions from /Users/neerick/workspace/code/automlbenchmark/resources/benchmarks/ts.yaml.
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 21.7%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 64.3%

-----------------------------------------------
Starting job local.ts.test.covid.0.AutoGluonTS.
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 55.8%
Assigning 4 cores (total=12) for new task covid.
Assigning 3803 MB (total=16384 MB) for new covid task.
Using training set /Users/neerick/.openml/train.csv with test set /Users/neerick/.openml/test.csv.
Running task covid on framework AutoGluonTS with config:
TaskConfig({'framework': 'AutoGluonTS', 'framework_params': {}, 'framework_version': '0.5.2', 'type': 'timeseries', 'name': 'covid', 'fold': 0, 'metric': 'mase', 'metrics': ['mase', 'mape', 'smape', 'rmse', 'mse', 'nrmse', 'wape', 'ncrps'], 'seed': 949238273, 'job_timeout_seconds': 1200, 'max_runtime_seconds': 600, 'cores': 4, 'max_mem_size_mb': 3803, 'min_vol_size_mb': -1, 'input_dir': '/Users/neerick/.openml', 'output_dir': '/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514', 'output_predictions_file': '/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514/predictions/covid/0/predictions.csv', 'ext': {}, 'type_': 'timeseries', 'output_metadata_file': '/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514/predictions/covid/0/metadata.json'})
Running cmd `/Users/neerick/workspace/code/automlbenchmark/frameworks/AutoGluonTS/venv/bin/python -W ignore /Users/neerick/workspace/code/automlbenchmark/frameworks/AutoGluonTS/exec.py`

**** AutoGluon TimeSeries [v0.5.2] ****

Warning: path already exists! This predictor may overwrite an existing predictor! path="/var/folders/cn/t0r03w5d3nldq9n5h65wd29c0000gs/T/tmpsbe45bwi/"
Learner random seed set to 0
================ TimeSeriesPredictor ================
TimeSeriesPredictor.fit() called
Fitting with arguments:
{'evaluation_metric': 'MASE', 'hyperparameter_tune_kwargs': None, 'hyperparameters': 'default', 'prediction_length': 30, 'target_column': 'ConfirmedCases', 'time_limit': 600}
Provided training data set with 22536 rows, 313 items. Average time series length is 72.0.
Training artifacts will be saved to: /private/var/folders/cn/t0r03w5d3nldq9n5h65wd29c0000gs/T/tmpsbe45bwi
=====================================================
Validation data is None, will hold the last prediction_length 30 time steps out to use as validation set.
AutoGluon will save models to /var/folders/cn/t0r03w5d3nldq9n5h65wd29c0000gs/T/tmpsbe45bwi/

Starting training. Start time is 2022-09-21 09:25:33
Models that will be trained: ['AutoETS', 'ARIMA', 'SimpleFeedForward', 'DeepAR', 'Transformer']
Training timeseries model AutoETS. Training for up to 599.36s of the 599.36s of remaining time.
    -4261.6502 = Validation score (-MASE)
    7.06 s = Training runtime
    23.90 s = Validation (prediction) runtime
Training timeseries model ARIMA. Training for up to 568.20s of the 568.20s of remaining time.
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 22.6%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 64.1%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 55.8%
    -4291.2952 = Validation score (-MASE)
    36.87 s = Training runtime
    49.88 s = Validation (prediction) runtime
Training timeseries model SimpleFeedForward. Training for up to 480.49s of the 480.49s of remaining time.
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 29.3%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 66.6%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 55.8%
    -4319.9065 = Validation score (-MASE)
    100.00 s = Training runtime
    2.43 s = Validation (prediction) runtime
Training timeseries model DeepAR. Training for up to 378.04s of the 378.04s of remaining time.
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 19.6%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 66.3%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 55.8%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 14.5%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 65.1%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 56.0%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 12.8%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 65.2%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 56.0%
    -4332.0235 = Validation score (-MASE)
    380.97 s = Training runtime
    10.45 s = Validation (prediction) runtime
Stopping training due to lack of time remaining. Time left: -13.39 seconds
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 12.6%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 68.7%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 56.0%
Fitting simple weighted ensemble.
    -4261.6502 = Validation score (-MASE)
    138.62 s = Training runtime
    23.90 s = Validation (prediction) runtime
Training complete. Models trained: ['AutoETS', 'ARIMA', 'SimpleFeedForward', 'DeepAR', 'WeightedEnsemble']
Total runtime: 816.54 s
Best model: AutoETS
Best model score: -4261.6502
Model not specified in predict, will default to the model with the best validation score: AutoETS
Different set of items than those provided during training were provided for prediction. The model AutoETS will be re-trained on newly provided data
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 14.2%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 60.5%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 56.0%
                              mean        0.1  ...        0.8        0.9
item_id      timestamp                         ...
Afghanistan_ 2020-03-23  43.673204  40.929207  ...  45.475244  46.417202
             2020-03-24  47.477861  43.269943  ...  50.241288  51.685780
             2020-03-25  51.282519  45.705039  ...  54.945364  56.859998
             2020-03-26  55.087176  48.146160  ...  59.645483  62.028192
             2020-03-27  58.891833  50.563691  ...  64.361095  67.219975
...                 ...        ...        ...  ...        ...        ...
Zimbabwe_    2020-04-17  16.572826   8.359642  ...  21.966592  24.786010
             2020-04-18  17.094855   8.458588  ...  22.766468  25.731121
             2020-04-19  17.616884   8.550552  ...  23.570930  26.683216
             2020-04-20  18.138913   8.635642  ...  24.379906  27.642183
             2020-04-21  18.660942   8.713965  ...  25.193326  28.607919

[9390 rows x 10 columns]
[43.67320426 47.47786141 51.28251855 ... 17.61688379 18.13891286 18.66094192]
[40. 74. 84. ... 25. 25. 28.]
Additional data provided, testing on additional data. Resulting leaderboard will be sorted according to test score (`score_test`).
Different set of items than those provided during training were provided for prediction. The model AutoETS will be re-trained on newly provided data
Different set of items than those provided during training were provided for prediction. The model ARIMA will be re-trained on newly provided data
Different set of items than those provided during training were provided for prediction. The model AutoETS will be re-trained on newly provided data
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 12.4%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 62.7%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 56.0%
               model   score_test     score_val  pred_time_test  pred_time_val  fit_time_marginal  fit_order
0   WeightedEnsemble  -444.037098  -4261.650234       29.051365      23.899191         138.624613          5
1            AutoETS  -444.037098  -4261.650234       25.333331      23.899191           7.057499          1
2              ARIMA  -475.878400  -4291.295201       51.673333      49.880237          36.868394          2
3  SimpleFeedForward  -526.892250  -4319.906528        1.442273       2.432864          99.998205          3
4             DeepAR  -591.905430  -4332.023525        9.755238      10.447713         380.970382          4
Terminating process psutil.Process(pid=25939, name='Python', status='running', started='09:27:32').
Killing process psutil.Process(pid=25939, name='Python', status='running', started='09:27:32').
Early stopping based on learning rate scheduler callback (min_lr was reached).
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
    cache[rtype].remove(name)
KeyError: '/loky-25767-o7csihc6'
Predictions preview:
    predictions  truth         0.1         0.2         0.3         0.4         0.5         0.6         0.7         0.8         0.9  y_past_period_error
 0    43.673204   40.0   40.929207   41.871165   42.550383   43.130749   43.673204   44.215659   44.796026   45.475244   46.417202             0.666667
 1    47.477861   74.0   43.269943   44.714435   45.756015   46.646007   47.477861   48.309715   49.199707   50.241288   51.685780             0.666667
 2    51.282519   84.0   45.705039   47.619673   49.000259   50.179919   51.282519   52.385118   53.564778   54.945364   56.859998             0.666667
 3    55.087176   94.0   48.146160   50.528868   52.246968   53.715022   55.087176   56.459330   57.927383   59.645483   62.028192             0.666667
 4    58.891833  110.0   50.563691   53.422571   55.484025   57.245461   58.891833   60.538205   62.299641   64.361095   67.219975             0.666667
 5    62.696490  110.0   52.944972   56.292468   58.706248   60.768734   62.696490   64.624246   66.686732   69.100512   72.448007             0.666667
 6    66.501147  120.0   55.284052   59.134650   61.911203   64.283664   66.501147   68.718630   71.091092   73.867644   77.718243             0.666667
 7    70.305804  170.0   57.578101   61.947260   65.097731   67.789693   70.305804   72.821916   75.513877   78.664349   83.033508             0.666667
 8    74.110461  174.0   59.825896   64.729494   68.265333   71.286577   74.110461   76.934346   79.955590   83.491429   88.395027             0.666667
 9    77.915119  237.0   62.027091   67.481124   71.413866   74.774249   77.915119   81.055988   84.416371   88.349113   93.803146             0.666667
10    81.719776  273.0   64.181835   70.202250   74.543392   78.252739   81.719776   85.186813   88.896159   93.237302   99.257717             0.666667
11    85.524433  281.0   66.290563   72.893156   77.654089   81.722132   85.524433   89.326734   93.394776   98.155710  104.758302             0.666667
12    89.329090  299.0   68.353876   75.554236   80.746202   85.182546   89.329090   93.475634   97.911978  103.103944  110.304304             0.666667
13    93.133747  349.0   70.372463   78.185944   83.820014   88.634119   93.133747   97.633375  102.447480  108.081550  115.895032             0.666667
14    96.938404  367.0   72.347060   80.788763   86.875826   92.076996   96.938404  101.799813  107.000983  113.088045  121.529748             0.666667
15   100.743061  423.0   74.278423   83.363190   89.913946   95.511325  100.743061  105.974797  111.572177  118.122933  127.207700             0.666667
16   104.547719  444.0   76.167305   85.909718   92.934683   98.937257  104.547719  110.158180  116.160754  123.185719  132.928132             0.666667
17   108.352376  484.0   78.014453   88.428839   95.938344  102.354939  108.352376  114.349813  120.766408  128.275912  138.690298             0.666667
18   112.157033  521.0   79.820595   90.921030   98.925224  105.764514  112.157033  118.549552  125.388841  133.393036  144.493471             0.666667
19   115.961690  555.0   81.586437   93.386755  101.895615  109.166122  115.961690  122.757258  130.027765  138.536625  150.336943             0.666667
Predictions saved to `/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514/predictions/covid/0/predictions.csv`.
Loading metadata from `/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514/predictions/covid/0/metadata.json`.
fatal: not a git repository (or any of the parent directories): .git
Loading predictions from `/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514/predictions/covid/0/predictions.csv`.
Metric scores: { 'app_version': 'dev [NA, NA, NA]',
    'constraint': 'test',
    'duration': nan,
    'fold': 0,
    'framework': 'AutoGluonTS',
    'id': 'covid',
    'info': None,
    'mape': 0.47176599878084985,
    'mase': 444.03709806992947,
    'metric': 'neg_mase',
    'mode': 'local',
    'models_count': 5,
    'mse': 66512955.519554704,
    'ncrps': 3.7180833818575727,
    'nrmse': 1.8264137433841354,
    'params': '',
    'predict_duration': 24.022034168243408,
    'result': -444.03709806992947,
    'rmse': 8155.547530335085,
    'seed': 949238273,
    'smape': 0.6795078347334532,
    'task': 'covid',
    'training_duration': 817.2716138362885,
    'type': 'timeseries',
    'utc': '2022-09-21T16:41:38',
    'version': '0.5.2',
    'wape': 0.4395118445505013}
Job `local.ts.test.covid.0.AutoGluonTS` executed in 984.191 seconds.
All jobs executed in 984.232 seconds.
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] CPU Utilization: 16.7%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Memory Usage: 59.6%
[MONITORING] [local.ts.test.covid.0.AutoGluonTS] Disk Usage: 56.0%
Processing results for autogluonts.ts.test.local.20220921T162514
Scores saved to `/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514/scores/AutoGluonTS.benchmark_ts.csv`.
Scores saved to `/Users/neerick/workspace/code/tmp_amlb/results/autogluonts.ts.test.local.20220921T162514/scores/results.csv`.
Scores saved to `/Users/neerick/workspace/code/tmp_amlb/results/results.csv`.
Summing up scores for current run:
   id   task  fold    framework  constraint    result    metric  duration       seed
covid  covid     0  AutoGluonTS        test  -444.037  neg_mase     984.2  949238273
```

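
The metric scores in the log report `mase` as a positive value and `result` as its negation under the metric name `neg_mase`. A minimal sketch of that relationship follows, assuming that the `y_past_period_error` column of the predictions preview is the in-sample naive error used to scale the absolute error; the thread does not confirm this, nor how the real scoring code aggregates across items. The helper names `mase` and `as_result` are made up for this example.

```python
from typing import Sequence


def mase(truth: Sequence[float], predictions: Sequence[float],
         past_period_error: Sequence[float]) -> float:
    """Mean of |truth - prediction|, scaled row-wise by an in-sample naive error."""
    if not (len(truth) == len(predictions) == len(past_period_error)):
        raise ValueError("all inputs must have the same length")
    if len(truth) == 0:
        raise ValueError("inputs must not be empty")
    # Assumption: every scaling term is non-zero (a constant history would give 0).
    scaled = [abs(t - p) / s for t, p, s in zip(truth, predictions, past_period_error)]
    return sum(scaled) / len(scaled)


def as_result(metric_value: float) -> float:
    """Negate a lower-is-better metric so that higher is better, as in 'neg_mase'."""
    return -metric_value


# First three rows of the predictions preview in the log.
truth = [40.0, 74.0, 84.0]
predictions = [43.673204, 47.477861, 51.282519]
past_period_error = [0.666667, 0.666667, 0.666667]

score = mase(truth, predictions, past_period_error)
print(f"mase={score:.4f} result={as_result(score):.4f}")
```

Three rows cannot reproduce the reported 444.037, which covers the whole test set; the sketch only shows the shape of the computation and the sign flip.
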
Innixma commented 2 years ago

@sebhrusen Sorry to ping but would you be interested in reviewing this PR? A large chunk of the logic was written by @limpbot who is interning with us currently, and it would be great if he received feedback so as not to block his time-series benchmarking efforts.

sebhrusen commented 2 years ago

@Innixma I'm looking at it now and will make a full review before Monday. Beyond implementation details and modularity, I mainly want to be sure that it is not designed primarily around AG's timeseries implementation and can be generalized to other implementations (it would be nice to have an alternative implementation). For now, to satisfy your needs, I'll mainly ensure that the changes are limited to data loading + the AG implementation as much as possible.

Innixma commented 2 years ago

Sounds good! I agree that we should make sure the input/output/scoring definitions are generic and not AG specific. Perhaps the AutoPyTorch-TimeSeries folks (@dengdifan) would be interested in reviewing / trying to add their AutoML system as a framework extension to this logic?
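
One way to picture a framework-agnostic output definition is a plain container that any forecasting framework could fill in and that flattens to the column layout seen in the predictions preview (`predictions`, `truth`, then one column per quantile level). The sketch below is hypothetical: `ForecastOutput` and its methods are not part of AutoMLBenchmark or AutoGluon, and the sample values are copied from the first two preview rows in the log.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ForecastOutput:
    """Framework-agnostic forecast for one test window (hypothetical structure)."""
    predictions: Sequence[float]              # point forecasts, one per time step
    truth: Sequence[float]                    # observed values, same order
    quantiles: Dict[float, Sequence[float]]   # quantile level -> forecasts per step

    def validate(self) -> None:
        n = len(self.predictions)
        if len(self.truth) != n:
            raise ValueError("truth and predictions differ in length")
        for level, values in self.quantiles.items():
            if not 0.0 < level < 1.0:
                raise ValueError(f"quantile level {level} is outside (0, 1)")
            if len(values) != n:
                raise ValueError(f"quantile {level} has {len(values)} values, expected {n}")
        # Quantile forecasts must not decrease as the level increases.
        levels = sorted(self.quantiles)
        for step in range(n):
            row = [self.quantiles[q][step] for q in levels]
            if any(a > b for a, b in zip(row, row[1:])):
                raise ValueError(f"quantile forecasts cross at step {step}")

    def to_rows(self) -> List[Dict[str, float]]:
        """Flatten to one dict per time step: predictions, truth, then each level."""
        self.validate()
        levels = sorted(self.quantiles)
        rows = []
        for step, (pred, true) in enumerate(zip(self.predictions, self.truth)):
            row = {"predictions": pred, "truth": true}
            row.update({str(q): self.quantiles[q][step] for q in levels})
            rows.append(row)
        return rows


output = ForecastOutput(
    predictions=[43.673204, 47.477861],
    truth=[40.0, 74.0],
    quantiles={
        0.1: [40.929207, 43.269943],
        0.5: [43.673204, 47.477861],
        0.9: [46.417202, 51.685780],
    },
)
for row in output.to_rows():
    print(row)
```

Keeping validation inside the container means the scoring side can rely on consistent lengths and ordered quantiles regardless of which framework produced the forecast.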

Innixma commented 2 years ago

Thanks @sebhrusen for the detailed review!

@limpbot would you like to have a go at addressing some of the comments? Feel free to send a PR to my branch as you did in your prior update.

Innixma commented 1 year ago

I merged @limpbot's changes into this branch via his PR: https://github.com/Innixma/automlbenchmark/pull/7

@sebhrusen The branch should be ready for a 2nd round of review.

Innixma commented 1 year ago

Thanks @sebhrusen for the detailed review! @limpbot has addressed some final comments in the latest update, which should also fix the autogluon.tabular error you mentioned.

PGijsbers commented 1 year ago

> I think it will be interesting for us (cc: @PGijsbers) to start thinking about supporting new kind of tasks

I missed the "mention" ping (I thought it just said "subscribed"); sorry I didn't check earlier. Definitely. I want to first wait for the JMLR reviews and finish "that part of the project", but creating a more flexible environment for people to add new types of tasks would be a great next step that invites more people to use (and extend) the benchmark tool.

Thanks Innixma and Limpbot for your contribution 🎉