microsoft / ProbTS

ProbTS is a benchmarking toolkit for time series forecasting.
MIT License
107 stars, 12 forks

Test Batch Size Affects Evaluation Metrics in Time-Series Foundation Model #13

Closed zhangzw16 closed 4 months ago

zhangzw16 commented 4 months ago

Description

The results I obtain for the time-series foundation models (TSFMs) do not match those presented in the paper, even though I followed the environment configuration specified in the README.

Specifically, the data.test_batch_size parameter significantly impacts the evaluation metrics, leading to discrepancies between using batch sizes of 1 and 64.

Reproduction Steps

DATA_DIR=./datasets
LOG_DIR=./exps

DATASET='ettm1'
CTX_LEN=96
PRED_LEN=96

MODEL='timer'
python run.py --config config/tsfm/${MODEL}.yaml --seed_everything 0  \
        --data.data_manager.init_args.path ${DATA_DIR} \
        --trainer.default_root_dir ${LOG_DIR} \
        --data.data_manager.init_args.split_val true \
        --data.data_manager.init_args.dataset ${DATASET} \
        --data.data_manager.init_args.context_length ${CTX_LEN} \
        --data.data_manager.init_args.prediction_length ${PRED_LEN} \
        --data.test_batch_size 1

Results on my env

results for --data.test_batch_size 1

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_CRPS         │    0.38880524039268494    │
│       test_CRPS-Sum       │    0.3483585715293884     │
│         test_MASE         │            inf            │
│         test_MSE          │    14.216867446899414     │
│       test_MSE-Sum        │    193.04457092285156     │
│          test_ND          │    0.38880524039268494    │
│        test_ND-Sum        │    0.3483585715293884     │
│        test_NRMSE         │    0.7435467839241028     │
│      test_NRMSE-Sum       │    0.4670811593532562     │
│      test_norm_CRPS       │    0.5931107401847839     │
│    test_norm_CRPS-Sum     │    0.9076345562934875     │
│      test_norm_MASE       │            inf            │
│       test_norm_MSE       │    0.5778313875198364     │
│     test_norm_MSE-Sum     │     8.684906005859375     │
│       test_norm_ND        │    0.5931107401847839     │
│     test_norm_ND-Sum      │    0.9076345562934875     │
│      test_norm_NRMSE      │    0.9068989157676697     │
│    test_norm_NRMSE-Sum    │    1.1654224395751953     │
└───────────────────────────┴───────────────────────────┘

results for --data.test_batch_size 64

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_CRPS         │    0.3870095908641815     │
│       test_CRPS-Sum       │    0.3348708748817444     │
│         test_MASE         │     4.121578693389893     │
│         test_MSE          │    13.889582633972168     │
│       test_MSE-Sum        │    188.11325073242188     │
│          test_ND          │    0.3870095908641815     │
│        test_ND-Sum        │    0.3348708748817444     │
│        test_NRMSE         │     0.804040253162384     │
│      test_NRMSE-Sum       │    0.4874984920024872     │
│      test_norm_CRPS       │     0.588572084903717     │
│    test_norm_CRPS-Sum     │    0.8631325364112854     │
│      test_norm_MASE       │     3.316856861114502     │
│       test_norm_MSE       │    0.5665557384490967     │
│     test_norm_MSE-Sum     │     8.484588623046875     │
│       test_norm_ND        │     0.588572084903717     │
│     test_norm_ND-Sum      │    0.8631325364112854     │
│      test_norm_NRMSE      │    0.9544867873191833     │
│    test_norm_NRMSE-Sum    │    1.1842191219329834     │
└───────────────────────────┴───────────────────────────┘

Observed Behavior: The evaluation metrics differ significantly between --data.test_batch_size 1 and --data.test_batch_size 64.

Expected Behavior: The evaluation metrics should be consistent regardless of the test batch size to ensure reproducibility as stated in the paper.

zhangzw16 commented 4 months ago

For the chronos model, the main discrepancy between my results and those in the paper stems from the var_specific_norm parameter. After refactoring the code, manually setting this parameter to false yields results close to those in the paper. However, the results still vary with test_batch_size.

In the paper, the chronos NMAE on the ettm1 dataset with 96->96 is 0.393. With var_specific_norm=true, the reproduced value is around 0.42. With var_specific_norm=false, the reproduced value (test_ND in the tables below) is 0.394 with test_batch_size=1 and 0.395 with test_batch_size=8, so the NMAE difference between batch sizes is not significant.

detailed results for chronos ettm1 96->96 (batch 1 vs 8)

results for `--data.test_batch_size=1`

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_CRPS         │    0.33056917786598206    │
│       test_CRPS-Sum       │    0.3184971213340759     │
│         test_MASE         │            inf            │
│         test_MSE          │    16.776161193847656     │
│       test_MSE-Sum        │    247.76821899414062     │
│          test_ND          │    0.3941699266433716     │
│        test_ND-Sum        │    0.3692256510257721     │
│        test_NRMSE         │    0.7903082966804504     │
│      test_NRMSE-Sum       │    0.5236336588859558     │
│      test_norm_CRPS       │    0.39110830426216125    │
│    test_norm_CRPS-Sum     │     0.756203293800354     │
│      test_norm_MASE       │            inf            │
│       test_norm_MSE       │    0.3234284222126007     │
│     test_norm_MSE-Sum     │     4.776735305786133     │
│       test_norm_ND        │    0.46682143211364746    │
│     test_norm_ND-Sum      │    0.8863762617111206     │
│      test_norm_NRMSE      │    0.9276809096336365     │
│    test_norm_NRMSE-Sum    │    1.2148205041885376     │
└───────────────────────────┴───────────────────────────┘
```

results for `--data.test_batch_size=8`

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_CRPS         │    0.33104127645492554    │
│       test_CRPS-Sum       │    0.3071160614490509     │
│         test_MASE         │     4.243149280548096     │
│         test_MSE          │    16.599308013916016     │
│       test_MSE-Sum        │     245.1790008544922     │
│          test_ND          │    0.39517465233802795    │
│        test_ND-Sum        │    0.3562717139720917     │
│        test_NRMSE         │    0.8614497780799866     │
│      test_NRMSE-Sum       │    0.5519154667854309     │
│      test_norm_CRPS       │    0.3956098258495331     │
│    test_norm_CRPS-Sum     │    0.7043299674987793     │
│      test_norm_MASE       │     4.243149280548096     │
│       test_norm_MSE       │    0.32001885771751404    │
│     test_norm_MSE-Sum     │     4.726817607879639     │
│       test_norm_ND        │    0.4719715714454651     │
│     test_norm_ND-Sum      │    0.8175352811813354     │
│      test_norm_NRMSE      │    1.0279762744903564     │
│    test_norm_NRMSE-Sum    │    1.2731246948242188     │
└───────────────────────────┴───────────────────────────┘
```

zhangzw16 commented 4 months ago

test_batch_size after PR #17

exp: moirai etth1 96->24

reproduce with this script

```bash
export CUDA_VISIBLE_DEVICES=0

DATA_DIR=./datasets
LOG_DIR=./exps

# MOIRAI
MODEL='moirai'
for DATASET in 'etth1'; do
    for CTX_LEN in 96; do
        for PRED_LEN in 24; do
            for test_batch_size in 1 4 8 16; do
                python run.py --config config/tsfm/${MODEL}/context_${CTX_LEN}/${DATASET}.yaml --seed_everything 0 \
                    --data.data_manager.init_args.path ${DATA_DIR} \
                    --trainer.default_root_dir ${LOG_DIR} \
                    --data.data_manager.init_args.dataset ${DATASET} \
                    --data.data_manager.init_args.prediction_length ${PRED_LEN} \
                    --data.test_batch_size ${test_batch_size}
            done
        done
    done
done
```

| metrics | tb=1 | tb=4 | tb=8 | tb=16 |
| --- | --- | --- | --- | --- |
| test_CRPS | 0.237 | 0.237 | 0.237 | 0.237 |
| test_CRPS-Sum | 0.206 | 0.207 | 0.207 | 0.206 |
| test_MASE | 1.683 | 1.678 | 1.687 | 1.683 |
| test_MSE | 8.441 | 8.534 | 8.417 | 8.477 |
| test_MSE-Sum | 116.864 | 119.159 | 117.220 | 116.743 |
| test_ND | 0.293 | 0.293 | 0.296 | 0.293 |
| test_ND-Sum | 0.256 | 0.258 | 0.260 | 0.257 |
| test_NRMSE | 0.559 | 0.563 | 0.561 | 0.563 |
| test_NRMSE-Sum | 0.342 | 0.345 | 0.343 | 0.342 |
| test_norm_CRPS | 0.296 | 0.297 | 0.297 | 0.297 |
| test_norm_CRPS-Sum | 0.603 | 0.605 | 0.606 | 0.602 |
| test_norm_MASE | 1.683 | 1.678 | 1.687 | 1.683 |
| test_norm_MSE | 0.163 | 0.165 | 0.162 | 0.163 |
| test_norm_MSE-Sum | 2.253 | 2.298 | 2.260 | 2.251 |
| test_norm_ND | 0.366 | 0.366 | 0.369 | 0.366 |
| test_norm_ND-Sum | 0.744 | 0.755 | 0.757 | 0.747 |
| test_norm_NRMSE | 0.697 | 0.703 | 0.700 | 0.702 |
| test_norm_NRMSE-Sum | 0.981 | 0.991 | 0.985 | 0.979 |

exp: timer etth1 96->24

reproduce with this script

```bash
export CUDA_VISIBLE_DEVICES=0

DATA_DIR=./datasets
LOG_DIR=./exps

MODEL='timer'
for DATASET in 'etth1'; do
    for CTX_LEN in 96; do
        for PRED_LEN in 24; do
            for test_batch_size in 1 4 8 16; do
                python run.py --config config/tsfm/${MODEL}.yaml --seed_everything 0 \
                    --data.data_manager.init_args.path ${DATA_DIR} \
                    --trainer.default_root_dir ${LOG_DIR} \
                    --data.data_manager.init_args.dataset ${DATASET} \
                    --data.data_manager.init_args.context_length ${CTX_LEN} \
                    --data.data_manager.init_args.prediction_length ${PRED_LEN} \
                    --data.test_batch_size ${test_batch_size}
            done
        done
    done
done
```

| Test metric | tb=1 | tb=4 | tb=8 | tb=16 |
| --- | --- | --- | --- | --- |
| test_CRPS | 0.302 | 0.313 | 0.315 | 0.315 |
| test_CRPS-Sum | 0.264 | 0.268 | 0.270 | 0.267 |
| test_MASE | 1.663 | 1.702 | 1.700 | 1.716 |
| test_MSE | 8.163 | 8.609 | 8.757 | 8.714 |
| test_MSE-Sum | 104.652 | 110.948 | 111.694 | 112.216 |
| test_ND | 0.302 | 0.313 | 0.315 | 0.315 |
| test_ND-Sum | 0.264 | 0.268 | 0.270 | 0.267 |
| test_NRMSE | 0.552 | 0.572 | 0.578 | 0.576 |
| test_NRMSE-Sum | 0.335 | 0.345 | 0.346 | 0.346 |
| test_norm_CRPS | 0.495 | 0.509 | 0.512 | 0.513 |
| test_norm_CRPS-Sum | 0.816 | 0.838 | 0.819 | 0.819 |
| test_norm_MASE | 1.663 | 1.702 | 1.700 | 1.716 |
| test_norm_MSE | 0.382 | 0.401 | 0.399 | 0.404 |
| test_norm_MSE-Sum | 5.425 | 5.706 | 5.482 | 5.705 |
| test_norm_ND | 0.495 | 0.509 | 0.512 | 0.513 |
| test_norm_ND-Sum | 0.816 | 0.838 | 0.819 | 0.819 |
| test_norm_NRMSE | 0.743 | 0.768 | 0.771 | 0.772 |
| test_norm_NRMSE-Sum | 0.983 | 1.016 | 0.991 | 0.999 |

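
To put a number on how much the metrics move with the batch size, here is a small sketch that computes the largest relative deviation from the tb=1 value for a few rows of the two tables above. The values are copied from those tables; the `REPORTED` dictionary and the `max_relative_deviation` helper are hypothetical names used only for this illustration and are not part of ProbTS.

```python
# Values copied from the moirai and timer tables (columns tb=1, 4, 8, 16).
REPORTED = {
    "moirai": {
        "test_CRPS": [0.237, 0.237, 0.237, 0.237],
        "test_MASE": [1.683, 1.678, 1.687, 1.683],
        "test_MSE": [8.441, 8.534, 8.417, 8.477],
    },
    "timer": {
        "test_CRPS": [0.302, 0.313, 0.315, 0.315],
        "test_MASE": [1.663, 1.702, 1.700, 1.716],
        "test_MSE": [8.163, 8.609, 8.757, 8.714],
    },
}


def max_relative_deviation(values):
    """Largest |v - v_tb1| / v_tb1 over the other batch sizes (tb=1 is the baseline)."""
    baseline = values[0]
    return max(abs(v - baseline) for v in values[1:]) / baseline


deviations = {
    model: {metric: max_relative_deviation(vals) for metric, vals in metrics.items()}
    for model, metrics in REPORTED.items()
}

for model, metrics in deviations.items():
    for metric, dev in metrics.items():
        print(f"{model:7s} {metric:10s} max deviation from tb=1: {dev:.2%}")
```

For these three metrics, the moirai values stay within about 1% of the tb=1 values, while the timer values move by roughly 3% to 7%.
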
xumwen commented 4 months ago

The dependence of the metrics on the test batch size may be caused by batch-level operators in the model, such as batchnorm. This problem does not occur with the linear model.
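
As a rough illustration of that mechanism, the toy sketch below compares an operator that normalizes with the statistics of the current batch against a purely per-sample linear map. Both functions are hypothetical stand-ins written in plain Python, not ProbTS or model code, and the thread does not confirm whether the models discussed here actually compute statistics over the test batch.

```python
from statistics import fmean, pstdev


def batch_normalize(batch, eps=1e-5):
    """Batch-level operator: normalize every value with the mean and
    standard deviation computed over the WHOLE batch."""
    flat = [v for sample in batch for v in sample]
    mean = fmean(flat)
    std = (pstdev(flat) ** 2 + eps) ** 0.5
    return [[(v - mean) / std for v in sample] for sample in batch]


def linear_model(batch, weight=0.5, bias=1.0):
    """Per-sample operator: each output depends only on its own input."""
    return [[weight * v + bias for v in sample] for sample in batch]


sample = [1.0, 2.0, 3.0]
other = [10.0, 20.0, 30.0]

# The same sample, evaluated alone (batch size 1) and next to another sample.
alone = batch_normalize([sample])[0]
batched = batch_normalize([sample, other])[0]
linear_alone = linear_model([sample])[0]
linear_batched = linear_model([sample, other])[0]

print("batch-level, alone  :", alone)
print("batch-level, batched:", batched)
print("linear, alone       :", linear_alone)
print("linear, batched     :", linear_batched)
```

The batch-level output for `sample` changes once `other` joins the batch, while the linear output is identical in both cases.
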

zhangzw16 commented 4 months ago

The dependence of the results on test_batch_size stems from batch-level operations like batch normalization, rather than from a code-level bug. This should not cause any issues in practice, because test_batch_size needs to be set to 1 anyway to ensure evaluation fairness and avoid information leakage.

Simply put, if the test batch is composed as follows:

x1, x2, x3 -> x4
x2, x3, x4 -> x5
x3, x4, x5 -> x6

Then, if there are any batch-level interactions, the input information will already include future information (i.e. x4 and x5). This leads to information leakage in the time series.
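
The overlap can be checked with a short sketch. It builds the stride-1 windows from the example above and lists which targets also appear among the inputs of the same batch. The `sliding_windows` helper is a hypothetical illustration, not the ProbTS data loader.

```python
def sliding_windows(series, context_length, prediction_length):
    """Return (context, target) pairs with stride 1, as in the x1..x6 example."""
    windows = []
    last_start = len(series) - context_length - prediction_length
    for start in range(last_start + 1):
        split = start + context_length
        context = series[start:split]
        target = series[split:split + prediction_length]
        windows.append((context, target))
    return windows


series = ["x1", "x2", "x3", "x4", "x5", "x6"]
batch = sliding_windows(series, context_length=3, prediction_length=1)

# Everything a batch-level operator could see: the union of all contexts in the batch.
visible = {value for context, _ in batch for value in context}

# Targets that are also present among the inputs of the same batch.
leaked_targets = sorted(t for _, target in batch for t in target if t in visible)

for context, target in batch:
    print(context, "->", target)
print("targets visible in the batch inputs:", leaked_targets)
```

With a batch size of 1, `visible` only ever contains the context of the sample being predicted, so no target can appear in it.
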
