The main discrepancy between the Chronos model and the results in the paper stems from the `var_specific_norm` parameter. After refactoring the code, manually setting this parameter to false yields results close to those in the paper. However, there is still a small inconsistency due to the `test_batch_size`.

In the paper, the Chronos NMAE on the `ettm1` dataset with 96->96 is 0.393. With `var_specific_norm=true` the reproduced result is around 0.42. After setting `var_specific_norm=false`, the reproduced result is 0.394 with `test_batch_size=1` and 0.395 with `test_batch_size=8`; the NMAE difference is not significant.
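For clarity, here is a minimal sketch of what `var_specific_norm` plausibly toggles, judging only by the parameter name: per-variate normalization statistics versus statistics shared across all variates of a window. This is an assumption for illustration, not the repository's actual implementation.

```python
import torch

def normalize(x: torch.Tensor, var_specific_norm: bool, eps: float = 1e-5):
    """Illustrative only: normalize a context window x of shape (time, variates).

    var_specific_norm=True  -> each variate uses its own mean/std.
    var_specific_norm=False -> all variates share one mean/std.
    (An assumption based on the parameter name, not the repo's code.)
    """
    if var_specific_norm:
        mean = x.mean(dim=0, keepdim=True)  # (1, variates)
        std = x.std(dim=0, keepdim=True)
    else:
        mean = x.mean()  # scalar, shared across variates
        std = x.std()
    return (x - mean) / (std + eps)

x = torch.randn(96, 7)  # e.g. an ETT-style context window: 96 steps, 7 variates
print(normalize(x, var_specific_norm=True).std(dim=0))   # ~1 for every variate
print(normalize(x, var_specific_norm=False).std(dim=0))  # varies per variate
```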
`test_batch_size` after PR #17
exp: moirai etth1 96->24 (tb = `test_batch_size`)
Metric | tb=1 | tb=4 | tb=8 | tb=16 |
---|---|---|---|---|
test_CRPS | 0.237 | 0.237 | 0.237 | 0.237 |
test_CRPS-Sum | 0.206 | 0.207 | 0.207 | 0.206 |
test_MASE | 1.683 | 1.678 | 1.687 | 1.683 |
test_MSE | 8.441 | 8.534 | 8.417 | 8.477 |
test_MSE-Sum | 116.864 | 119.159 | 117.220 | 116.743 |
test_ND | 0.293 | 0.293 | 0.296 | 0.293 |
test_ND-Sum | 0.256 | 0.258 | 0.260 | 0.257 |
test_NRMSE | 0.559 | 0.563 | 0.561 | 0.563 |
test_NRMSE-Sum | 0.342 | 0.345 | 0.343 | 0.342 |
test_norm_CRPS | 0.296 | 0.297 | 0.297 | 0.297 |
test_norm_CRPS-Sum | 0.603 | 0.605 | 0.606 | 0.602 |
test_norm_MASE | 1.683 | 1.678 | 1.687 | 1.683 |
test_norm_MSE | 0.163 | 0.165 | 0.162 | 0.163 |
test_norm_MSE-Sum | 2.253 | 2.298 | 2.260 | 2.251 |
test_norm_ND | 0.366 | 0.366 | 0.369 | 0.366 |
test_norm_ND-Sum | 0.744 | 0.755 | 0.757 | 0.747 |
test_norm_NRMSE | 0.697 | 0.703 | 0.700 | 0.702 |
test_norm_NRMSE-Sum | 0.981 | 0.991 | 0.985 | 0.979 |
exp: timer etth1 96->24
Metric | tb=1 | tb=4 | tb=8 | tb=16 |
---|---|---|---|---|
test_CRPS | 0.302 | 0.313 | 0.315 | 0.315 |
test_CRPS-Sum | 0.264 | 0.268 | 0.270 | 0.267 |
test_MASE | 1.663 | 1.702 | 1.700 | 1.716 |
test_MSE | 8.163 | 8.609 | 8.757 | 8.714 |
test_MSE-Sum | 104.652 | 110.948 | 111.694 | 112.216 |
test_ND | 0.302 | 0.313 | 0.315 | 0.315 |
test_ND-Sum | 0.264 | 0.268 | 0.270 | 0.267 |
test_NRMSE | 0.552 | 0.572 | 0.578 | 0.576 |
test_NRMSE-Sum | 0.335 | 0.345 | 0.346 | 0.346 |
test_norm_CRPS | 0.495 | 0.509 | 0.512 | 0.513 |
test_norm_CRPS-Sum | 0.816 | 0.838 | 0.819 | 0.819 |
test_norm_MASE | 1.663 | 1.702 | 1.700 | 1.716 |
test_norm_MSE | 0.382 | 0.401 | 0.399 | 0.404 |
test_norm_MSE-Sum | 5.425 | 5.706 | 5.482 | 5.705 |
test_norm_ND | 0.495 | 0.509 | 0.512 | 0.513 |
test_norm_ND-Sum | 0.816 | 0.838 | 0.819 | 0.819 |
test_norm_NRMSE | 0.743 | 0.768 | 0.771 | 0.772 |
test_norm_NRMSE-Sum | 0.983 | 1.016 | 0.991 | 0.999 |
The test batch size likely affects the metrics because the model contains batch-level operators, such as batch normalization, as sketched below. The problem does not occur with the linear model.
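A minimal PyTorch sketch (not this repo's model) of why a batch-level operator makes a sample's output depend on its batch mates: with batch statistics active, `BatchNorm1d` normalizes each sample using statistics computed over the current batch, so the same sample yields different outputs in different batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=4)
bn.train()  # normalization uses statistics of the current batch

x = torch.randn(1, 4)      # one fixed test sample
other = torch.randn(7, 4)  # unrelated samples

out_small = bn(torch.cat([x, other[:1]]))[0]  # x evaluated in a batch of 2
out_large = bn(torch.cat([x, other]))[0]      # same x in a batch of 8

# False: x's normalized output changes with the composition of its batch,
# which is exactly why metrics drift as test_batch_size grows.
print(torch.allclose(out_small, out_large))
```

A purely per-sample model (such as the linear model mentioned above) computes no cross-sample statistic, which is why it is unaffected.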
The inconsistency across `test_batch_size` values stems from batch-level operations like batch normalization, rather than a code-level bug. In practice it is not a problem, because `test_batch_size` should be set to 1 anyway to ensure evaluation fairness and avoid information leakage.
Simply put, if the test batch is composed as follows:
x1, x2, x3 -> x4
x2, x3, x4 -> x5
x3, x4, x5 -> x6
Then, with any batch-level interaction, the inputs in the batch already expose future values (i.e. x4 and x5) to the earlier windows, leading to information leakage within the time series.
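The same point as a small NumPy sketch: consecutive test windows batched together already contain each other's targets, so any statistic computed across the batch dimension mixes future values into earlier predictions. The series values and the batch mean here are illustrative only.

```python
import numpy as np

series = np.array([1, 2, 3, 4, 5, 6], dtype=float)  # x1 .. x6
context = 3

# Three consecutive evaluation windows batched together (test_batch_size = 3):
inputs = np.stack([series[i : i + context] for i in range(3)])
targets = series[context : context + 3]

print(inputs)   # [[1. 2. 3.]  -> target x4
                #  [2. 3. 4.]  -> target x5 (row already contains x4)
                #  [3. 4. 5.]] -> target x6 (row contains x4 and x5)
print(targets)  # [4. 5. 6.]

# Any cross-batch operation, e.g. a batch mean, feeds x4 and x5 into the
# computation for the first window, whose target is x4: future leakage.
print(inputs.mean(axis=0))  # [2. 3. 4.] -- includes the value of x4
```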
Description
The reproduced results of the Time-Series Foundation Model (TSFM) benchmark do not match the results presented in the paper when following the environment configuration specified in the README.
Specifically, the `data.test_batch_size` parameter significantly impacts the evaluation metrics, leading to discrepancies between batch sizes of 1 and 64.
Reproduction Steps
Run the evaluation with `--data.test_batch_size 1` and with `--data.test_batch_size 64`.
Results on my env
Results for `--data.test_batch_size 1`:
Results for `--data.test_batch_size 64`:
Observed Behavior: The evaluation metrics differ significantly between `--data.test_batch_size 1` and `--data.test_batch_size 64`.
Expected Behavior: The evaluation metrics should be consistent regardless of the test batch size to ensure reproducibility as stated in the paper.
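A check along these lines could make that expectation explicit: evaluate the same windows per sample and as one batch, and assert the predictions match. `model` and `windows` are placeholders for illustration, not this repo's actual API.

```python
import torch

def assert_batch_size_invariant(model, windows: torch.Tensor, atol: float = 1e-5):
    """Check that predictions do not depend on the test batch size.

    `model` and `windows` (shape: num_windows x context x variates) are
    hypothetical placeholders, not this repository's API.
    """
    model.eval()
    with torch.no_grad():
        per_sample = torch.cat([model(w.unsqueeze(0)) for w in windows])
        batched = model(windows)  # the whole set as one batch
    assert torch.allclose(per_sample, batched, atol=atol), (
        "predictions change with batch size -> a batch-level operator is "
        "mixing information across test windows"
    )
```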