huggingface / blog

Understanding Evaluation Metrics for Probabilistic Time Series Forecasting with Transformers: Can GluonTS' make_evaluation_predictions be Compatible? #1651

Open hanlaoshi opened 7 months ago

hanlaoshi commented 7 months ago

"I read the blog post (https://huggingface.co/blog/zh/time-series-transformers) and I'm really confused about the evaluation metrics. I would be extremely grateful if I could get some help. Specifically, I'm unsure about how to use GluonTS' make_evaluation_predictions to evaluate the results obtained from a Transformer model for probabilistic time series forecasting. Can these two be used together, or are they incompatible?"


kashif commented 7 months ago

@hanlaoshi as far as I know, make_evaluation_predictions requires a GluonTS predictor object, which is obviously not available here... so the next best thing is to obtain the targets and forecast samples on the transformers side and call the appropriate metric functions from GluonTS's evaluators.

Also note that the blog post has been updated to build the "back-testing" dataloader via the "validation" splitter when doing back-testing.
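Roughly something like the sketch below, assuming `forecasts` holds the sample paths from the transformers model with shape (num_series, num_samples, prediction_length), `test_dataset` is the GluonTS test split, and `freq` / `prediction_length` come from your setup; the exact SampleForecast arguments depend on your GluonTS version:

import numpy as np
import pandas as pd
from gluonts.evaluation import Evaluator
from gluonts.model.forecast import SampleForecast

forecast_objects, true_series = [], []
for item, samples in zip(test_dataset, forecasts):
    target = np.asarray(item["target"])
    index = pd.period_range(start=item["start"], periods=len(target), freq=freq)
    # the sample paths cover the last `prediction_length` steps of each test series
    forecast_objects.append(
        SampleForecast(samples=np.asarray(samples), start_date=index[-prediction_length])
    )
    true_series.append(pd.Series(target, index=index))

evaluator = Evaluator()
agg_metrics, item_metrics = evaluator(iter(true_series), iter(forecast_objects))
# mean_wQuantileLoss is the CRPS proxy GluonTS reports
print(agg_metrics["mean_wQuantileLoss"], agg_metrics["MASE"])

For a multivariate target, GluonTS's MultivariateEvaluator plays the same role.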

hanlaoshi commented 7 months ago

First off, thanks a bunch for your generous insights! I experimented with the approach from your last response (https://github.com/huggingface/evaluate/pull/509) and found that evaluating the extracted predictions against the last rolling window of the test set (the ground truth) gives results that align with GluonTS's make_evaluation_predictions. On another note, I'm facing a bit of a headache with multivariate probabilistic time series forecasting using either the vanilla Transformer or Informer on the traffic_nips dataset.
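For context, the kind of comparison I mean looks roughly like the sketch below (placeholder names, not my exact script): `forecasts` is the (num_series, num_samples, prediction_length) sample array and `test_dataset` is the GluonTS test split whose targets include the prediction window.

import numpy as np

ground_truth = np.stack(
    [np.asarray(item["target"])[..., -prediction_length:] for item in test_dataset]
)
point_forecast = np.median(forecasts, axis=1)  # per-step median of the sample paths
mse = np.mean((point_forecast - ground_truth) ** 2)
print(mse)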

For instance, with the vanilla Transformer on the traffic_nips dataset and parameters like d_model=32, epochs=48, encoder/decoder ffn_dim=256, attention_heads=4, num_encoder_layers=3, context_length=72, num_batches_per_epoch=100, lr=1e-3, the final CRPS_sum is around 0.170. Judging from the literature, CRPS_sum for transformer-based models should generally be around 0.05.
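For reference, those settings correspond roughly to a config like the sketch below; decoder_layers, prediction_length, input_size, lags_sequence and time_features are placeholders/assumptions rather than values I listed above.

from transformers import TimeSeriesTransformerConfig

config = TimeSeriesTransformerConfig(
    prediction_length=prediction_length,
    context_length=72,
    input_size=num_of_variates,   # multivariate traffic_nips (placeholder)
    d_model=32,
    encoder_layers=3,
    decoder_layers=3,             # assumed to mirror the encoder
    encoder_attention_heads=4,
    decoder_attention_heads=4,
    encoder_ffn_dim=256,
    decoder_ffn_dim=256,
    lags_sequence=lags_sequence,
    num_time_features=len(time_features) + 1,
)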

Given your experience, any advice on what might be causing this and how to improve the results would be highly appreciated!

kashif commented 7 months ago

BTW, can you redo this with the bug fix in the blog (#1558), i.e. use the validation splitter for the back-testing scenario?
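In the back-testing dataloader from the post, the fix roughly boils down to building the sampler in "validation" mode (sketch using the helper names from the blog):

# sample back-test windows with the "validation" splitter, which keeps the
# last prediction window as ground truth, instead of the "test" splitter
instance_sampler = create_instance_splitter(config, "validation")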

hanlaoshi commented 7 months ago

Hey there! Thanks a bunch for your suggestions. I tried out the validation splitter for backtesting on the traffic_nips dataset, and it indeed showed significant improvements in some metrics like CRPS and MSE, as you pointed out.

I'm a bit puzzled, though. When I swapped out "instance_sampler = create_instance_splitter(config, 'test')" for "instance_sampler = create_instance_splitter(config, 'validation')" on the traffic dataset, the improvement in metrics like CRPS and MSE was remarkably noticeable. Could you shed some light on why this specific change makes such a difference?

# Both helpers come from gluonts.transform.sampler and build a PredictionSplitSampler.
from gluonts.transform.sampler import PredictionSplitSampler


def ValidationSplitSampler(
    axis: int = -1, min_past: int = 0, min_future: int = prediction_length
) -> PredictionSplitSampler:
    # `prediction_length` here stands for config.prediction_length, as passed in the
    # blog post (GluonTS's own default for min_future is 0). min_future reserves the
    # last steps of each series as ground truth; allow_empty_interval=True returns no
    # split point, instead of raising, when a series is too short.
    return PredictionSplitSampler(
        allow_empty_interval=True,
        axis=axis,
        min_past=min_past,
        min_future=min_future,
    )


def TestSplitSampler(
    axis: int = -1, min_past: int = 0
) -> PredictionSplitSampler:
    # min_future=0: the split point sits right after the last observation.
    return PredictionSplitSampler(
        allow_empty_interval=False,
        axis=axis,
        min_past=min_past,
        min_future=0,
    )

Also, it seems like using the validation splitter for back-testing only works well on certain datasets like traffic, and not so much on others (e.g., the solar dataset): some metrics (CRPS, MSE) increase substantially, i.e., performance drops. I'm not entirely sure yet, though, because I reused the hyperparameters tuned with the test splitter. I'm still in the middle of experiments and will loop back once I can confirm whether the validation splitter is universally effective. Thanks a ton for your advice, looking forward to hearing from you!

kashif commented 7 months ago

So the test splitter is more for the production use case, in the sense that it just takes the very last context window and starts predicting into the unknown future, while the validation splitter is for the "back-testing" use case, where we take the very last context plus the prediction window (the prediction window is not given to the model) and the model predicts that window.

If the context window size matches the seasonality of the time series, then the predictions from the test splitter might look fine (still wrong, but fine); if there is some mismatch, however, the test splitter's predictions end up out of sync. That is why you see the differences. In any case, use the validation splitter for back-testing.

hope that helps!
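To make that concrete, a tiny sketch of the split point each sampler picks on a toy series, assuming the samplers behave as in the snippet you pasted (exact outputs may differ across GluonTS versions):

import numpy as np
from gluonts.transform.sampler import TestSplitSampler, ValidationSplitSampler

prediction_length = 24
ts = np.zeros(200)  # toy series of length 200

# test splitter: the split point sits right after the last observation,
# so the model predicts into the unknown future
print(TestSplitSampler()(ts))                                    # expected: [200]

# validation splitter: the split point is prediction_length steps earlier,
# so the last window stays available as ground truth for back-testing
print(ValidationSplitSampler(min_future=prediction_length)(ts))  # expected: [176]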

hanlaoshi commented 7 months ago

So, if we're conducting experiments with the validation splitter, is the key to align the context window size with the seasonality of the time series? Could the seasonality be chosen in a principled way via "get_lags_for_frequency(freq)" to avoid blindly tuning context_length? Is my understanding correct? Thanks a lot for your help!
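For instance, something along these lines is what I have in mind (just an illustration):

from gluonts.time_feature import get_lags_for_frequency

print(get_lags_for_frequency("H"))  # traffic_nips is hourly
# for hourly data the returned lags include short recent lags plus lags around
# daily (24) and weekly (168) multiples, which at least hints at the seasonalities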

kashif commented 7 months ago

No, the context window is up to you to select... all I meant was that for back-testing you should use the validation splitter, and that, by coincidence, since the context window lined up with the seasonalities of the time series, the test splitter's output looked reasonable (but was still technically wrong).