ngruver / llmtime

https://arxiv.org/abs/2310.07820
MIT License
673 stars, 157 forks

Prediction length for Monash benchmark #1

Open gorold opened 11 months ago

gorold commented 11 months ago

Hi, may I check how the baseline results for the Monash benchmark (Figure 4, e.g. Wavenet, Transform., DeepAR, etc.) were obtained? From my understanding of the codebase, it is using the huggingface monash_tsf dataset repository to obtain the Monash time series. The prediction length is based on this: https://github.com/ngruver/llmtime/blob/37d0a33ac528e726cf05e6e3996c38a107520fe1/data/monash.py#L43

My concern is that the prediction lengths in the huggingface dataset differ from the default prediction lengths in the Monash benchmark. For example, solar 10 minutes has a prediction length of 60 in the hf dataset, while the Monash baseline results use a prediction length of 1008. Please correct me if I am mistaken about anything here. Thank you!
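To make the mismatch concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the horizon is taken as the length difference between a test series and its train series. That is my reading, not necessarily what the linked line does. The helper names `infer_horizon` and `find_mismatches` are hypothetical, and the toy series are placeholders rather than real data. Only the two horizons (60 and 1008) come from the numbers above.

```python
def infer_horizon(train_target, test_target):
    """Horizon implied by a train/test pair.

    Assumption: the test series is the train series plus the forecast window.
    """
    if len(test_target) <= len(train_target):
        raise ValueError("test series must be longer than train series")
    return len(test_target) - len(train_target)


def find_mismatches(used, reference):
    """Return {name: (used, reference)} for datasets whose horizons differ."""
    return {
        name: (used[name], reference[name])
        for name in used
        if name in reference and used[name] != reference[name]
    }


# Toy series standing in for one solar 10 minutes example (placeholder values).
train = [0.0] * 100
test = train + [0.0] * 60

used = {"solar_10_minutes": infer_horizon(train, test)}
reference = {"solar_10_minutes": 1008}  # horizon behind the Monash baseline results

mismatches = find_mismatches(used, reference)
print(mismatches)  # {'solar_10_minutes': (60, 1008)}
```

Any dataset that shows up in `mismatches` would be evaluated on a different horizon than the baseline it is compared against.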

ngruver commented 11 months ago

Hi Gerald,

Thanks so much for bringing this to our attention. The Monash baseline numbers are taken from the original paper, and it is possible there is a mismatch in our evaluation. I will be on vacation this upcoming week, but I will take a close look the day I get back.

Nate

gorold commented 10 months ago

Hi @ngruver, any updates on this, and are there plans to release an updated results figure/table?

ngruver commented 10 months ago

Hi Gerald, thanks for following up. We've updated the results in the NeurIPS camera ready (https://openreview.net/forum?id=md68e8iZK1). The Monash numbers now include 19 datasets:

covid deaths, solar weekly, tourism yearly, tourism quarterly, tourism monthly, australian electricity demand, pedestrian counts, hospital, fred md, us births, nn5 weekly, nn5 daily, traffic weekly, traffic hourly, saugeenday, cif 2016, bitcoin, weather, sunspot

As you pointed out, solar 10 minutes has a much longer prediction horizon than is represented in the huggingface dataset, so we dropped it from consideration. We corrected the horizons in the other datasets that were inconsistent.

I'm in the process of expanding further to 29 datasets by adding the following 10 to the analysis:

kdd cup, electricity hourly, electricity weekly, m1 yearly, m1 quarterly, m1 monthly, m3 yearly, m3 quarterly, m3 monthly, m3 other

After I finalize those results, I will update the arXiv version.

gorold commented 10 months ago

Thanks for the update!