time-series-foundation-models / lag-llama

Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting
Apache License 2.0

plotting predictions for 1 step ahead forecasting #55

Closed sebasmos closed 4 months ago

sebasmos commented 4 months ago

I was wondering what would be the best way to evaluate 1-step-ahead prediction with Lag-Llama. Say I have a test set with 30 samples I want to do zero-shot forecasting on, and I want to compute the MAE and CRPS. I defined the estimator with num_samples = 100, and then I take the mean of the samples for each prediction, wrapping this in a for-loop that iterates over the data:

import pandas as pd
from gluonts.dataset.pandas import PandasDataset

for idx in range(len(testing_data)):
    # Wrap the single row as a one-sample dataset.
    X_test = pd.DataFrame(testing_data.iloc[idx]).T
    backtest_dataset = PandasDataset(X_test, target="target", freq="T")
    forecasts, tss = get_lag_llama_predictions(
        dataset=backtest_dataset,
        prediction_length=prediction_length,
        device=device,
        context_length=context_length,
        use_rope_scaling=use_rope_scaling,
        num_samples=num_samples,
    )
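For reference, the MAE and a sample-based CRPS estimate can be computed directly from the forecast samples with NumPy. This is a minimal sketch; the function names and array shapes are illustrative, not part of GluonTS or Lag-Llama:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between point forecasts and ground truth."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def crps_from_samples(samples, y_true):
    """Sample-based CRPS, averaged over timesteps.

    samples: array of shape (num_samples, num_timesteps)
    y_true:  array of shape (num_timesteps,)
    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|.
    """
    samples = np.asarray(samples, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    term1 = np.mean(np.abs(samples - y_true), axis=0)
    term2 = 0.5 * np.mean(
        np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1)
    )
    return np.mean(term1 - term2)

# Sanity check: if every sample equals the truth, CRPS is 0.
samples = np.full((100, 5), 2.0)
print(crps_from_samples(samples, np.full(5, 2.0)))  # -> 0.0
```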

With this I get a CRPS near 0.9 and an MAE (abs_error) of around 1.29 - which are quite good results per single sample. This is what it looks like when plotted:

[image: plot of one-step-ahead predictions vs. ground truth]

I would have expected the points to match better at each timestamp. Is there something I am missing, or do you agree that this is the correct way to do zero-shot forecasting with Lag-Llama?

Thanks in advance for any advice!

ashok-arjun commented 4 months ago

Hi!

Thanks for the detailed issue.

The way you evaluate it with a for-loop may not be correct. You can simply set prediction_length to 1 and run the model over the entire test data. It will then be evaluated for one-step prediction at each timestep, i.e. it predicts timestep T using the true history up to T-1, predicts timestep T+1 using the true history up to T, and so on (never conditioning on its own previous predictions).
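A minimal sketch of this rolling one-step scheme, with a naive last-value predictor standing in for Lag-Llama (the helper names are illustrative, not part of the library):

```python
import numpy as np

def rolling_one_step_eval(series, predict_one, min_history=1):
    """Evaluate one-step-ahead forecasts over an entire series.

    At each timestep t, the model sees the true history series[:t]
    (never its own earlier predictions) and predicts series[t].
    """
    preds, truths = [], []
    for t in range(min_history, len(series)):
        preds.append(predict_one(series[:t]))
        truths.append(series[t])
    return np.asarray(preds), np.asarray(truths)

# Naive stand-in model: predict the last observed value.
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
preds, truths = rolling_one_step_eval(series, lambda h: h[-1])
# preds  -> [1., 2., 3., 4.]
# truths -> [2., 3., 4., 5.]
```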

Can you try that?

sebasmos commented 4 months ago

Thanks for your response!

I tried a similar approach where I use the context length as a sliding window at each timestamp, and it gave much better results. With this approach, the model would not use only the context length as lags for each timestep but the entire history up to the current timestamp, right?

ashok-arjun commented 4 months ago

Can you elaborate on what you mean by the context length as a sliding window at each timestamp?

As for your question: the maximum lag taken from the history is the value 1093 timesteps behind the timestep being predicted. Lags can potentially reach beyond the context length; the context length applies to each token independently.
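A small NumPy illustration of lags reaching further back than the context window (the lag set below is illustrative; only the maximum of 1093 comes from the discussion above):

```python
import numpy as np

def lag_features(history, lag_indices):
    """Return the lagged values for the most recent timestep.

    history:     1-D array, oldest value first.
    lag_indices: how many steps back each lag looks (1 = previous value).
    Lags larger than the model's context length are still valid as long
    as the history itself is long enough.
    """
    history = np.asarray(history)
    return np.array([history[-lag] for lag in lag_indices])

history = np.arange(2000.0)                  # 2000 timesteps of history
lags = lag_features(history, [1, 7, 1093])   # 1093 = max lag discussed above
# lags -> [1999., 1993., 907.]
```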

sebasmos commented 4 months ago

Thank you @ashok-arjun. I solved the issue by using the context length as the number of lags for each prediction, which ensured consistent use of historical data across models. I had previously used the prediction_length approach you suggested, but that only works for a single sample, while I needed to compare against several ground truths; the approach above lets me compare sample by sample over my test data.
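A minimal sketch of that sliding-window setup, with a stand-in mean predictor (the helper name, window length, and predictor are assumptions, not Lag-Llama API):

```python
import numpy as np

def sliding_window_one_step(series, window, predict_one):
    """One-step-ahead forecasts using a fixed-length sliding window.

    Each prediction sees exactly `window` past values, so every
    timestep (and every model being compared) gets the same amount
    of history.
    """
    preds = []
    for t in range(window, len(series)):
        preds.append(predict_one(series[t - window:t]))
    return np.asarray(preds)

# Stand-in model: predict the mean of the window.
series = np.arange(10.0)
preds = sliding_window_one_step(series, 3, lambda w: w.mean())
# preds -> [1., 2., 3., 4., 5., 6., 7.]
```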