Examples for forecasting with Transformer model

diegoquintanav commented 3 years ago

Hi! From #125 it's not yet clear to me what a forecasting problem looks like. I've noticed there is a TSForecasting class at the implementation level, which is the same as TSRegression (both are set to ToFloat), but using it breaks other parts of the API.

See https://github.com/timeseriesAI/tsai/blob/aa8b32a50d52692355214c35cb140f586600db66/tsai/data/core.py#L112

How does inference work in a forecasting example? Following the examples, I use learner.get_preds(ds_idx=1), but how does this work internally? It uses fastai.learner.Learner.get_preds, which I have trouble following :sweat_smile:.

In other words, consider the following gist


from tsai.all import *
print('tsai       :', tsai.__version__)
print('fastai     :', fastai.__version__)
print('fastcore   :', fastcore.__version__)
print('torch      :', torch.__version__)

# tsai       : 0.2.18
# fastai     : 2.4.1
# fastcore   : 1.3.20
# torch      : 1.9.0

# df = some df I loaded
X, y = SlidingWindow(window_length, horizon=horizon, get_x=feature_columns, get_y='my_target_var')(df)
splits = get_splits(y, valid_size=.3, stratify=True, random_state=43, shuffle=False)

# TSRegression = ToFloat
# https://github.com/timeseriesAI/tsai/blob/aa8b32a50d52692355214c35cb140f586600db66/tsai/data/core.py#L111
tfms  = [None, [TSRegression()]]
# tfms  = [None, [TSForecasting()]] # breaks other methods
batch_tfms = TSStandardize(by_sample=True, by_var=True, verbose=True)
dls_inc = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms, bs=128, device="cpu")
learn_inc = ts_learner(dls_inc, InceptionTime, metrics=[mae, mape, mse, rmse], cbs=ShowGraph())
learn_inc.fit_one_cycle(50, 1e-2)

# pred
valid_preds_inc, valid_targets_inc = learn_inc.get_preds(ds_idx=1)
valid_preds_inc.flatten().data, valid_targets_inc.data

plt.plot(valid_preds_inc.flatten().data)
plt.plot(valid_targets_inc.data);
print(valid_targets_inc.shape)

And the output of check_data(X, y, splits)

X      - shape: [1089 samples x 136 features x 7 timesteps]  type: ndarray  dtype:float64  isnan: 0
y      - shape: (1089,)  type: ndarray  dtype:float64  isnan: 0
splits - n_splits: 2 shape: [872, 217]  overlap: [False]

Questions:

What is get_preds(ds_idx=1) doing?
How do I change this problem to a forecasting problem? I'm thinking of a single-step and a multi-step forecasting problem on the test set.

Thanks!

oguiza commented 3 years ago

Hi @diegoquintanav,

I've built a quick gist with a dummy dataset to demonstrate how both single-step and multi-step forecasting problems could be implemented with tsai.

As you'll see the key is to build the target with the expected shape. tsai will recognize the shape of the target and will create an output of the same shape.

Questions:

what is get_preds(ds_idx=1) doing? It creates predictions for the dataloader at index 1, which is the validation one (index 0 is the training one). I usually recommend using learn.get_X_preds instead, though. It will decode the predictions if any batch transform is applied.
There's no need to make any changes. This is already a forecasting problem. The type of problem is really determined by the target you choose. If it's a value for the entire time series, then it's a regression problem. If it's the next steps in the sequence, it then becomes a forecasting problem.
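
The point about the target shape can be illustrated without tsai. The following is a minimal NumPy sketch, not tsai's SlidingWindow implementation: `make_windows` is a hypothetical helper and the series is dummy data. It builds inputs of shape [n_samples x n_variables x window_length] and a target whose shape depends only on the horizon.

```python
import numpy as np


def make_windows(series, window_length, horizon):
    """Hypothetical helper: slice a 1D series into (X, y) pairs with stride 1.

    X has shape [n_samples, 1, window_length] (a single variable).
    y has shape [n_samples] when horizon == 1, else [n_samples, horizon].
    """
    n_samples = len(series) - window_length - horizon + 1
    X = np.stack([series[i : i + window_length] for i in range(n_samples)])
    y = np.stack(
        [series[i + window_length : i + window_length + horizon] for i in range(n_samples)]
    )
    X = X[:, None, :]  # add the variable axis
    if horizon == 1:
        y = y[:, 0]  # single-step: one value per sample
    return X, y


series = np.arange(20, dtype=float)  # dummy data: 0, 1, ..., 19

X1, y1 = make_windows(series, window_length=7, horizon=1)
X3, y3 = make_windows(series, window_length=7, horizon=3)

print(X1.shape, y1.shape)  # single-step target
print(X3.shape, y3.shape)  # multi-step target
```

In this sketch the inputs are built the same way in both cases; only the target changes from one value per sample to `horizon` values per sample.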

I'm not sure what you mean when you say "but it breaks other parts of the API.". Could you please elaborate on that?

I'm in the process of creating one or two more detailed examples to demonstrate forecasting with tsai.

diegoquintanav commented 3 years ago

Hi @oguiza, and thanks for the reply!

About the ds_idx=1 argument: it means that the predictions are made for the split at index 1 of the splits array, which is the validation set. In other words, it is equivalent to passing X[splits[1]] to get_X_preds, as shown in the notebook. Let me know if my understanding is correct.
The main difference I see for multi-step forecasting is that horizon > 1. How does one then interpret the plots in the last cell?

for example, in
```
>>> valid_decoded_preds[0]
tensor([0.4506, 0.4616, 0.2979])
```
Are these three values the forecast for t+1, t+2, and t+3, when t=0, and thus valid_decoded_preds[1] contains the same but for t=1?
How would you go about using all inputs from 0 until t, and producing an autoregressive forecast for the entire validation set? In other words, start out with the first window, and reuse the output of the forward pass on that window, for the next window, and so on.
What are the meanings of valid_preds, valid_targets_inc, valid_decoded_preds in the output of get_X_preds? More specifically, what is the difference between decoded and not decoded preds?
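
One way to read a row of multi-step predictions can be sketched as follows, under the assumptions that the windows were built with stride 1 and that the validation split is contiguous (shuffle=False). `target_steps` is a hypothetical helper, not part of tsai; it only computes which time indices the targets of a given window refer to.

```python
def target_steps(row, window_length, horizon, first_index=0):
    """Hypothetical helper: time indices covered by the targets of one window.

    `row` is the position of the window in the (contiguous) array and
    `first_index` is the time index at which the first window starts.
    """
    last_input = first_index + row + window_length - 1  # this is "t"
    return [last_input + k for k in range(1, horizon + 1)]  # t+1 ... t+horizon


# With window_length=7 and horizon=3, consecutive rows overlap by two steps.
print(target_steps(0, window_length=7, horizon=3))  # [7, 8, 9]
print(target_steps(1, window_length=7, horizon=3))  # [8, 9, 10]
```

Under these assumptions, row 0 holds the forecasts for t+1, t+2 and t+3 of the first window, and row 1 holds the same for a window shifted by one step, so a given time step can appear in up to `horizon` different rows.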

About the API breaking, I can't reproduce the issue right now, but some methods implemented in the learner were not working if I used TSForecasting instead of TSRegression. Don't mind that for now. I will open another issue if I can reproduce the problem again.

Thanks again for your time and for the library too!

edit: I understood the meaning of splits[1]
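
For reference, the indexing X[splits[1]] can be sketched with plain NumPy. The split below is built by hand as an assumption for illustration (a contiguous 80/20 split of dummy data); it is not produced by tsai's get_splits.

```python
import numpy as np

X = np.random.rand(10, 1, 7)  # dummy data: [n_samples x n_variables x timesteps]

# Hand-made contiguous split: element 0 holds training indices, element 1 validation indices.
splits = (list(range(0, 8)), list(range(8, 10)))

X_train = X[splits[0]]
X_valid = X[splits[1]]  # the samples that ds_idx=1 refers to

print(X_train.shape, X_valid.shape)  # (8, 1, 7) (2, 1, 7)
```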

diegoquintanav commented 3 years ago

So I think an autoregressive forecast would look something like this

# name aliasing for local referencing
seq_in_len = window_length
seq_out_len = horizon

# empty placeholder
preds = np.repeat(np.nan, len(splits[1]) + seq_in_len + seq_out_len - 1)

# seed first values
preds[:seq_in_len] = X[splits[1]][0].flatten()

for ix in range(len(splits[1])):
    new_x = preds[: ix + seq_in_len]  # all values up to the current step
    _, _, _valid_decoded_preds_inc = learn_inc.get_X_preds(new_x)
    # get size of last output
    _h = _valid_decoded_preds_inc.flatten().shape[0]
    # replace values in placeholder array
    preds[ix + seq_in_len : ix + seq_in_len + _h] = _valid_decoded_preds_inc.flatten()

If I do this with the InceptionTime model, I get something like

fig, ax = plt.subplots(figsize=(15, 7))
ax.plot(preds, label="decoded_preds (AR)")
ax.plot(X[splits[1]][0, 0, :].tolist(), marker="o", label="first window")
ax.plot([np.nan]*window_length + valid_decoded_preds_inc[:, 0].tolist(), label="decoded_preds (Window ahead)")
ax.plot([np.nan]*window_length + y[splits[1]][:, 0].tolist(), label="targets")
fig.legend()

Which looks more like an autoregressive forecast (values are not relevant). Tell me what you think :+1:!

oguiza commented 3 years ago

Hola Diego,

I believe it's correct, but I don't fully understand everything in your code. For example, with this code:

for ix in range(len(splits[1])):
    new_x = preds[: ix + seq_in_len]

new_x grows with every iteration. That doesn't make much sense to me if you have trained the model with inputs of equal length.

Having said that, in my experience, results are better with a multiple output forecast (that is, creating the forecast for the entire horizon simultaneously). It'd also be easier to create in tsai. You just need to pass a target with the desired horizon length.

diegoquintanav commented 3 years ago

Right. Considering I'm doing a forecast at time t[i], I will produce t[i+1:i+horizon]. The model was trained using t[i-window_length:i], so a better way would be something like

new_x = preds[ix: ix + seq_in_len]

That fixes the input length to window_length. I'm not sure which one is better though. It is true that the growing window is not the way the model was trained, and the fixed window produces a totally different output that does not damp over time.
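
A minimal sketch of a fixed-window recursive forecast for a univariate series follows. `predict_fn` is a stand-in assumption for a trained one-step model (in this thread that would be a call such as learn_inc.get_X_preds on a properly shaped input); here it is a toy function returning the mean of the window so that the example runs on its own.

```python
import numpy as np


def recursive_forecast(first_window, predict_fn, n_steps):
    """Roll a one-step forecaster forward, always feeding the last `window_length` values.

    first_window: 1D array of observed values used to seed the forecast.
    predict_fn:   callable mapping a 1D window to a single predicted value (assumed).
    n_steps:      number of future steps to generate.
    """
    window_length = len(first_window)
    history = list(first_window)
    for _ in range(n_steps):
        window = np.asarray(history[-window_length:])  # fixed-length input
        history.append(float(predict_fn(window)))  # the prediction becomes an input
    return np.asarray(history[window_length:])


def toy_model(window):
    # Stand-in for a trained model: predicts the mean of the window.
    return window.mean()


seed = np.array([1.0, 2.0, 3.0])
forecast = recursive_forecast(seed, toy_model, n_steps=4)
print(forecast)
```

Every input has the same length as the seed window, which is the property discussed above. Errors can still accumulate, because each prediction is fed back in as if it were an observation.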

About the multiple output forecast (or multi-step forecast), I believe that the case I'm proposing is the recursive multistep forecast (2)

prediction(t+1) = model(obs(t-1), obs(t-2), ..., obs(t-n))
prediction(t+2) = model(prediction(t+1), obs(t-1), ..., obs(t-n))

and what you suggest is number (4), by setting a large enough number in the horizon argument. In this case, I have many questions about the model itself (I have had trouble understanding the underlying fastai API for training :sweat_smile:)

For the case of the TransformerModel, does this mean that the encoder is fed window_length data and the decoder is fed horizon during training? What happens during inference?
I'd still need to know the meaning of the outputs, as I posted before: what does each element in valid_decoded_preds mean, and what does decoded mean?
Transformer complexity is O(L^2), so I wonder how good an idea it is to set a large horizon.
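
To put the O(L^2) remark in numbers, here is a rough sketch. It assumes L is the length of the sequence that self-attention sees and counts only the attention score matrix for one sample; the function name and its defaults are illustrative, not taken from any library.

```python
def attention_scores_bytes(seq_len, n_heads=1, bytes_per_value=4):
    """Rough size of the self-attention score matrices for one sample.

    Each head stores a seq_len x seq_len matrix; bytes_per_value=4 assumes float32.
    """
    return n_heads * seq_len * seq_len * bytes_per_value


for L in (7, 100, 1000):
    print(L, attention_scores_bytes(L))
```

Doubling the sequence length quadruples this figure, which is where the quadratic cost comes from.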

oguiza commented 3 years ago

I'll try to answer your questions. But first, I have a few comments:

You train a model to perform a certain task to have confidence in the predictions it generates. That's why, once the model is trained, you want to use it in the same way. Otherwise, you don't know how well it's performing.
I don't understand why you say you don't know which model is best. What does the mae or whatever metric you care about tell you?
You are indeed applying the 2nd method, which is fine. I'm just telling you that in my experience (with my own datasets) I've generally found method 4 better. But this may or may not be the case with your dataset. It's always good to understand the differences and try both approaches. I always learn a lot when I test alternative approaches.
For any problem, the forecast horizon is given to you. You know how much history you want to use, and how many steps in advance you need to predict. I don't know which these are in your case.

As to your questions:

TransformerModel doesn't have a decoder; it only has an encoder. You only need to ensure X (input) and y (output) have the desired shape: X is [n_samples x n_variables x history], and y is [n_samples x horizon] for univariate or [n_samples x n_variables x horizon] for multivariate targets. tsai will automatically create a head that generates the expected output shape. This applies to all models, not just TransformerModel. You can also, for example, try this approach with InceptionTime or TST.
In the case of a regression task, the first value is the prediction. The second is the target (if you pass a y to get_X_preds), otherwise None. The third is the decoded prediction. That is, if you apply a reversible transform to the target, it will be reversed. If you don't apply any, the first and third values will be the same (they usually are).
The horizon defines the length of the target, and the target is not passed through the model, so there is no need to worry about that. You only need to worry about the amount of history used. If that is too long, you may have a memory issue, as you rightly say. In that case, I'd recommend using a CNN like InceptionTime.
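
The difference between raw and decoded predictions can be sketched with a reversible target transform. This is an illustration in plain NumPy under stated assumptions, not tsai's implementation: `StandardizeTarget` is a hypothetical transform, and the model output is simulated.

```python
import numpy as np


class StandardizeTarget:
    """Hypothetical reversible transform: standardize on encode, undo on decode."""

    def __init__(self, y):
        self.mean, self.std = y.mean(), y.std()

    def encode(self, y):
        return (y - self.mean) / self.std

    def decode(self, y_scaled):
        return y_scaled * self.std + self.mean


y = np.array([10.0, 20.0, 30.0, 40.0])  # dummy targets in original units
tfm = StandardizeTarget(y)

# A model trained on encoded targets predicts in the encoded (scaled) space.
# Here the raw predictions are simulated as the encoded targets themselves.
raw_preds = tfm.encode(y)
decoded_preds = tfm.decode(raw_preds)  # back in the original units

print(raw_preds)
print(decoded_preds)
```

Without any transform on the target, encoding and decoding do nothing and the raw and decoded predictions coincide, which matches the remark that the first and third outputs are usually the same.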

diegoquintanav commented 3 years ago

Hey, thanks for answering! Everything is clearer now. I will close the issue.

timeseriesAI / tsai

Examples for forecasting with Transformer model #159