gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Time-series forecasting #23

Closed · noshad-vida closed this issue 1 year ago

noshad-vida commented 1 year ago

Hi, how can I set up the code for time-series forecasting? What would be the necessary changes? Thanks

gzerveas commented 1 year ago

Hi, it depends a bit on your modeling assumptions. The simplest way would be to predict the final one or more (let's say T) values/time steps per variable, conditioned on all previous time steps: p(x_{t+T}, ..., x_{t+1} | x_t, x_{t-1}, ..., x_0).

It is very straightforward to implement this: you would simply need to write a new masking function analogous to noise_mask, let's call it forecast_mask, to replace the one in line 234. Instead of randomly generating masked segments with geometrically distributed lengths along each variable, this mask simply needs to mask the last 1, 2, ..., T values (depending on how long you want the prediction horizon to be) across all sequences - just make sure you don't include padding in this. To wrap this up nicely, I would create a near-clone of ImputationDataset called ForecastDataset, which would invoke this forecast_mask, and a new forecasting task.
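For concreteness, a minimal sketch of such a mask, assuming the same (seq_length, feat_dim) array interface and the same "0 means masked" convention as noise_mask (the horizon argument here plays the role of T):

```python
import numpy as np

def forecast_mask(X, horizon):
    """Boolean mask of the same shape as X (seq_length, feat_dim):
    0 ("masked", i.e. to be predicted) at the last `horizon` time steps
    of every variable, 1 everywhere else. Padding is assumed to have
    been excluded already, as in the imputation pipeline."""
    mask = np.ones(X.shape, dtype=bool)
    mask[-horizon:, :] = False  # hide the final `horizon` steps across all variables
    return mask
```

A ForecastDataset would then look just like ImputationDataset, except that its __getitem__ calls forecast_mask instead of noise_mask.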

The above modeling implies that each of the T predicted values is generated independently of the others. If you would like to condition each predicted value on all previous values, including the ones your model has just generated, you would need to generate a shrinking attention mask to be applied on the input.

During training, each sample would be predicted in T steps ("substeps"): assuming an input sequence length of L, in the first substep the mask would hide all final T values, and the model would be asked to predict p(x_{t+1} | x_t, x_{t-1}, ..., x_0), with t = L-T, i.e. p(x_{L-T+1} | x_{L-T}, ..., x_0). In the next substep, t = L-T+1, the input mask would shrink and hide only the last T-1 values, and the model would now have to predict p(x_{L-T+2} | x_{L-T+1}, x_{L-T}, ..., x_0).

As you can see, during training, regardless of what the model actually predicted in previous substeps, we make it use the actual correct values (this is called "teacher forcing"). During inference, where we don't know the correct values, the model uses its own previously predicted values to condition the generation of the future values. This second approach is a bit more involved in terms of implementation but may work better in practice.
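A rough illustration of this substep scheme (hypothetical helper functions; the model is assumed to map a (batch, L, feat_dim) input to a (batch, L, feat_dim) output, and padding masks are omitted for brevity):

```python
import torch
import torch.nn.functional as F

def substep_training_loss(model, X, horizon):
    """Teacher-forced training sketch: at substep i, the input keeps the
    first L - horizon + i ground-truth values, the rest are hidden, and
    the model is trained to predict the earliest hidden value.
    X: (batch, L, feat_dim) ground-truth tensor."""
    L = X.shape[1]
    losses = []
    for i in range(horizon):
        t = L - horizon + i            # index of the step to predict at this substep
        X_in = X.clone()
        X_in[:, t:, :] = 0             # shrinking mask: hide x_t, ..., x_{L-1}
        pred = model(X_in)             # (batch, L, feat_dim)
        losses.append(F.mse_loss(pred[:, t, :], X[:, t, :]))  # inputs are ground truth (teacher forcing)
    return torch.stack(losses).mean()


@torch.no_grad()
def autoregressive_forecast(model, X_past, horizon):
    """Inference sketch: each new step is conditioned on the model's own
    previous predictions. X_past: (batch, L - horizon, feat_dim) of
    observed values."""
    batch, _, feat_dim = X_past.shape
    pad = torch.zeros(batch, horizon, feat_dim,
                      device=X_past.device, dtype=X_past.dtype)
    X = torch.cat([X_past, pad], dim=1)  # (batch, L, feat_dim)
    L = X.shape[1]
    for i in range(horizon):
        t = L - horizon + i
        pred = model(X)
        X[:, t, :] = pred[:, t, :]       # feed the prediction back as input for the next substep
    return X[:, -horizon:, :]            # the T forecast values
```

Here zeroing out the hidden input positions plays the role of the shrinking mask; an equivalent alternative would be to shrink the attention mask itself at each substep rather than overwriting the input values.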