sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Time varying known reals #700

Open TDominiak opened 3 years ago

TDominiak commented 3 years ago

I wonder whether it is possible to use a prediction for one of my features as time_varying_known_reals.

Let us assume that, based on historical data, I would like to forecast energy production for the next 8 hours. One of my features is the irradiation forecast. For each time step, I know the forecasted values for the next 8 hours.

For example, I can easily encode the hour as time_varying_known_categoricals, but what should I do with a feature that changes with each step, like my irradiation predictions? Can I include this prediction in my training data?

Thanks

georgeblck commented 3 years ago

If you encode the hour as time_varying_known_categoricals, you do actually know its value in the future. If you have an irradiation forecast and encode it as time_varying_known_reals, you do actually know the forecast in the future. However, you do not know the actual irradiation value, and that is the one that most strongly correlates with energy production (I am assuming we are talking about solar energy). The forecast contains noise that on average will weaken the correlation with energy production.

So you have three options:

  1. Use the actual irradiation values as time_varying_unknown_reals and ignore irradiation completely in the prediction. This could sometimes make sense, but is generally a bad idea.
  2. Use the actual irradiation values as time_varying_known_reals. This is cumbersome during training because the validation metrics underestimate the error: in real life, when you predict the energy, you have the forecasted irradiation (incl. noise) and not the actual values you had during training. So this problem needs to be solved.
  3. Use the forecasted irradiation values as time_varying_known_reals, i.e. don't use the actual values at all. This only works if you have all the past irradiation forecasts available. Usually past forecasts are not as easily available as actual values, but you said you have them. This way you don't have to change any part of the training process.

You have described the last option and I also think it makes the most sense. When predicting you only have the forecasted values available, so it is best to train the relationship between forecasts (incl. noise) and energy production. The second option is more valid in a statistical sense, as you train the relationship between irradiation and energy output, but your predictive performance will always be limited by the quality of the irradiation forecasts. If, however, you have extremely good irradiation forecasts, it might be worth looking into the second option.
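As a rough illustration of the third option, here is a minimal sketch that passes an archived forecast as a known real. The data are synthetic, and the column names (`plant_id`, `irradiation_forecast`, `energy`) and the encoder length of 24 are assumptions made for this example, not something from the thread.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for one plant (illustrative values only).
rng = np.random.default_rng(0)
n_hours = 96
hour = np.arange(n_hours) % 24
irradiation_actual = np.clip(np.sin((hour - 6) / 12 * np.pi), 0, None)

df = pd.DataFrame(
    {
        "time_idx": np.arange(n_hours),
        "plant_id": "plant_0",          # hypothetical group column
        "hour": hour.astype(str),       # categoricals are passed as strings
        # Archived forecast = actual + noise; the actuals are NOT given to the model.
        "irradiation_forecast": irradiation_actual + rng.normal(0, 0.05, n_hours),
        "energy": 5.0 * irradiation_actual + rng.normal(0, 0.1, n_hours),
    }
)


def make_dataset(data: pd.DataFrame):
    """Build a TimeSeriesDataSet that treats the forecast as a known real."""
    from pytorch_forecasting import TimeSeriesDataSet  # imported lazily

    return TimeSeriesDataSet(
        data,
        time_idx="time_idx",
        target="energy",
        group_ids=["plant_id"],
        max_encoder_length=24,          # assumed history length
        max_prediction_length=8,        # the 8-hour horizon from the question
        time_varying_known_categoricals=["hour"],
        time_varying_known_reals=["irradiation_forecast"],
        time_varying_unknown_reals=["energy"],
    )
```

Calling `make_dataset(df)` requires pytorch-forecasting to be installed. The point of the sketch is only that the noisy forecast column, not the actual irradiation, is what the model sees in both training and prediction.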


Also, don't forget that when using the third option you need to deliberately choose which irradiation forecast goes into your model. E.g. for the irradiation at 11am you have one actual value but 8 forecasted values: the forecast from 8 hours ago, 7 hours ago, etc. You want to choose the forecast distance that resembles your actual prediction task.
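One simple way to make that choice explicit is to keep the forecast archive in long format and pick a single lead time per valid time. The sketch below uses made-up values, and the column names (`issued_at`, `valid_at`, `lead`) are hypothetical.

```python
import pandas as pd

# Long-format archive of forecasts: one row per (issue time, valid time).
forecasts = pd.DataFrame(
    {
        "issued_at": [0, 0, 0, 1, 1, 1, 2, 2, 2],
        "valid_at": [1, 2, 3, 2, 3, 4, 3, 4, 5],
        "irradiation_forecast": [0.50, 0.60, 0.70, 0.62, 0.71, 0.80, 0.69, 0.82, 0.90],
    }
)
forecasts["lead"] = forecasts["valid_at"] - forecasts["issued_at"]


def forecast_at_lead(archive: pd.DataFrame, lead: int) -> pd.Series:
    """Keep, for every valid time, the forecast issued `lead` steps earlier."""
    picked = archive.loc[archive["lead"] == lead, ["valid_at", "irradiation_forecast"]]
    return picked.set_index("valid_at")["irradiation_forecast"].sort_index()


# One value per valid time, always taken from the forecast made two steps before.
two_step = forecast_at_lead(forecasts, lead=2)
```

A fixed lead is only one possible choice. It keeps one forecast value per time step, which is what a single known-real column can hold.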

ruuttt commented 2 years ago

I am working on a similar issue. I have past temperature forecasts available + measurements of the actual temperature. I would like to go for the third option as mentioned by georgeblck. In my dataframe, I have a reference datetime column (containing the moment the forecast was made by the exogenous meteo model) and a valid datetime column, which contains the time in the future for which the forecast was made. A simplified version of my data frame is:

   ref_time_idx  valid_time_idx  temp_measured  temp_forecasted
0             0               0           10.1             10.2
1             0               1         np.nan             10.3
2             0               2         np.nan             10.4
3             1               1           10.0             10.4
4             1               2         np.nan             10.5
5             1               3         np.nan             10.3

I think it makes sense to add another column named ahead with the following definition: df["ahead"] = df["valid_time_idx"] - df["ref_time_idx"]

I would really appreciate if you could help me with answering the following two questions:

  1. Is it ok to use np.nan like this?
  2. Is it ok to use the following parameters when initializing the TimeSeriesDataSet object?
    target="temp_measured",
    time_idx="ref_time_idx",
    time_varying_known_categoricals=["ref_time_idx", "ahead"],
    time_varying_known_reals=["temp_forecasted"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=["temp_measured"]

georgeblck commented 2 years ago

First of all: as long as you are just experimenting with variables, there is no okay or not okay. Feature engineering in time series forecasting is even more nebulous than it is in computer vision etc., because prediction tasks are so different and there is no fixed wisdom about how much domain knowledge is needed to optimize your model. Try it out and see how your loss changes. The most important thing is that you have a robust testing environment so that you can always trust and compare the test error of your different runs. This matters especially when your test is in essence a backcast, i.e. you are simulating past predictions and have to replicate the exact past circumstances.
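A testing setup of this kind is often built as a rolling-origin backtest, where every test window lies strictly after its training data. The following is a minimal sketch of that idea; the function name and the window sizes are assumptions for illustration.

```python
def rolling_origin_splits(n_steps, initial_train, horizon, step):
    """Yield (train_indices, test_indices); the test window always follows training."""
    origin = initial_train
    while origin + horizon <= n_steps:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step


# 48 time steps, first origin after 24 steps, 8-step horizon, origin moves by 8.
splits = list(rolling_origin_splits(n_steps=48, initial_train=24, horizon=8, step=8))

for train_idx, test_idx in splits:
    # Only information up to the origin may be used to build features for test_idx.
    assert max(train_idx) < min(test_idx)
```

When known reals come from weather forecasts, replicating past circumstances also means using only forecasts that had already been issued at each origin.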


It is a good question that you are asking: how can one utilise the multiple step-ahead predictions in weather forecasting. What you are trying to do right now is train a model to forecast temperature for multiple time-steps. I am not even sure if that as a whole is a good idea because the distance between forecast and measurement influences the prediction itself so much (in meteo systems). E.g. forecasting weather/irradiation/etc. less than 3 hours ahead - nowcasting - is an entirely different beast than forecasting two days ahead.

The biggest problem is that you cannot use ref_time_idx as time_idx because it is not unique. There are at least two ways to solve this:

  1. Use the variable ahead as group_ids.
  2. Generate a new incremental time_idx for that same parameter. Use ahead as time_varying_known_reals. (As long as a variable measures an intensity I usually put it in reals.)

In both cases I would also


So to answer your two questions:

  1. It does not make sense to use np.nan as you are doing it.
  2. As far as I know, it should not work because time_idx is not unique.
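The uniqueness problem can be checked directly on the simplified data frame from the question. This is a small pandas sketch; the variable names introduced for the checks are illustrative.

```python
import numpy as np
import pandas as pd

# The simplified data frame from the question.
df = pd.DataFrame(
    {
        "ref_time_idx": [0, 0, 0, 1, 1, 1],
        "valid_time_idx": [0, 1, 2, 1, 2, 3],
        "temp_measured": [10.1, np.nan, np.nan, 10.0, np.nan, np.nan],
        "temp_forecasted": [10.2, 10.3, 10.4, 10.4, 10.5, 10.3],
    }
)
df["ahead"] = df["valid_time_idx"] - df["ref_time_idx"]

# ref_time_idx repeats, so it cannot identify a time step on its own.
ref_is_unique = not df["ref_time_idx"].duplicated().any()

# Within each `ahead` group, valid_time_idx identifies a row uniquely.
valid_unique_per_ahead = not df.duplicated(["ahead", "valid_time_idx"]).any()

# Rows where the target is missing and would have to be filled or dropped.
n_missing_target = int(df["temp_measured"].isna().sum())
```

Here `ref_is_unique` is False while `valid_unique_per_ahead` is True, which is why grouping by ahead makes the time index usable.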

Is your final task really predicting the true value of temp? That would be different than what OP was talking about - forecast energy production with an irradiation forecast.

ruuttt commented 2 years ago

@georgeblck, thanks for your concise answer. Very helpful. My final task is to predict icing on wind turbine blades using weather forecasts + measurements from humidity, temperature and precipitation sensors which have been installed on a hundred turbine nacelles. We have used camera observations to determine ice growth on the blades, so these can be used to train the model.

I am very much aware of the large difference between nowcasting and day-ahead forecasting. I plan to make 3 separate PyTorch Forecasting models: (1) 3 hours ahead, (2) 24 hours ahead, (3) 5 days ahead. Each model should forecast approximately 40 (so multiple) timestamps (for the last two models, I will downsample the data). Ideally I would like to have multiple columns as target, but I understand that this is (currently) not possible. My first model will have temperature as target, and for follow-up models I plan to have ice growth observations as a target.
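The downsampling mentioned for the longer horizons could look roughly like this. It is a sketch on synthetic 10-minute readings; the sampling rate, the averaging and the helper name `downsample` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute sensor readings over 5 days (illustrative values only).
index = pd.date_range("2022-01-01", periods=5 * 24 * 6, freq="10min")
sensor = pd.DataFrame({"temp_measured": np.linspace(-2.0, 3.0, len(index))}, index=index)


def downsample(frame: pd.DataFrame, horizon: str, n_steps: int = 40) -> pd.DataFrame:
    """Average readings into bins so that `horizon` spans about `n_steps` rows."""
    bin_width = pd.Timedelta(horizon) / n_steps
    return frame.resample(bin_width).mean()


day_ahead = downsample(sensor, "24h")  # 24 h / 40 steps = 36-minute bins
five_days = downsample(sensor, "5D")   # 5 days / 40 steps = 3-hour bins
```

Averaging is only one aggregation choice. For precipitation a sum, and for icing a maximum, might be more appropriate.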

As baseline I plan to use:

  1. the weather forecast + empirical relation between temp/humidity/precipitation and icing to make a simple icing forecast without any machine learning algorithms.
  2. linear interpolation of last two sensor measurements for the first 3 hours.
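The second baseline is read here as extending the straight line through the last two measurements into the future, which is an assumption about what is meant. A minimal sketch with made-up readings:

```python
import numpy as np


def linear_extrapolation(t_prev, y_prev, t_last, y_last, t_future):
    """Extend the straight line through the last two measurements to future times."""
    slope = (y_last - y_prev) / (t_last - t_prev)
    return y_last + slope * (np.asarray(t_future, dtype=float) - t_last)


# Last two temperature readings, 10 minutes apart (times in minutes, made-up values).
baseline = linear_extrapolation(
    t_prev=0.0, y_prev=1.0, t_last=10.0, y_last=0.5, t_future=[20.0, 30.0, 60.0]
)
```

Such a baseline drifts quickly, so it is probably only meaningful for the shortest horizon.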

I could not find a PyTorch Forecasting sample project in which time-varying known features that depend on prediction time are used. That's strange to me, because in many model applications the knowns will depend on prediction time. @jdb78, is this possible at all with PyTorch Forecasting? Would you advise going for the option to use the ahead column as group_ids?

@georgeblck, your second solution to use ahead as time_varying_known_reals does not work since I have multiple forecasts per timestamp. If I understand you correctly, implementing your first solution (ahead as group_ids) would look like this:

   ref_time_idx  valid_time_idx  ahead  location_id  temp_measured  temp_forecasted
0             0               0      0            0           10.1             10.2
1             0               1      1            0           10.0             10.3
2             0               2      2            0            9.9             10.4
3             1               1      0            0           10.0             10.4
4             1               2      1            0            9.9             10.5
5             1               3      2            0            9.8             10.3

    target="temp_measured",
    time_idx="valid_time_idx",
    group_ids=["ahead", "location_id"],
    time_varying_known_categoricals=["valid_time_idx"],
    time_varying_known_reals=["temp_forecasted"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=["temp_measured"]

ref_time_idx is obsolete and will be deleted (it was kept only to show what changed compared to the previous table). Would it be a problem that group_ids=["ahead","location_id"] would have a size of approximately 40x100=4000 elements?
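To get a feel for that group count, one can build a frame with the same key structure and count groups and series lengths. This sketch uses a synthetic frame with only three time steps per group, which is an arbitrary assumption.

```python
import itertools

import pandas as pd

n_ahead, n_locations, n_steps = 40, 100, 3  # n_steps is arbitrary for this sketch

rows = [
    {"ahead": a, "location_id": loc, "valid_time_idx": a + t}
    for a, loc, t in itertools.product(range(n_ahead), range(n_locations), range(n_steps))
]
frame = pd.DataFrame(rows)

groups = frame.groupby(["ahead", "location_id"])
n_groups = groups.ngroups                    # number of separate series
shortest_series = int(groups.size().min())   # rows in the shortest series
```

Each (ahead, location_id) pair is treated as its own series, so the length of the shortest series matters as much as the number of groups: every series has to cover the encoder plus prediction length to contribute samples.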

georgeblck commented 2 years ago

What do you mean when you say you would like to have multiple columns as a target? Do you mean a multiple and multivariate prediction? And is that with regard to your three proposed models?

Could you also clarify what you mean by "time varying known features that depend on prediction time"? In practical time series forecasting you always predict at a certain time and use the most recent weather prediction that is available. Incorporating previous predictions of your parameters will amplify the noise that these weather forecasts contain and only minimally improve your model. If you have enough data, you can check how much variance in the measurements is explained by old forecasts when you already have the most recent ones. I would not expect an interesting result, but the reality of weather forecasts is that they depend on the big weather models, and those are calculated at certain times. So there is a weird time factor here, but one that will be complicated to exploit.
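The variance check suggested above can be sketched as a comparison of two least-squares fits, one with the most recent forecast only and one with an older forecast added. The data here are synthetic with independent forecast errors, so the numbers say nothing about real weather forecasts; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic example: both forecasts are the truth plus noise, the older one noisier.
actual = rng.normal(10.0, 3.0, n)
recent_forecast = actual + rng.normal(0.0, 1.0, n)
old_forecast = actual + rng.normal(0.0, 2.0, n)


def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ coef
    return 1.0 - residual.var() / target.var()


r2_recent = r_squared([recent_forecast], actual)
r2_both = r_squared([recent_forecast, old_forecast], actual)
gain_from_old = r2_both - r2_recent  # extra variance explained by the old forecast
```

With real data, `gain_from_old` should be evaluated out of sample, since in-sample R^2 can never decrease when a feature is added.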

Having explained this, let me rephrase my suggestions. Using ahead as group_ids will make your code run, but I don't think it is sensible. At the moment you make your prediction, the ahead variable will indeed vary across the multiple timesteps you predict, but the most recent forecast is still the best you can do at that moment.

Also from my empirical knowledge of weather and energy forecasts, the distance between forecast and measurement (your ahead-variable) has a marginal effect on the forecasting quality when you aggregate by hours and stick to a distance of 2-20 hours ahead. Again, caused by the calculation schedule of the big weather models and the fickleness of weather itself.
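Whether the lead time matters for a given data set can be checked by grouping forecast errors by the ahead variable. This is a small sketch with made-up forecast/measurement pairs:

```python
import pandas as pd

# Made-up forecast/measurement pairs at three lead times.
pairs = pd.DataFrame(
    {
        "ahead": [1, 1, 2, 2, 3, 3],
        "temp_measured": [10.0, 9.0, 10.0, 9.0, 10.0, 9.0],
        "temp_forecasted": [10.2, 8.9, 10.4, 8.7, 9.5, 9.6],
    }
)
pairs["abs_error"] = (pairs["temp_forecasted"] - pairs["temp_measured"]).abs()

# Mean absolute error of the weather forecast per lead time.
mae_by_ahead = pairs.groupby("ahead")["abs_error"].mean()
```

A flat `mae_by_ahead` curve over the relevant range would support treating all lead times alike; a steep one would argue for modelling the lead time explicitly.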

I'm just saying all this because I know of no one who approaches the problem in the way you have proposed. Having said that, it is an interesting question that warrants some thinking. The quickest path to an answer lies in trying out two versions and seeing how they perform.