vividfog / nordpool-predict-fi

A Python app and a Random Forest ML model that predicts spot prices for the Nordpool FI market.
MIT License

Restrict model updates once Day Ahead prices are published #17

Open Teme-V opened 4 months ago

Teme-V commented 4 months ago

The predicted day-ahead price is updated when the actual prices become available. That gives too optimistic a picture when reviewing prediction history: on 14.3.2024 the prediction changed a lot once the actual prices were available.

Before Day Ahead prices are published: [image]

After Nordpool Day Ahead prices are available: [image]

vividfog commented 4 months ago

This week has been full of zig zags. I suspect the lack of training on the effect of the Sun may play a role. Maybe the strikes and shutdowns lower the demand so much that the normal pricing patterns don't apply the same way. Maybe more still. This is worth a check when there's more time to do so.

I'm not aware of a data source that has useful Sun history and 5 day prediction at a scale which works as a single column for a country. Or a few columns at most. I use Solcast at home. That's just one address and a few days ahead. Need something more representative than that.

Restricting model updates to only happen at 15:00 might be a partial solution but more data would help even more. I'll watch if it returns to sanity after the demand side normalizes. Sun related prediction would provide more visibility into the production side.
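The 15:00 gating could be as simple as a wall-clock check before each published prediction refresh. A minimal sketch; the cutoff time and function name are illustrative, not the repo's actual code:

```python
from datetime import datetime, time

# Nord Pool publishes day-ahead prices in the early afternoon CET.
# Illustrative cutoff: only refresh the published prediction before that,
# so the stored history reflects genuine ahead-of-time forecasts.
PUBLICATION_CUTOFF = time(15, 0)  # 15:00 EET wall clock, an assumption

def should_update_prediction(now: datetime) -> bool:
    """Return True if the displayed prediction may still be updated today."""
    return now.time() < PUBLICATION_CUTOFF

# Example: a 16:30 run would skip the update
print(should_update_prediction(datetime(2024, 3, 14, 16, 30)))  # False
```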

sjksp commented 4 months ago

Actual prices are determined based on weather forecast available the previous day before 12:00 CET. Actual weather during the day of "delivery" doesn't matter.

For example, the single hour 2024-03-14 00:00 EET at 1,8 c/kWh (Nordpool) stands out as abnormally cheap; that price was determined 24 hours before the next hour's price (3,3 c/kWh) was set. In those 24 hours the weather forecast changed.

We can estimate the magnitude of this input data error by looking at the intraday and mFRR market data. Over the past two days volumes have been around 900 MW in intraday and 150..200 MW in mFRR. So, in other words, if the model is internally estimating the amount of production based on weather, even if it estimates it perfectly, it's still over 1000 MW "wrong" for the purposes of forming the price.

Announced shutdowns on the consumption side account for about 550 MW. For comparison, in the middle of February it was larger, at over 600 MW. On the supply side, announced unavailability is about 2500 MW, of which the model should be aware of the 1960 MW of nuclear (OL3 + OL1), leaving about 500 MW hidden from the model... So the "hidden" consumption and supply unavailability kinda even out to 0?
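Written out, the back-of-the-envelope bookkeeping above looks like this (all MW figures are the ones quoted in this comment):

```python
# Volumes that a day-ahead, weather-based production estimate cannot see (MW)
intraday_volume = 900
mfrr_volume = 150                      # lower end of the 150..200 MW range

hidden_volume = intraday_volume + mfrr_volume
print(hidden_volume)                   # 1050 -> "over 1000 MW wrong"

# Announced unavailability (MW)
consumption_shutdowns = 550            # demand side
supply_unavailable = 2500              # total announced supply outages
known_nuclear = 1960                   # OL3 + OL1, visible to the model

hidden_supply = supply_unavailable - known_nuclear
print(hidden_supply)                   # 540 -> "about 500 MW hidden"

# The hidden demand and supply effects roughly cancel out
print(consumption_shutdowns - hidden_supply)   # 10 -> "kinda evens out to 0"
```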

(For forming the actual price, outages announced after 12:00 CET will not impact the following day.)

Sun has contributed about 400 MW in Finland in recent days, but significantly more in the Baltics, although the effect on FI is somewhat limited by the outage on Estlink2. Had Estlink2 been operational, prices last week would've been lower during the day and higher during the night. If I remember correctly, the model was erring roughly in that direction last week anyway?

Maaxion commented 4 months ago

The main issue with the model accuracy was highlighted in my issue from a few weeks ago, that got dismissed. The way the model is trained is leaking test data into the training set - the model is possibly quite overfit.

To fix it, the train/test split should be adjusted, and more than one year of data should be added to the model. There also seems to be a lot of data removed from the dataset prior to training as "outliers", which IMO might be very damaging to the model accuracy, specifically for times like now when the price is very low. These prices most likely do not exist in the training set (I haven't checked).

Before adding a bunch more features to the model, it'd be useful to get the basics right. Would also be interesting to see how a GLM performs, as a lot of the data relations are linear.
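A leakage-free alternative to a random split is a strictly chronological one, where the test set lies entirely in the future relative to the training set. A minimal sketch, not the repo's actual training code:

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows so the test set is strictly in the future.

    Unlike a random split, no test-period information can leak into
    training. `rows` must already be sorted by timestamp.
    """
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

# Example with hourly rows for ten days
rows = [f"2024-03-{d:02d}T{h:02d}" for d in range(1, 11) for h in range(24)]
train, test = chronological_split(rows, test_fraction=0.2)
print(len(train), len(test))   # 192 48
print(test[0])                 # 2024-03-09T00 -- test starts after training ends
```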

pkautio commented 4 months ago

Due to strikes, consumption is over 500 MW less than it should be, which affects the spot price in a significant way. This consumption reduction is also available from Entso-E and could be used to improve the forecast. I will write a module in the coming days to estimate the consumption reduction based on Entso-E data.
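Such a module could be as simple as comparing realized load against a baseline built from comparable past hours. A sketch with made-up numbers; in the real module the load series would come from the ENTSO-E transparency platform:

```python
# Sketch of estimating strike-driven consumption reduction.
# All MW figures below are invented for illustration only.
baseline_load = [9500, 9400, 9600, 9800]   # same hours from comparable weeks
observed_load = [8900, 8850, 9100, 9250]   # hours during the strike

reductions = [b - o for b, o in zip(baseline_load, observed_load)]
avg_reduction = sum(reductions) / len(reductions)
print(avg_reduction)  # 550.0 -> in the ballpark of the "over 500 MW" cited above
```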

The other improvement needed is a solar production forecast. It already has an impact during the daytime.

Maaxion commented 4 months ago

> Actual prices are determined based on weather forecast available the previous day before 12:00 CET. Actual weather during the day of "delivery" doesn't matter.

Apart from the odd way the model fundamentals are coded, this is another significant factor. Right now the model retrieves historical weather data (for one location?) when, as you say, it should be using historical weather predictions (probably the 0600 GMT model runs?).

The actual weather (or at least the short-term weather forecast) affects only the intra-day market, not the day-ahead market.
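Using archived forecasts rather than realized weather means picking, for each delivery day, the latest run issued before day-ahead gate closure. A sketch with illustrative data structures (timestamps kept naive for brevity):

```python
from datetime import datetime

# Illustrative forecast archive: (issue_time, payload). In the real app this
# would hold archived forecast runs, not observed weather.
forecast_runs = [
    (datetime(2024, 3, 13, 6, 0), "run A"),   # 06:00 GMT run
    (datetime(2024, 3, 13, 11, 0), "run B"),  # last run before the gate
    (datetime(2024, 3, 13, 14, 0), "run C"),  # too late for day-ahead
]

def forecast_for_auction(runs, gate_closure: datetime):
    """Latest forecast issued strictly before day-ahead gate closure."""
    eligible = [r for r in runs if r[0] < gate_closure]
    return max(eligible, key=lambda r: r[0])[1]

gate = datetime(2024, 3, 13, 12, 0)  # 12:00 CET gate closure
print(forecast_for_auction(forecast_runs, gate))  # run B
```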

sjksp commented 4 months ago

Eyeballing a graph over 2023-01 to the present day, there are roughly 25 days where consumption unavailability exceeds 500 MW. If the model fits the price well in those periods, wouldn't that suggest it's overfit?

I don't know about GLM, but from data/dump.csv a "dummy with excel/libreoffice" (me) can achieve a prediction model with a 3.6 cents/kWh average error.
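An average-error figure like that is just the mean absolute error of a predictor, so any baseline can be scored the same way for comparison. A sketch; the prices below are made up for illustration:

```python
# Score a predictor by mean absolute error (c/kWh), the same metric as the
# "3.6 cents/kWh average error" above. Values here are invented.
actual =    [4.1, 3.3, 1.8, 5.0, 6.2, 7.9]
predicted = [3.8, 4.0, 3.3, 4.1, 5.5, 9.0]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(round(mae, 2))  # 0.87
```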

Besides nuclear and consumption unavailability, unavailabilities of other forms of production could be beneficial: a column each for "biomass", "peat", "fossil gas", "fossil coal" etc. This would've been especially helpful for predicting the high prices of 2024-01-05 (as would using day-ahead forecasted weather instead of actual weather).
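The per-fuel unavailability columns could be built by summing announced outages per fuel type. A sketch with invented records; only the 1960 MW nuclear figure matches the OL3 + OL1 number mentioned earlier:

```python
from collections import defaultdict

# Made-up outage announcements: (fuel_type, unavailable_mw)
outages = [
    ("nuclear", 1600), ("nuclear", 360),
    ("fossil gas", 200), ("biomass", 120), ("fossil coal", 150),
]

# One feature column per fuel type, as suggested above
columns = defaultdict(int)
for fuel, mw in outages:
    columns[fuel] += mw

print(dict(columns))
# {'nuclear': 1960, 'fossil gas': 200, 'biomass': 120, 'fossil coal': 150}
```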

vividfog commented 4 months ago

> The main issue with the model accuracy was highlighted in my issue from a few weeks ago, that got dismissed. The way the model is trained is leaking test data into the training set - the model is possibly quite overfit.
>
> To fix it, the train/test split should be adjusted, and more than one year of data should be added to the model. There also seems to be a lot of data removed from the dataset prior to training as "outliers", which IMO might be very damaging to the model accuracy, specifically for times like now when the price is very low. These prices most likely do not exist in the training set (I haven't checked).
>
> Before adding a bunch more features to the model, it'd be useful to get the basics right. Would also be interesting to see how a GLM performs, as a lot of the data relations are linear.

I'll come back to the specific suggestions in the topic later.

@Maaxion I didn't dismiss your feedback, but rather went back and forth for a long time, testing many of the things you mentioned, and reported the results too. If you believe, or even better, know there's a fundamentally better way to approach this modeling, please fork the code or data or both and show the results. I've tried several different modeling options and overfit or not, this did work the best so far. If anyone creates a better prediction.py, I will adopt it.

You mention a possible leakage problem, resulting in a possibly overfit model. I've tried several different splits; the model gets worse with the alternatives, I believe because there's no longer enough data to learn from.

You suggest going back in history to get more data. Curating and cleaning old data sets is a lot of boring work, and I do accept data contributions in the form of predictions.db or dump.csv if someone wants to go back to 2022 and before. Personally I'm not keen on doing so, because the geopolitical situation changed so much in 2022 that it was a year of crazy events. I'm not convinced it helps the model; it could do just the opposite. The only way to know for sure is to try.

You mention the outlier management, but if you check the code, it's specifically taking a lot of the outlier-ness in, and only very conservatively dropping data from the history. The web page says the model is trained multiple times a day, not once at 6:00; it's actually trained every six hours starting at 3 AM. You mention that there's one weather point, but it's 5+5, so that's ten, and they're chosen in a specific way. Please double-check the source if you make strong statements.

You earlier believed that there might be plenty of autocorrelation and temporal patterns in the data, but to my best data analysis, those factors play a minor role, which is also supported by the zig-zag nature of intraday patterns and often of long-term trends too. Among other factors, time-related labels are very low in feature importance. I don't think I can convince you otherwise on this point; the data is open for analysis in this regard. For good measure, all model data is shuffled before training. I tried many different shuffles, with and without month/weekday data. Only the hour makes a significant dent in the results, but then again, correctly predicting intraday prices is a bonus, not the goal.

So if you want to inspire progress in the thread, referring to "go back to basics" is not helpful. Data/code contribution is, as are some of the more specific pointers you've given.

You observe that prediction should use "old" weather forecasts instead of realized weather, because the price formation does so too. That is a useful, specific critique with specific advice, and one I will follow when I have time to work on it. Right now I'm leaning towards tackling the Sun first. I hope this thread doesn't lead to a long back-and-forth on fundamentals; the way to tackle that is to re-open the previous thread and contribute data, code or evals there.

pkautio commented 4 months ago

I believe the data prior to 2023 is useless for the training, since so much has happened after that - e.g.:

* OL3

* Wind power capacity increase

* Closure of Ringhals 1 & 2 from Southern Sweden

* Closure of several CHP plants

* Consumption change due to increased usage of Spot prices

Maaxion commented 4 months ago

> I believe the data prior to 2023 is useless for the training, since so much has happened after that - e.g.:
>
> * OL3
>
> * Wind power capacity increase
>
> * Closure of Ringhals 1 & 2 from Southern Sweden
>
> * Closure of several CHP plants
>
> * Consumption change due to increased usage of Spot prices

I very much disagree: including older data is very important, especially for the main seasonality effect (consumption ~7 GW in the summer, ~14 GW in the winter), the effect of consumption increasing as cold lingers, holidays such as Christmas, etc.

By restricting the data to 2023 onwards you leave out a lot of recurring events that will still happen this year, and the next.

There are many strategies to deal with black swan events in time series predictions, but throwing out data should be a very last resort.
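One such strategy is to downweight old data rather than delete it, for example with exponential time-decay sample weights passed to the regressor. A sketch; the half-life is an arbitrary illustrative choice, not anything the repo uses:

```python
import math

def time_decay_weights(ages_in_days, half_life_days=365):
    """Exponentially downweight older rows instead of deleting them."""
    return [math.exp(-math.log(2) * age / half_life_days)
            for age in ages_in_days]

# A 1-year-old row gets half the weight of a fresh one, a 2-year-old row a
# quarter -- but no row is thrown away, so rare regimes stay learnable.
w = time_decay_weights([0, 365, 730])
print([round(x, 2) for x in w])  # [1.0, 0.5, 0.25]
```

Most tree-based regressors accept such weights directly, e.g. via a `sample_weight` argument to `fit`.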