Closed pkautio closed 4 months ago
Couple of ideas to improve the forecast:
- Available transit capacity on the FI-SE1, FI-SE3 and FI-EE connections has a major impact on prices under certain conditions
This data should be available from Entso-E as market messages. With the Entso-E-py package:
    from entsoe import EntsoePandasClient
    import pandas as pd

    client = EntsoePandasClient(api_key="")

    start = pd.Timestamp('20230101', tz='Europe/Helsinki')
    end = pd.Timestamp('20241231', tz='Europe/Helsinki')
    country_code = 'FI'  # Finland
    country_code_from = 'FI'  # Finland
    country_code_to = 'SE_1'  # Finland-Northern Sweden

    transit_unavailability = client.query_unavailability_transmission(
        country_code_from, country_code_to, start=start, end=end,
        docstatus=None, periodstartupdate=None, periodendupdate=None)
Run a similar query for every connection and direction.
- Nuclear power capacity forecast
This could be done based on UMM Remit messages. These should be available from Entso-E.
The Entso-E-py package provides a ready-made interface to Entso-E.
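For the transit-capacity idea, querying every connection in both directions could be sketched roughly as below. This is only an illustration: `border_directions` and `fetch_all_unavailability` are hypothetical helper names, and the zone codes `SE_3` and `EE` are assumed to follow the same naming as the `SE_1` code used above. The `client` argument is expected to be an `EntsoePandasClient`; the sketch does not handle the case where a border has no messages in the requested window.

```python
# Neighbouring bidding zones for the three connections named above.
# SE_3 and EE are assumed to follow the same code style as SE_1.
NEIGHBOURS = ["SE_1", "SE_3", "EE"]


def border_directions(home="FI", neighbours=NEIGHBOURS):
    """Return every (from, to) pair, covering both directions of each border."""
    pairs = []
    for zone in neighbours:
        pairs.append((home, zone))  # e.g. FI -> SE_1
        pairs.append((zone, home))  # e.g. SE_1 -> FI
    return pairs


def fetch_all_unavailability(client, start, end):
    """Query each direction once and collect the results by (from, to) pair.

    `client` is expected to be an EntsoePandasClient (hypothetical wiring).
    """
    results = {}
    for src, dst in border_directions():
        results[(src, dst)] = client.query_unavailability_transmission(
            src, dst, start=start, end=end
        )
    return results
```

Keeping the pair list separate from the fetching makes it easy to add or drop a connection later without touching the query code.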
I updated README.md to tell more about how to add a new input data source to the prediction pipeline.
For these two:
Transit lines would be new columns. Update the DB schema, backfill the data, and create a utility function to infer the near-future data when the prediction pipeline runs. Using the "last known" value likely captures most of the effect, as these inputs don't change as quickly or as often as, say, the weather.
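A minimal sketch of such a "last known value" utility, assuming the transit data arrives as an hourly list of `(timestamp, value)` pairs. The function name `extend_with_last_known` and the data layout are hypothetical and not part of the project:

```python
from datetime import datetime, timedelta


def extend_with_last_known(history, hours_ahead):
    """Extend an hourly series by repeating its most recent known value.

    `history` is a list of (timestamp, value) tuples in ascending order;
    a value of None marks an hour with no data.
    """
    known = [value for _, value in history if value is not None]
    if not known:
        raise ValueError("no known value to carry forward")
    last_value = known[-1]
    last_ts = history[-1][0]
    future = [
        (last_ts + timedelta(hours=h), last_value)
        for h in range(1, hours_ahead + 1)
    ]
    return history + future


# Illustrative numbers only: one known hour, one missing hour, three to infer.
history = [
    (datetime(2024, 3, 5, 10), 1200.0),
    (datetime(2024, 3, 5, 11), None),
]
extended = extend_with_last_known(history, hours_ahead=3)
```

The missing hour in `history` is left as-is; only the future hours are filled, so a backfill step would still be needed for gaps in the past.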
Nuclear availability prediction based on UMM messages would be an updated or alternative fingrid.py routine. Refactor the util/fingrid.py module so that, instead of assuming the last-known-good value is the future value, it factors in the effect of UMM Remit messages. A potential pitfall here is deducing which of the recent-past real production numbers already contain the effect of the new message(s), so that we don't add the effect twice. Example: if a message says there's going to be a 1 GW reduction starting "yesterday", how do we know it actually did start yesterday, and that the realised numbers therefore already contain the effect of this message? What if the planned reduction was a bit late and hasn't actually started yet, and the UMM messages are now a stack of messages, akin to a changelog or a commit sequence?
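One possible heuristic for the double-counting pitfall, sketched under strong assumptions: compare realised production since the message's stated start against a pre-start baseline, and treat the reduction as "already in the data" only if a large enough share of it is visible. The function names and the 0.5 threshold are hypothetical choices for illustration, not something the project does today.

```python
def reduction_already_realised(baseline_mw, recent_mw, announced_reduction_mw,
                               threshold=0.5):
    """Guess whether realised production already reflects an announced cut.

    baseline_mw: typical production before the message's stated start
    recent_mw: realised values observed since the stated start
    threshold: share of the announced cut that must be visible (assumption)
    """
    if not recent_mw or announced_reduction_mw <= 0:
        return False
    observed_drop = baseline_mw - sum(recent_mw) / len(recent_mw)
    return observed_drop >= threshold * announced_reduction_mw


def forecast_mw(baseline_mw, recent_mw, announced_reduction_mw):
    """Apply the announced cut only if the realised data does not show it yet."""
    last_known = recent_mw[-1] if recent_mw else baseline_mw
    if reduction_already_realised(baseline_mw, recent_mw, announced_reduction_mw):
        return last_known  # effect is already in the data; do not subtract again
    return last_known - announced_reduction_mw


# 1 GW cut announced for "yesterday", and the data shows it happened on time.
on_time = forecast_mw(4200, [3250, 3200], 1000)
# Same message, but the cut is running late and production is still at baseline.
late = forecast_mw(4200, [4210, 4190], 1000)
```

This does not address ramp periods or stacked, revised messages; it only shows where a "has it started yet?" check could sit.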
Sanity pre-checks might include:
How big is the effect of the transit lines? No way to know other than by trying. The challenge is that there's not much past data for the model to learn what the price effect may be. Transit lines haven't been down that much.
Assume we already had an accurate, non-fragile way to include UMM messages. What's the effect? The current logic waits for the change to actually happen, and then assumes that things will stay like this until the production numbers go up again. In practice this leads to a 6-24 hour period where the predictions are off while the change is happening, either down or up. But then the predictions self-correct, as the new ground truth becomes part of the input. Is the added complexity worth the potential improvement during these change periods?
This makes me personally a bit conservative about adding these as input factors, but I very much welcome the efforts to hack with these ideas and see how they behave with real data.
Here's how:
https://github.com/vividfog/nordpool-predict-fi?tab=readme-ov-file#adding-a-new-data-source
Data from Fingrid doesn't show nuclear availability, it shows realized nuclear production.
Realized nuclear production consists of two factors: a) Is the plant technically able to produce? b) Is anyone interested in purchasing the produced power at the seller's desired price? UMM answers a, realized production answers b.
I imagine the model would do better fed with a, and b is quite possibly entirely redundant.
UMMs can overlap, but in general the "worst" UMM overrides any other message regarding the same asset.
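The "worst message wins" rule could be expressed along these lines. This is a sketch with a hypothetical message layout (dicts with `asset`, `start`, `end` and `available_mw` keys); real UMM data has more fields and would need mapping into this shape first.

```python
from datetime import datetime


def available_mw(messages, asset, at, nominal_mw):
    """Capacity available for `asset` at time `at`.

    Among the messages active at `at`, the lowest available capacity wins.
    With no active message the asset is assumed to run at nominal capacity.
    """
    active = [
        m["available_mw"]
        for m in messages
        if m["asset"] == asset and m["start"] <= at < m["end"]
    ]
    return min(active, default=nominal_mw)


# Illustrative data: a partial derating overlapping a full outage.
messages = [
    {"asset": "UNIT_A", "start": datetime(2024, 3, 8, 0),
     "end": datetime(2024, 3, 10, 0), "available_mw": 400},
    {"asset": "UNIT_A", "start": datetime(2024, 3, 9, 0),
     "end": datetime(2024, 3, 9, 12), "available_mw": 0},
]
during_overlap = available_mw(messages, "UNIT_A", datetime(2024, 3, 9, 6), 900)
```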
a) Planned maintenance data is available from Entso-E. I prepared Python code to gather this data and convert it to a per-hour forecast time series for the next 5 days. The code is almost ready and can be added to this project soon.
b) Nuclear plants are generally always producing electricity with the exception of corner cases when the spot price is negative. For the forecast it should not matter.
This sounds great.
How are you handling the edge cases where there's planned downtime and it has already started but not in full? Gradient curve vs realized data. Same when going up again?
Or if it didn't start in schedule? Or it started early? Or the UMM came late, sudden failure. Updated UMMs as things clear up. Again, merging the info with the realized data as the gradient is happening or is early or late?
What kind of real-world impact does it have if the gradient and edge cases are ignored? Would it still result in a better prediction than a naive extrapolation of the "last known value", on average? Or a worse prediction? Under what scenarios?
Overall, interesting to see how you resolve the touchpoint between realized vs predicted nuclear MW. Thanks for taking the challenge.
Forecast is forecast. It's based on market messages for planned maintenance. The forecast is an hourly time series covering 5 days forward from the current time.
Realised production is realised production. For the price forecast it does not matter if the planned maintenance started a few hours late, since the price has already been fixed. Of course, this produces incorrect training data for future forecasts for individual hours.
Added a nuclear forecast script and opened a pull request. You will need an Entso-E API key to use the script.
The script fetches Planned Unavailability data for Finnish nuclear plants from the Entso-E API and converts the data into a capacity forecast time series.
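The conversion from unavailability messages to an hourly capacity series could look roughly like this. It is not the script from the pull request, just a sketch of the idea: the function name, the tuple layout and the plant names and capacities are all hypothetical, and overlapping outages are resolved by taking the lowest available capacity.

```python
from datetime import datetime, timedelta, timezone


def capacity_forecast(nominal, outages, now, hours=120):
    """Hourly total available MW for the next `hours` hours (5 days by default).

    nominal: {plant: installed MW}
    outages: list of (plant, start, end, available_mw) tuples
    """
    first_hour = now.replace(minute=0, second=0, microsecond=0)
    series = []
    for h in range(hours):
        ts = first_hour + timedelta(hours=h)
        total = 0.0
        for plant, mw in nominal.items():
            # Capacity limits from every outage covering this hour.
            limits = [avail for p, start, end, avail in outages
                      if p == plant and start <= ts < end]
            total += min(limits, default=mw)
        series.append((ts, total))
    return series


# Illustrative plants and numbers, not real capacities.
now = datetime(2024, 3, 7, 12, 30, tzinfo=timezone.utc)
nominal = {"PLANT_A": 1000, "PLANT_B": 500}
outages = [("PLANT_B",
            datetime(2024, 3, 7, 14, tzinfo=timezone.utc),
            datetime(2024, 3, 7, 16, tzinfo=timezone.utc),
            0)]
forecast = capacity_forecast(nominal, outages, now)
```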
Thanks a lot. It's work week again, so review might be pushed towards the end of the week. I will come back with questions if needed.
A quick comment. I saw this generates the forecast and it's straightforward. Excellent. If you feel like it, you can include the code that integrates this to the end to end forecast. Or I can when I get to it.
It looks like my existing nuclear function could by default call this new function, instead of what it does today. But retain support for the old way for a while. That would enable end to end testing to see an A/B comparison with and without unavailability data. For some ML stats.
Or did you already have a view on how you'd like to integrate this into the pipeline? Reading this on mobile currently, sorry if I missed any existing notes on that. @pkautio
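The "new by default, old way retained" idea above could be wired roughly like this. `nuclear_forecast` and `fetch_unavailability` are hypothetical names for illustration, not the actual functions in util/fingrid.py; passing `None` selects the legacy last-known-value behaviour, which also makes an A/B run a one-argument change.

```python
def nuclear_forecast(last_known_mw, hours, fetch_unavailability=None):
    """Return `hours` hourly MW values.

    fetch_unavailability: optional callable taking the number of hours and
    returning hourly MW values built from market messages. None selects the
    legacy behaviour of repeating the last known value.
    """
    legacy = [last_known_mw] * hours
    if fetch_unavailability is None:
        return legacy
    try:
        series = list(fetch_unavailability(hours))
    except Exception:
        # Sketch-level choice: fall back rather than fail the whole run.
        return legacy
    # Guard against a short or long series from the new source.
    return series if len(series) == hours else legacy


# A: legacy path. B: new path with a stand-in for the market-message source.
run_a = nuclear_forecast(2764, 3)
run_b = nuclear_forecast(2764, 3, fetch_unavailability=lambda h: [2772] * h)
```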
ENTSO-E code is now integrated, README updated, and the next prediction will use market messages in a few hours. Hats off to @pkautio for figuring out the ENTSO-E part 👍 ... and if the forecasts are off the wall tomorrow, that's my fault. At the time of writing this, it all worked.
Sample run:
python nordpool_predict_fi.py --train --predict
[2024-03-07 23:12:34] Nordpool Predict FI
Training a new model candidate using the data in the database...
* FMI Weather Stations for Wind: ['ws_101673', 'ws_101256', 'ws_101846', 'ws_101267']
* FMI Weather Stations for Temperature: ['t_101786', 't_101118', 't_100968', 't_101339']
→ Feature Importance:
Feature Importance
t_101339 0.211443
ws_101256 0.181828
t_100968 0.162449
NuclearPowerMW 0.106445
t_101786 0.066850
hour 0.062656
ws_101673 0.047487
day_of_week 0.042330
ws_101846 0.040911
t_101118 0.034386
month 0.027271
ws_101267 0.015943
→ Durbin-Watson autocorrelation test: 2.00
→ ACF values for the first 5 lags:
Lag 1: 1.0000
Lag 2: -0.0014
Lag 3: -0.0237
Lag 4: -0.0202
Lag 5: -0.0028
Lag 6: -0.0080
→ Model trained:
MAE (vs test set): 1.7806310108575483
MSE (vs test set): 17.125934255294478
R² (vs test set): 0.8378969382433199
MAE (vs 10x500 randoms): 1.2451800318949335
MSE (vs 10x500 randoms): 15.15771097880135
R² (vs 10x500 randoms): 0.8548413431241844
→ Model NOT saved to the database but remains available in memory for --prediction.
→ Training done.
Running predictions...
* Fetching wind speed forecast and historical data between 2024-02-29 and 2024-03-12
* Fetching temperature forecast and historical data between 2024-02-29 and 2024-03-12
* Fetching nuclear power production data between 2024-02-29 and 2024-03-12 and inferring missing values
* Fingrid: Fetched 2648 hours, aggregated to 133 hourly averages spanning from 2024-02-29 to 2024-03-05
→ Fingrid: Using last known nuclear power production value: 2764 MW
* ENTSO-E: Fetching nuclear downtime messages...
→ ENTSO-E: Avg: 2772, max: 2772, min: 2772 MW
* Fetching electricity price data between 2024-02-29 and 2024-03-12
→ Days of data coverage (should be 7 back, 5 forward for now): 12
→ Found a newly created in-memory model for predictions
Timestamp PricePredict_cpkWh ws_101256 ws_101267 ws_101673 ws_101846 t_101118 t_101339 t_101786 t_100968 NuclearPowerMW Price_cpkWh
0 2024-02-29 23:00:00+00:00 0.186325 14.2 13.1 11.6 11.2 0.49 0.38 1.57 0.80 4249.545 0.0000
1 2024-03-01 00:00:00+00:00 0.268877 13.8 12.4 11.6 11.3 0.31 0.43 1.62 0.67 4228.760 0.0000
2 2024-03-01 01:00:00+00:00 0.239165 13.5 11.4 11.3 11.1 0.55 0.35 1.70 0.58 4228.825 0.0000
3 2024-03-01 02:00:00+00:00 0.338605 13.4 10.2 11.1 10.9 0.60 0.31 1.60 0.34 4229.235 0.0000
4 2024-03-01 03:00:00+00:00 0.729866 13.0 10.0 10.9 10.4 0.62 0.28 1.55 0.01 4228.350 0.0012
.. ... ... ... ... ... ... ... ... ... ... ... ...
283 2024-03-12 18:00:00+00:00 14.795136 2.3 2.1 4.8 2.0 -4.70 -7.11 -4.98 -5.57 2772.000 NaN
284 2024-03-12 19:00:00+00:00 12.084022 2.2 2.1 4.7 2.1 -5.29 -7.53 -5.28 -6.27 2772.000 NaN
285 2024-03-12 20:00:00+00:00 12.363734 2.0 2.1 4.7 2.3 -5.87 -7.94 -5.59 -6.97 2772.000 NaN
286 2024-03-12 21:00:00+00:00 10.250964 1.9 2.4 4.5 2.5 -4.83 -6.69 -8.43 -4.51 2772.000 NaN
287 2024-03-12 22:00:00+00:00 8.946699 1.8 2.5 4.4 2.6 -5.52 -7.10 -8.61 -5.14 2772.000 NaN
[288 rows x 12 columns]
* Predictions NOT committed to the database (no --commit).