openclimatefix / uk-pv-national-xg

National PV forecasting using Gradient Boosted Methods.

If production is not very accurate #18

Closed peterdudfield closed 1 year ago

peterdudfield commented 1 year ago

See #17.

thomasarmstrong98 commented 1 year ago

Some comments on model improvements and why we will likely see poor out-of-sample performance:

NWP Masking

Currently the masking is designed for the NWP data we trained on, which has dimensions x: 548, y: 704, but the prod NWP data only has dimensions x: 449, y: 633. I'm not sure where the discrepancy comes from yet. At the moment, the model loads the NWP coordinates of the training dataset and resizes the prod image onto them by taking the nearest value. There are a few ways to fix it.
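For reference, the nearest-value resize described above can be sketched roughly as below. This is only an illustration under assumptions, not the repo's code: the function names `nearest_index` and `regrid_nearest` are made up, and the coordinate vectors are evenly spaced placeholders rather than the real NWP coordinates.

```python
import numpy as np


def nearest_index(source_coords, target_coords):
    """For each target coordinate, return the index of the closest source coordinate."""
    source_coords = np.asarray(source_coords, dtype=float)
    target_coords = np.asarray(target_coords, dtype=float)
    # Pairwise absolute distances, shape (n_target, n_source).
    distances = np.abs(target_coords[:, None] - source_coords[None, :])
    return distances.argmin(axis=1)


def regrid_nearest(field, src_y, src_x, dst_y, dst_x):
    """Resample a 2D (y, x) field onto the destination grid by nearest value."""
    rows = nearest_index(src_y, dst_y)
    cols = nearest_index(src_x, dst_x)
    # np.ix_ builds the outer product of row and column indices.
    return field[np.ix_(rows, cols)]


# Placeholder grids with the shapes quoted above (prod: y 633, x 449;
# training: y 704, x 548). Real coordinates would come from the datasets.
prod_y, prod_x = np.linspace(0.0, 1.0, 633), np.linspace(0.0, 1.0, 449)
train_y, train_x = np.linspace(0.0, 1.0, 704), np.linspace(0.0, 1.0, 548)

prod_field = np.random.default_rng(0).random((633, 449))
resized = regrid_nearest(prod_field, prod_y, prod_x, train_y, train_x)
```

Note that this only repeats existing prod values to fill the larger training grid; it does not add information, which is one reason the mask may line up poorly.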

NWP Variables

This is discussed in the README of the repo, but essentially we have to train and infer on exactly the same data. At the moment there are missing NWP variables that are likely causing poor model performance. Any changed or missing variables should be reflected in training before we treat a prod model as final.
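A cheap guard here would be to compare the variables the model was trained on with what is available at inference, and fail loudly on a mismatch. The sketch below is hypothetical: `check_nwp_variables` is not a function in the repo, and the variable names are placeholders, not the actual NWP channel list.

```python
def check_nwp_variables(trained_on, available):
    """Raise if any variable used in training is missing at inference.

    Returns the sorted list of extra variables that the model does not use.
    """
    trained_on, available = set(trained_on), set(available)
    missing = sorted(trained_on - available)
    if missing:
        raise ValueError(f"NWP variables missing at inference: {missing}")
    return sorted(available - trained_on)


# Placeholder variable names, for illustration only.
training_vars = ["var_a", "var_b", "var_c"]
prod_vars = ["var_a", "var_c", "var_d"]

try:
    check_nwp_variables(training_vars, prod_vars)
    error_message = None
except ValueError as err:
    error_message = str(err)
```

With the placeholder lists above, the call raises because `var_b` was used in training but is absent in prod.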

Training Data

I definitely think it would be worthwhile to process and train the model on more data, even if it would just be 2019, although the data goes back further thanks to Jacob.

Night-time

It looks like the model isn't giving a hard 0 for power at night in production, but instead a very low, non-zero value.
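One possible post-processing fix is to force the forecast to zero whenever the sun is below the horizon. The sketch below assumes a solar elevation value (in degrees) is available for each target time; the function name `zero_at_night`, the threshold, and the units are assumptions for illustration rather than anything in the current code.

```python
import numpy as np


def zero_at_night(forecast_mw, solar_elevation_deg, threshold_deg=0.0):
    """Set the forecast to exactly 0 where the sun is at or below the threshold.

    Daytime values are also clipped at 0 so the output is never negative.
    """
    forecast_mw = np.asarray(forecast_mw, dtype=float)
    elevation = np.asarray(solar_elevation_deg, dtype=float)
    is_night = elevation <= threshold_deg
    return np.where(is_night, 0.0, np.clip(forecast_mw, 0.0, None))


# Small night-time residual, a daytime value, and a negative daytime value.
cleaned = zero_at_night([0.3, 120.0, -1.0], [-12.0, 25.0, 5.0])
```

This hides the symptom rather than explaining why the model predicts a small positive value at night, so it is probably best treated as a safety net.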

===========================================================================================

For completeness I'll leave my rambling in this section to look back on.

If the NWP data is stale and the model does not know this, then the forecasts will look wrong. For example, if we perform inference at 3pm but the NWP data we supply from the database has an init_time_utc of 5am, then the +4 hour forecast will use NWPs projected for 9am that morning rather than 7pm that night. A to-do would be to raise an error if the NWP data is too stale.
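A rough sketch of that staleness check, together with the lead-time arithmetic from the example, is below. Everything named here is hypothetical: `StaleNWPError`, `check_nwp_freshness`, `nwp_step_for_horizon` and the 6-hour limit are assumptions chosen for illustration, not values from the repo.

```python
from datetime import datetime, timedelta, timezone


class StaleNWPError(RuntimeError):
    """Raised when the NWP init time is too old to use for inference."""


def check_nwp_freshness(init_time_utc, now_utc, max_age=timedelta(hours=6)):
    """Return the NWP age, or raise if it exceeds max_age (an assumed limit)."""
    age = now_utc - init_time_utc
    if age > max_age:
        raise StaleNWPError(f"NWP init time is {age} old, limit is {max_age}")
    return age


def nwp_step_for_horizon(init_time_utc, now_utc, horizon):
    """NWP lead time (from init) whose valid time equals now + horizon."""
    return (now_utc + horizon) - init_time_utc


# The example above: inference at 3pm with NWP initialised at 5am.
init_time = datetime(2023, 2, 21, 5, 0, tzinfo=timezone.utc)
now = datetime(2023, 2, 21, 15, 0, tzinfo=timezone.utc)

# The +4 hour forecast needs the NWP step valid at 7pm, i.e. step 14h,
# whereas naively taking step 4h gives data valid at 9am.
step = nwp_step_for_horizon(init_time, now, timedelta(hours=4))

try:
    check_nwp_freshness(init_time, now)
    is_stale = False
except StaleNWPError:
    is_stale = True
```

Either approach would work in principle: reject stale data outright, or keep it but index the NWP steps relative to the inference time rather than the init time.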

What is interesting is that this does not seem to happen when we perform out-of-sample mock estimates with the GCP data. For example, on today's date in 2021 (OOS) we get the following:

[image]

On day 0 of prod, results look like the following:

[image]

I am still not certain of the source of this bug; the remedies above are likely not the root problem.

============================================================================================

peterdudfield commented 1 year ago

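SQL query for the latest forecast value per target time:

-- Postgres: DISTINCT ON keeps the first row per target_time,
-- which the ORDER BY makes the most recently created forecast value.
select distinct on (target_time) * from forecast
join forecast_value on forecast.id = forecast_value.forecast_id
join model on forecast.model_id = model.id
where model.name = 'National_xg'
and forecast.created_utc >= '2023-02-21'
order by target_time, forecast_value.created_utc desc;

peterdudfield commented 1 year ago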

Screenshot 2023-02-23 at 09 36 03

Some light comparison.

peterdudfield commented 1 year ago

Screenshot 2023-02-24 at 11 17 01

after a few days

peterdudfield commented 1 year ago

Screenshot 2023-03-13 at 11 10 22