17 see

thomasarmstrong98 commented 1 year ago

Some comments on model improvements and why we will likely see poor out of sample performance:

NWP Masking

Currently the masking is designed for the NWP data that we trained on, which has dimensions x: 548, y: 704, but the prod NWP data only has dimensions x: 449, y: 633. I'm not sure where the discrepancy comes from yet. At the moment, the model loads the NWP coordinates of the training dataset and resizes the prod image onto them by taking the nearest value. There are a few ways to fix it:

Interpolate + extrapolate the prod image instead, although the default xarray method for this does not seem to work (see https://github.com/pydata/xarray/discussions/6189)
Create a separate mask for the prod data and apply this during inference (just needs the (x, y) lat lon coordinates for the data). This can be done using gradboost_pv.preprocessing.region_filtered.generate_polygon_mask.
Retrain the model by altering the training data to select only the coordinates available in the prod NWP dataset.
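
As a rough illustration of the second option, here is a minimal sketch of carrying a training-grid mask over to the prod grid by nearest neighbour. It does not use gradboost_pv.preprocessing.region_filtered.generate_polygon_mask (its signature is not shown here); the function names, the (y, x) mask layout and the choice to mask out prod cells outside the training extent are all assumptions made for the example.

```python
import numpy as np


def nearest_index(train_coords: np.ndarray, prod_coords: np.ndarray) -> np.ndarray:
    """For each prod coordinate, return the index of the closest training coordinate."""
    # Pairwise absolute distances, shape (n_prod, n_train).
    dist = np.abs(prod_coords[:, None] - train_coords[None, :])
    return dist.argmin(axis=1)


def regrid_mask(train_mask, train_x, train_y, prod_x, prod_y):
    """Resample a boolean (y, x) training mask onto the prod grid (hypothetical helper)."""
    iy = nearest_index(train_y, prod_y)
    ix = nearest_index(train_x, prod_x)
    prod_mask = train_mask[np.ix_(iy, ix)]

    # Assumption: prod cells outside the training extent are masked out
    # rather than silently borrowing the value of the nearest edge cell.
    inside_y = (prod_y >= train_y.min()) & (prod_y <= train_y.max())
    inside_x = (prod_x >= train_x.min()) & (prod_x <= train_x.max())
    return prod_mask & inside_y[:, None] & inside_x[None, :]


# Toy grids, far smaller than the real 704 x 548 and 633 x 449 ones.
train_x = np.array([0.0, 1.0, 2.0, 3.0])
train_y = np.array([0.0, 1.0, 2.0])
train_mask = np.zeros((3, 4), dtype=bool)
train_mask[1:, 2:] = True  # region of interest on the training grid

prod_x = np.array([0.9, 2.1, 5.0])  # 5.0 lies outside the training extent
prod_y = np.array([0.1, 1.9])
prod_mask = regrid_mask(train_mask, train_x, train_y, prod_x, prod_y)
```

The resulting mask has the prod grid's shape, so it can be applied at inference without resizing the prod image itself.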

NWP Variables

This is discussed in the README of the repo, but essentially we have to train and infer on the exact same data. At the moment there are missing NWP variables that are likely causing poor model performance. Any changed or missing data variables should be reflected in the training data before a final prod model can be expected to work.
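
One way to make this fail loudly rather than quietly degrade is a check at inference time that the prod variables match the training ones. The following is only a sketch: the function name and the variable names are placeholders, not the actual variable list used by the model.

```python
def check_nwp_variables(expected, available):
    """Raise if any variable used in training is missing at inference (hypothetical helper).

    Returns the sorted list of extra variables that the model was not trained on.
    """
    missing = sorted(set(expected) - set(available))
    if missing:
        raise ValueError(f"NWP variables missing in prod data: {missing}")
    return sorted(set(available) - set(expected))


# Placeholder variable names, for illustration only.
training_variables = ["dswrf", "lcc", "t"]
prod_variables = ["dswrf", "t", "prate"]

try:
    check_nwp_variables(training_variables, prod_variables)
    variables_ok = True
except ValueError as err:
    variables_ok = False
    error_message = str(err)
```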

Training Data

I definitely think it would be worthwhile to process and train the model on more data, even if it would just be 2019, although the data goes back further thanks to Jacob.

Night-time

It looks like the model isn't giving a hard 0 for power at night in production, but instead a very low, non-zero value.

There is functionality to clip forecasts below a threshold in the model configuration, currently set to 0.5%, but maybe that should be increased to 2-3%.
Alternatively, some logic using pvlib data could set PV output to 0 outside certain hours.
This issue is likely caused by the MSE loss function. Perhaps an improvement would be to split training into two models, day and night, so that the magnitude of the daytime errors doesn't distract from learning the night-time 0's.
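
To make the first option concrete, here is a minimal sketch of threshold clipping. It assumes the percentage is a fraction of installed capacity, which is a guess; the function name and signature are made up for the example and are not the repo's configuration API.

```python
def clip_small_forecasts(values, capacity, threshold_fraction=0.005):
    """Set forecasts below threshold_fraction * capacity to exactly 0 (hypothetical helper)."""
    cutoff = threshold_fraction * capacity
    return [0.0 if value < cutoff else value for value in values]


# Illustrative numbers only: a capacity of 1000 and three forecast values.
forecasts = [0.2, 10.0, 100.0]
clipped_current = clip_small_forecasts(forecasts, capacity=1000.0, threshold_fraction=0.005)
clipped_raised = clip_small_forecasts(forecasts, capacity=1000.0, threshold_fraction=0.02)
```

With the 0.5% threshold only the 0.2 value is zeroed; raising it to 2% also zeroes the 10.0 value, which shows the trade-off: a higher threshold cleans up the night but also removes genuine low output around dawn and dusk.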

===========================================================================================

For completeness I'll leave my rambling in this section to look back on. If the NWP data is stale and the model does not know this, then the forecasts will look wrong. For example, if we perform inference at 3pm but the NWP data we supply from the database has an init_time_utc of 5am, then the +4 hour forecast will use NWPs projected for 9am that morning rather than 7pm that night. A to-do would be to raise an error if the NWP data is too stale.
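
A minimal sketch of such a staleness check, using the 3pm / 5am example above. The exception name, the function name and the 6 hour limit are assumptions chosen for illustration; the right limit depends on how often new NWP runs arrive.

```python
from datetime import datetime, timedelta, timezone


class StaleNWPError(RuntimeError):
    """Raised when the newest NWP init time is too old (hypothetical exception)."""


def check_nwp_freshness(init_time_utc, now_utc, max_age=timedelta(hours=6)):
    """Return the age of the NWP run, raising if it exceeds max_age."""
    age = now_utc - init_time_utc
    if age > max_age:
        raise StaleNWPError(f"NWP init time {init_time_utc} is {age} old, limit is {max_age}")
    return age


# The example from the text: inference at 3pm with an NWP run initialised at 5am.
now_utc = datetime(2023, 2, 21, 15, 0, tzinfo=timezone.utc)
init_time_utc = datetime(2023, 2, 21, 5, 0, tzinfo=timezone.utc)
horizon = timedelta(hours=4)

intended_target = now_utc + horizon  # 19:00, what the +4 hour forecast should describe
nwp_step_used = init_time_utc + horizon  # 09:00, what is read if init time is treated as "now"

try:
    check_nwp_freshness(init_time_utc, now_utc)
    is_stale = False
except StaleNWPError:
    is_stale = True
```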

What is interesting is that this does not seem to happen when we perform out-of-sample mock estimates with the GCP data. For example, on today's date in 2021 (OOS) we get the following. On day 0 of prod, results look like the following:

I am still not certain of the source of this bug; the remedies above are likely not the root problem.

============================================================================================

peterdudfield commented 1 year ago

sql query

select distinct on (target_time) * from forecast
join forecast_value on forecast.id = forecast_value.forecast_id
join model on forecast.model_id = model.id
where model.name = 'National_xg'
and forecast.created_utc >= '2023-02-21'
order by target_time, forecast_value.created_utc desc

peterdudfield commented 1 year ago

Screenshot 2023-02-23 at 09 36 03

some light comparison

peterdudfield commented 1 year ago

Screenshot 2023-02-24 at 11 17 01

after a few days

peterdudfield commented 1 year ago

openclimatefix / uk-pv-national-xg

If production is not very accurate #18
