peterdudfield closed this 1 year ago
Some comments on model improvements and on why we will likely see poor out-of-sample performance:
Currently the masking is designed for the NWP data that we trained on, which has dimensions x: 548, y: 704, but the prod NWP data only has dimensions x: 449, y: 633. I'm not sure where the discrepancy comes from yet. At the moment, the model loads the NWP coordinates from the training dataset and resizes the prod image by taking the nearest value. There are a few ways to fix it
gradboost_pv.preprocessing.region_filtered.generate_polygon_mask
This is discussed in the README of the repo, but essentially we have to train and infer on exactly the same data. At the moment there are missing NWP variables, which are likely causing poor model performance. Any changed or missing data variables should be reflected in the training before a final prod model can be considered functional.
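A minimal sketch of how such a train/prod consistency check could look, assuming the variable names and grid sizes of both datasets can be read out beforehand. `NwpSpec` and `compare_nwp_specs` are hypothetical names for illustration, not part of gradboost_pv, and the variable names are placeholders; only the grid sizes come from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NwpSpec:
    """Hypothetical summary of an NWP dataset: variable names and grid size."""
    variables: frozenset
    x_size: int
    y_size: int


def compare_nwp_specs(train: NwpSpec, prod: NwpSpec) -> list:
    """Return a list of readable mismatches; empty if the two specs agree."""
    problems = []
    missing = sorted(train.variables - prod.variables)
    extra = sorted(prod.variables - train.variables)
    if missing:
        problems.append(f"variables missing in prod: {missing}")
    if extra:
        problems.append(f"variables not seen in training: {extra}")
    if (train.x_size, train.y_size) != (prod.x_size, prod.y_size):
        problems.append(
            f"grid mismatch: train x={train.x_size}, y={train.y_size}; "
            f"prod x={prod.x_size}, y={prod.y_size}"
        )
    return problems


# Grid sizes are the ones quoted above; variable names are placeholders.
train_spec = NwpSpec(frozenset({"var_a", "var_b", "var_c"}), x_size=548, y_size=704)
prod_spec = NwpSpec(frozenset({"var_a", "var_b"}), x_size=449, y_size=633)

for problem in compare_nwp_specs(train_spec, prod_spec):
    print(problem)
```

Running a check like this before inference would surface both the missing variables and the grid mismatch explicitly, instead of them showing up only as degraded forecasts.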
I definitely think it would be worthwhile to process and train the model on more data, even if that is just 2019; the data goes back further than that, thanks to Jacob.
It looks like the model isn't giving a hard 0 for power at night in production, but instead a very low, non-zero value.
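One possible remedy, sketched here as an assumption rather than something the repo does, is a post-processing step that forces night-time values to zero. The sketch assumes a solar elevation (in degrees) is available for each target time from some solar-position calculation; `zero_night_forecasts` and its threshold parameter are hypothetical.

```python
def zero_night_forecasts(forecast_mw, solar_elevation_deg, min_elevation_deg=0.0):
    """Return a copy of the forecast with night-time values forced to 0.0.

    A value counts as night-time when the sun's elevation is at or below
    `min_elevation_deg` (an assumed threshold). Daytime values are kept,
    except that negative values are also clipped to 0.0.
    """
    if len(forecast_mw) != len(solar_elevation_deg):
        raise ValueError("forecast and elevation sequences must have equal length")
    return [
        0.0 if elevation <= min_elevation_deg else max(value, 0.0)
        for value, elevation in zip(forecast_mw, solar_elevation_deg)
    ]


# Made-up values: one night-time point, one daytime point, one negative output.
cleaned = zero_night_forecasts([0.3, 120.0, -1.0], [-10.0, 25.0, 5.0])
print(cleaned)
```

This only hides the symptom in the output; if the small night-time values come from a train/prod data mismatch, the underlying cause still needs fixing.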
===========================================================================================
For completeness I'll leave my rambling in this section to look back on.
If the NWP data is stale and the model does not know this, then the forecasts will look wrong. For example, if we perform inference at 3pm but the NWP data we supply from the database has an init_time_utc of 5am, then the +4 hour forecast will use NWPs projected for 9am that morning rather than 7pm that night. A to-do would be to raise an error if the NWP data is too stale.
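A rough sketch of what such a staleness check could look like, using the 5am / 3pm example from above. The function names, the `StaleNWPError` class and the 6-hour limit are all assumptions for illustration; the date is arbitrary.

```python
from datetime import datetime, timedelta, timezone


class StaleNWPError(RuntimeError):
    """Raised when the NWP init time is too old relative to inference time."""


def check_nwp_freshness(init_time_utc, now_utc, max_age=timedelta(hours=6)):
    """Return the age of the NWP run, or raise if it exceeds `max_age`.

    The 6-hour default is an assumed limit, not a value from the repo.
    """
    age = now_utc - init_time_utc
    if age > max_age:
        raise StaleNWPError(
            f"NWP init_time_utc={init_time_utc.isoformat()} is {age} old "
            f"(limit {max_age})"
        )
    return age


def required_nwp_step(init_time_utc, now_utc, horizon):
    """Step from the NWP init time that matches the forecast target time."""
    return (now_utc + horizon) - init_time_utc


# Example from the text: NWP initialised at 05:00, inference at 15:00.
init = datetime(2023, 2, 21, 5, 0, tzinfo=timezone.utc)
now = datetime(2023, 2, 21, 15, 0, tzinfo=timezone.utc)
horizon = timedelta(hours=4)

# Using step = horizon lands on 09:00; the target time is actually 19:00.
naive_valid_time = init + horizon
target_time = now + horizon
step = required_nwp_step(init, now, horizon)  # 14 hours, not 4

try:
    check_nwp_freshness(init, now)
except StaleNWPError as err:
    print(f"refusing to forecast: {err}")
```

The second helper shows the alternative to failing outright: selecting the NWP step relative to the actual init time, so the +4 hour forecast reads the 7pm projection even from an older run.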
What is interesting is that this does not seem to happen when we perform out-of-sample mock estimates with the GCP data. For example, on today's date in 2021 (OOS) we get the following.
On day 0 of prod, results look like the following:
I am still not certain of the source of this bug; the remedies above are likely not the root problem.
============================================================================================
SQL query:
-- latest forecast value per target_time for the National_xg model
select distinct on (target_time) *
from forecast
join forecast_value on forecast.id = forecast_value.forecast_id
join model on forecast.model_id = model.id
where model.name = 'National_xg'
  and forecast.created_utc >= '2023-02-21'
order by target_time, forecast_value.created_utc desc;
some light comparison
after a few days