This repo contains scripts and notebooks for formatting and verifying the backtest data produced by national solar forecasting models from Open Climate Fix.
National solar forecasting backtests can be run using OCF's PVNet and NationalXG models. In the PVNet repo, under `scripts`, there is the file `gsp_run_backtest.py`. This script can be used to run the backtests by setting the models and the date ranges to use.
For PVNet there is one model to run the GSP level forecasts and another model, called the summation model, which is used to aggregate the GSP level forecasts to a national level. The checkpoints for each of these models can be downloaded locally before running the backtest or streamed in from Hugging Face.
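For example, the checkpoints can be fetched ahead of time with the `huggingface_hub` library; the repo IDs below are illustrative placeholders rather than the exact model repos:

```python
# Sketch: download PVNet and summation model checkpoints locally before a backtest.
# The repo IDs and revisions are placeholders - substitute the model repos you intend to use.
from huggingface_hub import snapshot_download

pvnet_dir = snapshot_download(
    repo_id="openclimatefix/example_pvnet_model",  # hypothetical GSP-level model repo
    revision="main",                                # pin a specific commit for reproducibility
)
summation_dir = snapshot_download(
    repo_id="openclimatefix/example_summation_model",  # hypothetical summation model repo
    revision="main",
)

print("PVNet checkpoint downloaded to:", pvnet_dir)
print("Summation checkpoint downloaded to:", summation_dir)
```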
The model requires a specific configuration file called `data_config.yml`, which defines the input data settings used by the model. This configuration file must exactly match the settings used during model training for the backtest to run correctly.
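As a quick sanity check before launching a backtest, the configuration can be loaded and its top-level sections compared against the config saved alongside the trained model. This is only a rough sketch; the file paths are placeholders:

```python
# Sketch: confirm the backtest data_config.yml has the same top-level structure as the
# config the model was trained with. File paths are placeholders.
import yaml

with open("data_config.yml") as f:
    backtest_cfg = yaml.safe_load(f)

with open("trained_model/data_config.yml") as f:  # hypothetical path to the training config
    training_cfg = yaml.safe_load(f)

missing = set(training_cfg) - set(backtest_cfg)
extra = set(backtest_cfg) - set(training_cfg)
print("Sections missing from backtest config:", missing or "none")
print("Unexpected extra sections:", extra or "none")
```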
As backtests can take a long time to run, it is best to use a terminal multiplexer like `tmux` to run them. This allows you to keep the job running even if the SSH connection is lost.
After installing `tmux`, you can create a new session with:

`tmux new -s [SESSION_NAME]`
Then activate the appropriate conda environment. Once you have created and are inside a `tmux` session, you can run the backtest with:

`python run_backtest.py`
The progress of the backtest can be viewed by reconnecting to the `tmux` session with:

`tmux attach -t [SESSION_NAME]`
It can be useful to inspect how much of the machine's resources are being used via `top` or `htop`. This shows the CPU and RAM usage, which is useful when optimising the number of workers and batches.
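If you prefer a running log of usage while tuning the number of workers and batches, a small script like the following (using `psutil`) can be left running alongside the backtest; it is only a sketch:

```python
# Sketch: periodically log CPU and RAM usage while a backtest runs (stop with Ctrl+C).
import time

import psutil

while True:
    cpu = psutil.cpu_percent(interval=1)  # average CPU % over the last second
    mem = psutil.virtual_memory()         # system-wide memory statistics
    print(f"CPU: {cpu:5.1f}%  RAM: {mem.percent:5.1f}% ({mem.used / 1e9:.1f} GB used)")
    time.sleep(30)                        # log every 30 seconds
```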
If the backtest fails with the error `terminate called without an active exception`, this is likely due to memory issues. This can be fixed by reducing the number of workers or batches, or by increasing the machine's resources.

It is worth running a short backtest first and comparing its output against previous results using the `compare_forecast_mae.ipynb` notebook. Previous backtest data can be found in the Google Storage bucket under `solar-pv-nowcasting-data/backtest/`. This helps to validate that things are as expected before kicking off a larger backtest.

For PVNet, the processing and formatting scripts are found in `/scripts/pvnet_prob/` and consist of the four steps below:
1. Compile the raw files (`compile_raw_files.py`)
2. Filter the zarr to a CSV (`filter_zarr_to_csv.py`)
3. Merge and blend (`merge_and_blend.py`)
4. Format the forecast (`format_forecast.py`)

PVNet produces a single NetCDF file (`.nc`) per initialisation time. These files need to be combined together. The script to do this is called `compile_raw_files.py`. This will produce a zarr file containing the data.
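Conceptually, this step looks something like the sketch below, which concatenates the per-initialisation-time NetCDF files and writes a single zarr. The glob pattern and dimension name are assumptions, not the script's actual code:

```python
# Sketch: combine per-init-time NetCDF forecast files into one zarr store.
# The glob pattern and dimension name are assumptions about the file layout.
import glob

import xarray as xr

files = sorted(glob.glob("raw_forecasts/*.nc"))  # one file per initialisation time

# Concatenate along the initialisation-time dimension (name assumed here).
ds = xr.open_mfdataset(files, combine="nested", concat_dim="init_time")

ds.to_zarr("backtest_forecasts.zarr", mode="w")
```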
The `filter_zarr_to_csv.py` script turns the data from a zarr into a CSV, keeping just the national forecast rather than the GSP level forecasts. This needs to be performed for the Intraday and Dayahead forecasts separately.
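In outline, the filtering step looks roughly like this, assuming the national forecast is stored under `gsp_id == 0` and with placeholder file and variable names:

```python
# Sketch: extract the national forecast from the zarr and save it as a CSV.
# The gsp_id convention, file names and variable names are assumptions.
import xarray as xr

ds = xr.open_zarr("backtest_forecasts.zarr")

# gsp_id 0 is assumed to hold the national total.
national = ds.sel(gsp_id=0)

df = national.to_dataframe().reset_index()
df.to_csv("national_forecast_intraday.csv", index=False)
```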
Once the files are in the correct format, the `merge_and_blend_prob.py` script can be used. This merges the two datasets and blends the forecasts based on weightings, defined in the script, for different forecast horizons.
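The blending amounts to a horizon-dependent weighted average, along the lines of the sketch below. The column names, horizon breakpoints and weights are illustrative only, not the values used in the script:

```python
# Sketch: blend intraday and day-ahead forecasts with horizon-dependent weights.
# Column names, horizon breakpoints and weights are illustrative only.
import numpy as np
import pandas as pd

intraday = pd.read_csv("national_forecast_intraday.csv", parse_dates=["init_time", "target_time"])
dayahead = pd.read_csv("national_forecast_dayahead.csv", parse_dates=["init_time", "target_time"])

merged = intraday.merge(
    dayahead, on=["init_time", "target_time"], suffixes=("_intraday", "_dayahead")
)

# Forecast horizon in hours for each row.
horizon_hours = (merged["target_time"] - merged["init_time"]).dt.total_seconds() / 3600

# Weight on the intraday forecast: 1 up to 4 hours, ramping linearly to 0 at 8 hours (illustrative).
w_intraday = np.clip((8 - horizon_hours) / 4, 0, 1)

merged["forecast_norm"] = (
    w_intraday * merged["forecast_norm_intraday"]
    + (1 - w_intraday) * merged["forecast_norm_dayahead"]
)
merged.to_csv("national_forecast_blended.csv", index=False)
```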
The data then needs to be run through a final formatting script, `format_forecast.py`. This script adds the PVLive installed capacity and outputs the final forecast file.
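If the blended forecast is stored normalised by capacity, the final step is roughly as follows. The capacity here is read from a placeholder CSV and the column names are assumptions; the actual script uses the PVLive installed capacity:

```python
# Sketch: attach the installed capacity to the blended forecast and write the final file.
# Capacity comes from a placeholder CSV here; column names are assumptions.
import pandas as pd

forecast = pd.read_csv("national_forecast_blended.csv", parse_dates=["target_time"])
capacity = pd.read_csv("installed_capacity.csv", parse_dates=["target_time"])  # placeholder source

final = forecast.merge(capacity, on="target_time", how="left")

# Convert the normalised forecast (0-1) to MW using the installed capacity.
final["forecast_mw"] = final["forecast_norm"] * final["installed_capacity_mw"]

final.to_csv("national_forecast_final.csv", index=False)
```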
Older scripts can be found in the `/scripts/archived_scripts/` folder. Scripts have been written for interpolating hourly forecasts to half-hourly (`interpolate_30min.py`) and for unnormalising forecasts using the installed capacity from PVLive (`unnorm_forecast.py`).
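As a rough sketch of the interpolation, hourly values can be upsampled to half-hourly with pandas (file and column names are assumptions):

```python
# Sketch: interpolate an hourly national forecast to half-hourly resolution.
# File and column names are assumptions about the CSV layout.
import pandas as pd

df = pd.read_csv("national_forecast_hourly.csv", parse_dates=["target_time"])
forecast = df.set_index("target_time")["forecast_mw"].sort_index()

# Upsample to 30-minute steps and linearly interpolate between the hourly values.
forecast_30min = forecast.resample("30min").interpolate(method="linear")

forecast_30min.to_csv("national_forecast_30min.csv")
```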
Notebooks for verifying the data and comparing forecasts are found in `/notebooks/`:

- `check_blending.ipynb` can be used to verify the blending of the forecasts.
- `check_forecast_consistency.ipynb` can be used to check data quality.
- `compare_forecast_mae.ipynb` can be used to compare the error of the forecasts to previous forecasts and models. Previous forecasts can be moved to the `/data/compare_forecasts` folder to use this notebook.

Check for missing data in the backtest using the `missing_data.py` script. This script checks the data for gaps in the forecasts and outputs a CSV detailing the size and start of the gaps.
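The gap check boils down to comparing consecutive initialisation times against the expected spacing, roughly like this (the column name and the 30-minute spacing are assumptions):

```python
# Sketch: find gaps in the forecast initialisation times and write them to a CSV.
# The column name and expected 30-minute spacing are assumptions.
import pandas as pd

df = pd.read_csv("national_forecast_final.csv", parse_dates=["init_time"])

init_times = df["init_time"].drop_duplicates().sort_values()
expected = pd.Timedelta("30min")

diffs = init_times.diff()
gap_mask = diffs > expected

gaps = pd.DataFrame(
    {
        "gap_start": init_times.shift(1)[gap_mask],  # last init time before the gap
        "gap_size": diffs[gap_mask],
    }
)
gaps.to_csv("missing_data_gaps.csv", index=False)
print(f"Found {len(gaps)} gaps larger than {expected}.")
```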
To name the file in the standardised format, use the `rename_forecast_file.py` script. For model version numbers, the `pvnet_app` version number is used.
After running a backtest, the raw data can be uploaded to Google Storage. The `gsutil` command line tool can be used:

`gsutil -m cp -r [LOCAL_FILE_PATH] gs://[BUCKET_NAME]/[OBJECT_NAME]`
The `-m` flag enables parallel multi-threading, allowing multiple files to be transferred simultaneously, which significantly speeds up the transfer.
Data can then be downloaded onto another machine for processing.
Part of the Open Climate Fix community.