openclimatefix / PVNet

PVnet main repo
MIT License

No nwp_zarr_path: /mnt/disks/nwp/UKV_intermediate_version_7.zarr #37

Closed AndranikUgujyan closed 1 year ago

AndranikUgujyan commented 1 year ago

Describe the bug

I am trying to run PVNet based on the README, but I am encountering an error. Furthermore, there is no adequate description of how to run the code.

The files involved are as follows:

Inside:

* configs/datamodule/ocf_datapipes.yaml
* tests/data/sample_batches/datamodule.yaml

Additionally, these files contain a hard-coded path that does not exist outside the original developer's machine: /home/jamesfulton/repos/.

Has anyone successfully executed the code according to the instructions provided in the README?

peterdudfield commented 1 year ago

Hi @AndranikUgujyan

Unfortunately, that data isn't public and we can't make it public. One way around it would be to use GFS or a different NWP data supplier.

You will probably also need some GSP data. This can be collected from PVLive.
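
If it helps, national or per-GSP outturn can be pulled with the open-source pvlive-api client from Sheffield Solar; a rough sketch, assuming that package's interface (check its README for the exact method names and arguments):

```python
# Hedged sketch: fetch PV outturn from PVLive using the pvlive-api package
# (https://github.com/SheffieldSolar/PV_Live-API). Method names and arguments
# are assumptions based on that package's README.
from datetime import datetime

import pytz
from pvlive_api import PVLive

pvl = PVLive()

# Half-hourly generation between two timezone-aware UTC datetimes.
# entity_type="gsp" with entity_id=0 is the national total.
df = pvl.between(
    start=datetime(2021, 1, 1, 0, 30, tzinfo=pytz.utc),
    end=datetime(2021, 1, 2, 0, 0, tzinfo=pytz.utc),
    entity_type="gsp",
    entity_id=0,
    dataframe=True,
)
print(df.head())  # expect columns like gsp_id, datetime_gmt, generation_mw
```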

AndranikUgujyan commented 1 year ago

Hi @peterdudfield ,

Thank you for your message. I am impressed with the topic you are developing, and I am keen to implement PVNet or a similar AI approach for Armenia. Could you please advise me on where I can start and how I can succeed in this endeavor?

From my understanding, the input for PVNet includes:

  1. GSP ID
  2. GSP History
  3. Solar Coords
  4. 4D satellite data
  5. 4D NWP data

The output is a 1D prediction. However, I find it challenging to delve deeper into the specifics of each data source and of the vector being predicted. Additionally, I am unsure where I can obtain the necessary data for Armenia.

I kindly request your guidance on how to initiate this research and achieve success.

Thank you in advance for your support.

peterdudfield commented 1 year ago
  1. and 2. GSP stands for Grid Supply Point, which is perhaps very UK-specific. If you have national PV output for Armenia, that could be used instead. If you have regional solar generation output for Armenia, you could use that in place of each GSP. Perhaps start with just the national total?
  3. Which solar coordinates are you looking for, or where in the code is this?
  4. Yes, you could use this dataset that we have released.
  5. You could download free GFS data using Herbie first of all (see the sketch below)?
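
For point 5, a minimal Herbie sketch might look like the following (the product name and the variable search string are assumptions; check Herbie's docs):

```python
# Hedged sketch: pull one GFS forecast step with Herbie and read the surface
# downward shortwave radiation field (DSWRF), the most solar-relevant variable.
# Product name and search string are assumptions; check Herbie's documentation.
from herbie import Herbie

H = Herbie(
    "2023-01-01 00:00",    # model run time
    model="gfs",
    product="pgrb2.0p25",  # 0.25-degree global GFS product
    fxx=6,                 # forecast lead time in hours
)

ds = H.xarray(":DSWRF:surface")  # regex over the GRIB index to select one field
print(ds)
```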

Does this help?

AndranikUgujyan commented 1 year ago

Thank you, @peterdudfield , for your assistance. Now everything makes sense.

I am currently trying to run PVNet on Colab. I have used the nowcasting_dataset and downloaded the PV_GSP data. Additionally, I have configured the "should_pretrain = True" parameter to prevent the dataloader from loading the requested NWP and satellite data.

However, I am facing difficulties in completing the training process on Colab.

Could you please advise if it is possible to run pretraining data on Colab? Furthermore, are there any additional documents available that provide explanations on how to train PVNet and make predictions?

P.S. Even after running the predictions, I am having trouble understanding the meaning of the output. Could you please explain what the output signifies?

peterdudfield commented 1 year ago

I don't know if it can be run on Colab, but I would assume so.

@dfulu would be able to advise on how to run the training and how to interpret the predictions. This might be useful in general, and maybe @dfulu you could add this to the README.md?

dfulu commented 1 year ago

Hi @AndranikUgujyan, the PV_GSP data is basically what we try to predict. It is loaded and processed using the ocf_datapipes library.

The PV_GSP dataset contains the PV power output at each timestep for each of the GSPs. A GSP is essentially just a geographical region of the UK, and there are 317 of them. The power output is the sum of PV power generated within each region averaged over each 30-minute period.

In the dataset, the values (generation_mw and capacity_mwp, referred to below) are estimated by PVLive.

We try to predict generation_mw in the future, i.e. the power that these panels will produce later. Currently we actually train the model to predict generation_mw/capacity_mwp, because the capacities of each GSP are quite different. In production we multiply the predictions by capacity_mwp to produce a future forecast of generation_mw. The model predicts the PV output for multiple steps ahead. By default it predicts 16 30-minute steps, so up to 8 hours ahead. This can be changed in the config files, however.
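
As a concrete illustration of the scaling and horizon described above (array names and numbers here are purely illustrative, not PVNet internals):

```python
# Hedged sketch of the target scaling described above: the model is trained on
# generation_mw / capacity_mwp, and predictions are rescaled to MW afterwards.
# Names and numbers are illustrative only.
import numpy as np

capacity_mwp = 100.0                          # installed capacity of one GSP
generation_mw = np.array([10.0, 25.0, 40.0])  # observed half-hourly output

target = generation_mw / capacity_mwp         # what the model is trained on

# Model output: one value per forecast step; 16 half-hourly steps = 8 hours ahead
predicted_fraction = np.linspace(0.05, 0.35, num=16)
forecast_mw = predicted_fraction * capacity_mwp  # back to MW for the forecast
print(forecast_mw)
```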

Currently there is no documentation outside the library itself and what's included in the READMEs. We are a small organisation and we haven't had the capacity to write it up yet.

I can't think of any reason why it shouldn't run on Colab, although I haven't done it myself, so I don't really know.

Finally, I think there should be a better way to train the model using just the PV_GSP inputs than setting should_pretrain = True. The commands in experiment.sh are the ones I have used to train models. These, and the information in the configs, configure the model and the dataset. To remove the satellite and NWP inputs you could set model.include_sat and model.include_nwp to false; this tells the model not to expect satellite or NWP data inputs. You also need to configure the data pipeline not to load the satellite or NWP data. IIRC this can be done by setting datamodule.configuration.nwp.nwp_zarr_path and datamodule.configuration.satellite.satellite_zarr_path to empty strings; then it won't try to load NWP or satellite data.
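
Those overrides can be passed on the run.py command line (as in the command further down), or sanity-checked with Hydra's compose API; a rough sketch, where the config_path and config_name are assumptions about the repo layout:

```python
# Hedged sketch: build the Hydra config with satellite and NWP inputs disabled.
# config_path/config_name are assumptions about this repo's layout; the override
# keys are the ones mentioned above. Assumes Hydra >= 1.2 for version_base.
from hydra import compose, initialize

with initialize(config_path="configs", version_base=None):
    cfg = compose(
        config_name="config",
        overrides=[
            "model.include_sat=false",
            "model.include_nwp=false",
            'datamodule.configuration.nwp.nwp_zarr_path=""',
            'datamodule.configuration.satellite.satellite_zarr_path=""',
        ],
    )
    print(cfg.model.include_sat, cfg.model.include_nwp)
```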

There are some other things in the configs that you might have to change for your own environment. The way it is configured now, it logs model results to wandb. In the configs, which I'd encourage you to explore, you can set the logger to use TensorBoard instead.

dfulu commented 1 year ago

Hi @AndranikUgujyan

You're pretty much along the right lines.

So the current process for training is first to presave batches using the save_batches.py script, as you have tried. This uses the ocf_datapipes library to create batches of data and then saves them out as dictionaries of PyTorch tensors. Each batch is saved to a separate file inside <batch_output_dir>/train or <batch_output_dir>/val.
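
If you want to sanity-check what save_batches.py produced, you can load one of the saved files directly; a rough sketch (the filename pattern is an assumption):

```python
# Hedged sketch: inspect one presaved batch. Each file is expected to be a
# dictionary of PyTorch tensors, as described above; the exact filename pattern
# is an assumption about the output of save_batches.py.
import torch

batch = torch.load("<batch_output_dir>/train/000000.pt", map_location="cpu")

for key, value in batch.items():
    if torch.is_tensor(value):
        print(key, tuple(value.shape), value.dtype)
    else:
        print(key, type(value))
```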

Then we run the training process using the run.py script.


Unfortunately, you likely won't have access to the NWP data at our local path "/mnt/disks/nwp/UKV_intermediate_version_7.zarr". This data comes from the UK Met Office and we are not allowed to share it. Maybe later on, you could try to add in NWP data from the ICON NWP model. We can share this, and it is being archived here on Hugging Face. However, there would likely be some work involved to process this dataset into a format usable by the model.

The satellite data on Hugging Face is the same data that we used to train the model. However, it would probably be easier to pull it from where it is hosted as a Google public dataset here.
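
Whichever NWP or satellite source you end up using, the pipeline expects a zarr store, so it is worth opening it with xarray first to check the variables and coordinates (the path below is just a placeholder):

```python
# Hedged sketch: open a zarr store (NWP or satellite) with xarray to inspect its
# variables, dimensions and coordinates before pointing the config at it.
# The path is a placeholder, not one of our datasets.
import xarray as xr

ds = xr.open_zarr("path/to/your_nwp_or_satellite.zarr")
print(ds)          # variables and dimensions
print(ds.coords)   # e.g. time, x/y or lat/lon coordinates
```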


In terms of your issues, you didn't mention whether you managed to create any batches using the save_batches.py script. Presuming you did, I think you might want something like this to train:

python run.py \
  datamodule=premade_batches \
  datamodule.batch_dir="<your batch dir>" \
  +trainer.val_check_interval=10_000 \
  trainer.log_every_n_steps=200 \
  callbacks.early_stopping.patience=20 \
  datamodule.batch_size=32 \
  datamodule.num_workers=<number of workers based on your number of CPUs> \
  logger.wandb.project="<your wandb project>" \
  logger.wandb.save_dir="<your wandb logging save dir>" \
  callbacks.model_checkpoint.dirpath="<your checkpoint directory>/${model_name}" \
  model_name="<your model name>"

You can see that you'll need to set up directories to store the model checkpoints and training logs. We have set this up to use wandb for logging, so you may want to sign up for a wandb account. For individual use they have a free plan.

This might still cause you some errors: since we are using it for our own research, we haven't tested it outside our own environments. Hopefully this puts you in the right direction, though.