openclimatefix / PVNet

PVnet main repo

Training Process Not Completing Successfully #60

Closed AndranikUgujyan closed 1 year ago

AndranikUgujyan commented 1 year ago

Hi @dfulu, thank you for your response, I really appreciate it.

I am encountering an issue while attempting to train the PVNet model using only GSP data and excluding NWP and satellite images. I followed the steps below to modify the configuration and model settings:

  1. Updated the PVNet/configs/datamodule/ocf_datapipes.yaml file on line 2, changing the configuration path to the local path: /PVNet/configs/datamodule/configuration/gcp_configuration.yaml.
  2. Inside the /content/PVNet/configs/datamodule/configuration/gcp_configuration.yaml file, I changed the value of gsp_zarr_path on line number 10 to the local path.
  3. Next, I went to /PVNet/pvnet/models/multimodal/multimodal.py and set include_sat and model.include_nwp to False (see the sketch after this list).
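
For reference, here is a minimal sketch of what step 3 is trying to achieve, written as config overrides with omegaconf rather than by editing the source file. The key names mirror the include_sat / include_nwp flags above; the real PVNet Hydra config layout is not checked here and may differ:

from omegaconf import OmegaConf

# Illustrative only: toggle the optional input branches off at the config
# level instead of editing multimodal.py directly. Only include_sat and
# include_nwp come from the steps above; everything else is an assumption.
cfg = OmegaConf.create({"model": {"include_sat": True, "include_nwp": True}})
cfg.model.include_sat = False
cfg.model.include_nwp = False
print(OmegaConf.to_yaml(cfg))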

However, when I attempted to run the command using the experiment.sh script as follows:

cd scripts
python save_batches.py \
  +batch_output_dir="/mnt/disks/batches2/batches_v3.1" \
  +num_train_batches=50_000 \
  +num_val_batches=2_000

I encountered the following error:

KeyError: ".zmetadata\nThis exception is thrown by iter of OpenNWPIterDataPipe(zarr_path='/mnt/disks/nwp/UKV_intermediate_version_7.zarr')"

To get past this error, I removed the NWP path from gcp_configuration.yaml.

Afterward, I ran the training process using the command:

python run.py \
  datamodule=premade_batches \
  datamodule.batch_dir="/content/gdrive/MyDrive/batches2/batches_v0" \
  +trainer.val_check_interval=10_000 \
  trainer.log_every_n_steps=200 \
  callbacks.early_stopping.patience=20 \
  datamodule.batch_size=32 \
  trainer.accumulate_grad_batches=4 \
  model_name="pvnet+ResFC2+_slow_regx25_amsgrad_v4"

However, this time the training process didn't complete successfully in Colab with a GPU runtime.

Questions:

  1. Why does the training process not complete successfully when NWP and satellite image data are excluded and the model is trained only on GSP data?
  2. Could you please provide a more detailed explanation of the entire running process?
  3. As the PVNet model is quite complex, I am looking for a simplified way to train it exclusively on GSP data. Later, I plan to include NWP and satellite image data to train the full model.
  4. I would appreciate any assistance in resolving this issue and gaining a better understanding of the training process.
  5. Is there a difference between the Hugging Face satellite data and the data you use in PVNet2 (satellite_zarr_path in gcp_configuration.yaml)?

Thank you!

dfulu commented 1 year ago

Hi @AndranikUgujyan

You're pretty much along the right lines.

So the current process for training is to first presave batches using the save_batches.py script, as you have tried. This uses the ocf_datapipes library to create batches of data and then saves them out as dictionaries of PyTorch tensors. Each batch is saved to a separate file inside <batch_output_dir>/train or <batch_output_dir>/val.
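
For example, a minimal sketch of how one of those presaved batch files can be inspected, assuming each file is a torch-saved dictionary (the directory and file name below are illustrative):

import torch

# Load one presaved batch file (path and file name are illustrative).
batch = torch.load("/mnt/disks/batches2/batches_v3.1/train/000000.pt")

# Each batch is a dictionary of PyTorch tensors keyed by input type.
for key, value in batch.items():
    if torch.is_tensor(value):
        print(key, tuple(value.shape), value.dtype)
    else:
        print(key, type(value))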

Then we run the training process using the run.py script.


Unfortunately, you likely won't have access to the NWP data at our local path "/mnt/disks/nwp/UKV_intermediate_version_7.zarr". This data comes from the UK Met Office and we are not allowed to share it. Later on, you could try adding in NWP data from the ICON NWP model instead. We can share this, and it is being archived here on Hugging Face. However, there would likely be some work involved to process this dataset into a format usable by the model.
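
As a rough idea of the kind of preprocessing that would involve, a sketch of opening an NWP zarr archive with xarray to see what it contains before reshaping it for the model (the paths and variable names below are placeholders):

import xarray as xr

# Open the archive lazily and inspect its dimensions, coordinates and
# data variables (the path is a placeholder).
ds = xr.open_zarr("/path/to/icon_nwp.zarr")
print(ds)

# Whatever processing the model actually needs (variable selection,
# renaming, regridding, re-chunking) is not shown; as a trivial example,
# keep a subset of variables and write a new local zarr.
subset = ds[["t_2m", "clct"]]  # hypothetical ICON variable names
subset.to_zarr("/path/to/icon_nwp_processed.zarr", mode="w")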

It is the same satellite data on Hugging Face that we used to train the model. However, it would probably be easier to pull this data from where it is hosted as a Google public dataset here.
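
For instance, a hedged sketch of reading that satellite zarr directly from cloud storage with xarray (the gs:// path below is a placeholder for the public dataset location, and gcsfs must be installed for gs:// URLs to work):

import xarray as xr

# The gs:// path is a placeholder; substitute the public dataset location
# linked above. fsspec/gcsfs handle the remote access under the hood.
sat = xr.open_zarr("gs://<public-bucket>/<satellite-data>.zarr")
print(sat)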


In terms of your issues, you didn't mention whether you managed to create any batches using the save_batches.py script. Presuming you did, I think you might want something like this to train:

python run.py \
  datamodule=premade_batches \
  datamodule.batch_dir="<your batch dir>" \
  +trainer.val_check_interval=10_000 \
  trainer.log_every_n_steps=200 \
  callbacks.early_stopping.patience=20 \
  datamodule.batch_size=32 \
  datamodule.num_workers=<number of workers based on your number of CPUs> \
  logger.wandb.project="<your wandb project>" \
  logger.wandb.save_dir="<your wandb logging save dir>" \
  callbacks.model_checkpoint.dirpath="<your checkpoint directory>/${model_name}" \
  model_name="<your model name>"

You can see that you'll need to set up directories to store the model checkpoints and training logs. We have set this up to use wandb for logging, so you may want to sign up for a wandb account. For individual use they have a free plan.

This might still cause you some errors. Since we are using it for our own research, we haven't tested it outside our own environments.

dfulu commented 1 year ago

@AndranikUgujyan, sorry for the delay in answering this comment. I noticed you asked the same thing in our discussion in #37, so let's continue the conversation over there.