Hi @AndranikUgujyan
You're pretty much along the right lines.

The current process for training is first to presave batches using the `save_batches.py` script, as you have tried. This uses the `ocf_datapipes` library to create batches of data and then saves them out as dictionaries of pytorch tensors. Each batch is saved to a separate file inside `<batch_output_dir>/train` or `<batch_output_dir>/val`.
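As a quick sanity check, you can load one of the presaved batches back in and inspect it. This is a minimal sketch assuming the batches are written with `torch.save`; the filename below is illustrative, so check what `save_batches.py` actually wrote to your batch directory:

```python
import torch

# Load one presaved batch back in - each file holds a dictionary of
# pytorch tensors. The filename here is illustrative; check what
# save_batches.py actually wrote into <batch_output_dir>/train.
batch = torch.load("<batch_output_dir>/train/000000.pt")

# Print each key with its tensor shape to confirm the batch looks sensible
for key, value in batch.items():
    if isinstance(value, torch.Tensor):
        print(f"{key}: shape={tuple(value.shape)}, dtype={value.dtype}")
    else:
        print(f"{key}: {type(value).__name__}")
```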
Then we run the training process using the `run.py` script.
Unfortunately, you likely won't have access to the NWP data at our local path `/mnt/disks/nwp/UKV_intermediate_version_7.zarr`. This data comes from the UK Met Office, and we are not allowed to share it. Later on, you could maybe try adding in NWP data from the ICON NWP model. We can share this, and it is being archived on Hugging Face. However, there would likely be some work involved to process this dataset into a format usable by the model.
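If you do get to the point of trying the ICON data, something like this could be a starting point for pulling the archive down. The repo ID and file pattern here are placeholders rather than the real archive details, so check the Hugging Face page first:

```python
from huggingface_hub import snapshot_download

# Download part of the ICON NWP archive from Hugging Face.
# NOTE: repo_id and allow_patterns are placeholders/assumptions -
# substitute the real repository name and file layout from the archive page.
local_dir = snapshot_download(
    repo_id="<hf-org>/<icon-nwp-archive>",
    repo_type="dataset",
    allow_patterns=["*.zarr.zip"],  # assumed archive file format
)
print(f"Archive downloaded to: {local_dir}")
```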
The satellite data on Hugging Face is the same data we used to train the model. However, it would probably be easier to pull this data from where it is hosted as a Google public dataset.
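As a rough sketch of what pulling that satellite data might look like (the bucket path below is a placeholder; take the real one from the public dataset page, and note you'll need `gcsfs` installed for `gs://` paths to work):

```python
import xarray as xr

# Open the satellite zarr store directly from Google Cloud Storage.
# The path is a placeholder - substitute the real public-dataset path.
ds = xr.open_zarr("gs://<public-bucket>/<satellite-dataset>.zarr")
print(ds)  # inspect variables, coordinates, and time range
```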
In terms of your issues, you didn't mention whether you managed to create any batches using the `save_batches.py` script. Presuming you did, I think you might want something like this to train:
```bash
python run.py \
    datamodule=premade_batches \
    datamodule.batch_dir="<your batch dir>" \
    +trainer.val_check_interval=10_000 \
    trainer.log_every_n_steps=200 \
    callbacks.early_stopping.patience=20 \
    datamodule.batch_size=32 \
    datamodule.num_workers=<number of workers based on your number of CPUs> \
    logger.wandb.project="<your wandb project>" \
    logger.wandb.save_dir="<your wandb logging save dir>" \
    callbacks.model_checkpoint.dirpath="<your checkpoint directory>/${model_name}" \
    model_name="<your model name>"
```
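One rough heuristic for the `datamodule.num_workers` value is to use most of your CPU cores while leaving a couple free for the main training process, e.g.:

```python
import os

# Rough heuristic: use most cores for data loading, leaving a couple
# free for the main training process. Tune from here if loading is slow.
num_workers = max(1, (os.cpu_count() or 2) - 2)
print(num_workers)
```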
You can see that you'll need to set up directories to store the model checkpoints and training logs. We have set this up to use wandb for logging, so you may want to sign up for a wandb account. For individual use they have a free plan.
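If you haven't used wandb before, you can authenticate once from Python (or with the `wandb login` CLI) before starting training:

```python
import wandb

# Prompts for (and caches) your wandb API key so training runs can
# log metrics to your project.
wandb.login()
```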
You might still run into some errors: since we are using this for our own research, we haven't tested it outside our own environments.
@AndranikUgujyan, sorry for the delay in answering this comment. I noticed you asked the same thing in our discussion in #37, so let's continue the conversation over there.
Hi @dfulu, thank you for your response, I really appreciate it.
I am encountering an issue while attempting to train the PVNet model using only GSP data and excluding NWP and satellite images. I followed the steps below to modify the configuration and model settings:
However, when I attempted to run the command using the `experiment.sh` script as follows:
I encountered the following error:
To work around this, I removed the path from `gcp_configuration.yaml`.
Afterward, I ran the training process using the command:
However, this time the training process didn't complete successfully in Colab with a GPU runtime.
Questions:
Thank you!