PY4CAST

Weather forecasting with Deep Learning

This project, built using PyTorch and PyTorch-lightning, is designed to train a variety of Neural Network architectures (GNNs, CNNs, Vision Transformers, ...) on various weather forecasting datasets. This is a Work in Progress, intended to share ideas and design concepts with partners.

Developed at Météo-France by DSM/AI Lab and CNRM/GMAP/PREV.

Contributions are welcome (Issues, Pull Requests, ...).

This project is licensed under the APACHE 2.0 license.

[Animated forecasts: humidity and precipitation]

Acknowledgements

This project started as a fork of neural-lam, a project by Joel Oskarsson, see here. Many thanks to Joel for his work!

Table of contents

  1. Overview
  2. Features
    1. Neural network architectures
    2. Datasets
    3. Losses
    4. Plots
    5. Training strategies
    6. NamedTensors
  3. Installation
  4. Usage
    1. Docker and runai (MF)
    2. Conda or Micromamba
    3. Specifying your sbatch card
    4. Dataset configuration & simple training
    5. Training options
    6. Experiment tracking
    7. Inference
    8. Making animated plots comparing multiple models
  5. Contributing new features
    1. Adding a neural network architecture
    2. Adding a dataset
    3. Adding plots
  6. Design choices
  7. Unit tests
  8. Continuous Integration

Overview

See here for details on the available datasets, neural networks, training strategies, losses, and explanation of our NamedTensor.

Installation

Start by cloning the repository:

git clone https://github.com/meteofrance/py4cast.git
cd py4cast

Setting environment variables

In order to run the code on different machines, some environment variables can be set. You may add them to your .bashrc or export them just before launching an experiment:

export PY4CAST_ROOTDIR="/my/dir/"

You MUST export PY4CAST_ROOTDIR for py4cast to work. You can, for instance, build on the existing SCRATCH env var:

export PY4CAST_ROOTDIR=$SCRATCH/py4cast

If PY4CAST_ROOTDIR is not exported, py4cast defaults to using /scratch/shared/py4cast as its root directory, which leads to exceptions if this directory does not exist or is not writable.
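To avoid surprises, you can check the root directory before launching anything; a minimal sketch, assuming PY4CAST_ROOTDIR is the variable you exported above:

```sh
# Create the root directory if it does not exist yet, then check it is writable
mkdir -p "$PY4CAST_ROOTDIR"
test -w "$PY4CAST_ROOTDIR" && echo "PY4CAST_ROOTDIR is ready" || echo "PY4CAST_ROOTDIR is not writable"
```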

At Météo-France

When working at Météo-France, you can use either runai + Docker or Conda/Micromamba to set up a working environment. We recommend runai on the AI Lab cluster and Conda on our HPC.

See the runai repository for installation instructions.

For HPC, see the related doc (doc/install/install_MF.md) to get the right installation settings.

Install with conda

You can install a conda environment, including py4cast in editable mode, using:

conda env create --file env.yaml

From an existing conda environment, you can instead install py4cast manually in development mode using

conda install conda-build -n py4cast
conda develop .

or

pip install --editable .

Install with micromamba

Please install the environment using:

micromamba create -f env.yaml

From an existing micromamba environment, you can instead install py4cast manually in editable mode using

pip install --editable .

Build docker image

To build the docker image, use the oci-image-build.sh script. Météo-France users should export the INJECT_MF_CERT variable to use the Météo-France certificate:

export INJECT_MF_CERT=1

Then, build with the following command:

bash ./oci-image-build.sh --runtime docker

By default, the CUDA and pytorch versions are extracted from the env.yaml reference file. Nevertheless, for testing purposes, you can set PY4CAST_CUDA_VERSION and PY4CAST_TORCH_VERSION to override the default versions.
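For instance, a test build overriding both variables could look like the sketch below; the version numbers are placeholders, not recommendations:

```sh
# Hypothetical versions, for testing only; the env.yaml defaults are used otherwise
export PY4CAST_CUDA_VERSION=12.1
export PY4CAST_TORCH_VERSION=2.2.0
bash ./oci-image-build.sh --runtime docker
```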

Build podman image

As an alternative to docker, you can use podman to build the image.

To build the podman image, use the `oci-image-build.sh` script:

```sh
bash ./oci-image-build.sh --runtime podman
```

By default, the `CUDA` and `pytorch` versions are extracted from the `env.yaml` reference file. Nevertheless, for testing purposes, you can set **PY4CAST_CUDA_VERSION** and **PY4CAST_TORCH_VERSION** to override the default versions.

Convert to Singularity image

A previously built docker or podman image can be converted to the Singularity format.

To convert the previously built image to a Singularity container, you first have to save the image as a `tar` file:

```sh
docker save py4cast:your_tag -o py4cast-your_tag.tar
```

or with podman:

```sh
podman save --format oci-archive py4cast:your_tag -o py4cast-your_tag.tar
```

Then, build the singularity image with:

```sh
singularity build py4cast-your_tag.sif docker-archive://py4cast-your_tag.tar
```

Please make sure you have enough free disk space to store the .tar and .sif files.

Usage

Docker

From your py4cast source directory, to run an experiment using the docker image you need to mount in the container:

- The dataset path
- The py4cast sources
- The PY4CAST_ROOTDIR path

Here is an example of a command to run a "dev_mode" training of the HiLam model with the TITAN dataset, using all the GPUs:

docker run \
    --name hilam-titan \
    --rm \
    --gpus all \
    -v ./${HOME} \
    -v <path-to-datasets>/TITAN:/dataset/TITAN \
    -v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    -e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
    -e PY4CAST_TITAN_PATH=/dataset/TITAN \
    py4cast:<your_tag> \
    bash -c " \
        pip install -e . &&  \
        python bin/train.py \
            --dataset titan \
            --model hilam \
            --dataset_conf config/datasets/titan_full.json \
            --dev_mode \
            --no_log \
            --num_pred_steps_val_test 1 \
            --num_input_steps 1 \
    "

Podman

From your `py4cast` source directory, to run an experiment using the podman image you need to mount in the container:

- The dataset path
- The py4cast sources
- The PY4CAST_ROOTDIR path

Here is an example of a command to run a "dev_mode" training of the HiLam model with the TITAN dataset, using all the GPUs:

```sh
podman run \
    --name hilam-titan \
    --rm \
    --device nvidia.com/gpu=all \
    --ipc=host \
    --network=host \
    -v ./${HOME} \
    -v <path-to-datasets>/TITAN:/dataset/TITAN \
    -v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    -e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
    -e PY4CAST_TITAN_PATH=/dataset/TITAN \
    py4cast:<your_tag> \
    bash -c " \
        pip install -e . && \
        python bin/train.py \
            --dataset titan \
            --model hilam \
            --dataset_conf config/datasets/titan_full.json \
            --dev_mode \
            --no_log \
            --num_pred_steps_val_test 1 \
            --num_input_steps 1 \
    "
```

Singularity

From your `py4cast` source directory, to run an experiment using a singularity container you need to mount in the container:

- The dataset path
- The PY4CAST_ROOTDIR path

Here is an example of a command to run a "dev_mode" training of the HiLam model with the TITAN dataset:

```sh
PY4CAST_TITAN_PATH=/dataset/TITAN \
PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
singularity exec \
    --nv \
    --bind <path-to-datasets>/TITAN:/dataset/TITAN \
    --bind <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    py4cast-<your_tag>.sif \
    bash -c " \
        pip install -e . && \
        python bin/train.py \
            --dataset titan \
            --model hilam \
            --dataset_conf config/datasets/titan_full.json \
            --dev_mode \
            --no_log \
            --num_pred_steps_val_test 1 \
            --num_input_steps 1 \
    "
```

runai

For now this works only for internal Météo-France users.

`runai` commands must be issued at the root directory of the `py4cast` project:

1. Run an interactive training session

```bash
runai gpu_play 4
runai build
runai exec_gpu python bin/train.py --dataset titan --model hilam
```

2. Train using sbatch, single node, multiple GPUs

```bash
export RUNAI_GRES="gpu:v100:4"
runai sbatch python bin/train.py --dataset titan --model hilam
```

3. Train using sbatch, multiple nodes, multiple GPUs

Here we use 2 nodes with 4 GPUs each.

```bash
export RUNAI_SLURM_NNODES=2
export RUNAI_GRES="gpu:v100:4"
runai sbatch_multi_node python bin/train.py --dataset titan --model hilam
```

For the rest of the documentation, you must prepend each python command with `runai exec_gpu`.

Conda or Micromamba

Once your conda or micromamba environment is set up, a very simple training can be launched (on your current node):

python bin/train.py --dataset dummy --model halfunet --epochs 2

Example of script to launch on GPU

To do so, you will need to create a small shell script:

#!/usr/bin/bash
#SBATCH --partition=ndl
#SBATCH --nodes=1 # Specify the number of GPU nodes you require
#SBATCH --gres=gpu:1 # Specify the number of GPUs required per node
#SBATCH --time=05:00:00 # Specify your experiment time limit
#SBATCH --ntasks-per-node=1 # Specify the number of tasks per node. This should match the number of GPUs required per node

# Note that other variables may need to be set (depending on your machine). For example, you may need to set the number of CPUs or the memory used by your experiment.
# On the MF HPC, these are proportional to the number of GPUs required per node. This is not the case on other machines (e.g. the Météo-France AI Lab machine).

source ~/.bashrc  # Make sure all your environment variables are set
conda activate py4cast # Activate your environment (installed by micromamba or conda)
cd $PY4CAST_PATH # Go to py4cast (you can either add an environment variable or hard code the path here).
# Launch your favorite command.
srun python bin/train.py --model halfunet --dataset dummy --epochs 2

Then just launch this script using

sbatch my_tiny_script.sh

NB: you may have some trouble with SSL certificates (for cartopy). You may need to explicitly export the certificate:

 export SSL_CERT_FILE="/opt/softs/certificats/proxy1.pem"

with the proxy path depending on your machine.

Dataset configuration & simple training

As in neural-lam, before training you must first compute the mean and std of each feature.

To compute the stats of the Titan dataset:

python py4cast/datasets/titan/__init__.py

To train on a dataset with its default settings, just pass the name of the dataset (all lowercase):

python bin/train.py --dataset titan --model halfunet

You can override the dataset default configuration file:

python bin/train.py --dataset smeagol --model halfunet --dataset_conf config/smeagoldev.json

Details on available datasets.

Training options

  1. Configuring the neural network

To train on a dataset using a network with its default settings just pass the name of the architecture (all lowercase) as shown below:

python bin/train.py --dataset smeagol --model hilam

python bin/train.py --dataset smeagol --model halfunet

You can override some settings of the model using a json config file (here we increase the number of filters to 128 and use ghost modules):

python bin/train.py --dataset smeagol --model halfunet --model_conf config/halfunet128_ghost.json

Details on available neural networks.

  2. Changing the training strategy

You can choose a training strategy using the --strategy STRATEGY_NAME cli argument:

python bin/train.py --dataset smeagol --model halfunet --strategy diff_ar

Details on available training strategies.

  3. Other training options:

You can find more details about all the num_X_steps options here.
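As a quick illustration, two of these flags already appear in the container examples above; the step counts below are arbitrary:

```sh
# Use 2 past states as input and score 3 auto-regressive steps at validation/test time
python bin/train.py --dataset titan --model halfunet --num_input_steps 2 --num_pred_steps_val_test 3
```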

Experiment tracking

Tensorboard

We use Tensorboard to track the experiments. You can launch a Tensorboard server using the following command:

At Météo-France:

runai will handle port forwarding for you.

runai tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Elsewhere

tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Then you can access the tensorboard server at the following address: http://YOUR_SERVER_IP:YOUR_PORT/
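If you need to pick the port or reach the server from another machine, the standard Tensorboard flags apply; pointing --logdir at PY4CAST_ROOTDIR is an assumption based on the default log location:

```sh
# Serve on port 6006 and listen on all interfaces so the UI is reachable remotely
tensorboard --logdir $PY4CAST_ROOTDIR --port 6006 --bind_all
```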

MLFlow

Optionally, you can use MLFlow, in addition to Tensorboard, to track your experiment and log your model. To activate the MLFlow logger, simply add the --mlflow_log option to the bin/train.py command line.

Local usage

Without an MLFlow server, the logs are stored in your root path, at PY4CAST_ROOTDIR/logs/mlflow.
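To browse these local logs, one option (a suggestion, not a py4cast command) is to point the standard MLFlow UI at that directory:

```sh
# Serve the locally stored runs; the path mirrors the default location mentioned above
mlflow ui --backend-store-uri $PY4CAST_ROOTDIR/logs/mlflow
```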

With a MLFlow server

If you have an MLFlow server, you can configure your training environment to push the logs to the remote server. A set of environment variables is available to do that.

For example, you can export the following variables in your training environment:

export MLFLOW_TRACKING_URI=https://my.mlflow.server.com/
export MLFLOW_TRACKING_USERNAME=<your-mlflow-user>
export MLFLOW_TRACKING_PASSWORD=<your-mlflow-pwd>
export MLFLOW_EXPERIMENT_NAME=py4cast/unetrpp
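With those variables exported, a tracked run is then simply the usual training command plus the --mlflow_log flag described above:

```sh
# Logs and the model are pushed to the configured remote MLFlow server
python bin/train.py --dataset titan --model hilam --mlflow_log
```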

Inference

Inference is done by running the bin/inference.py script. This script will load a model and run it on a dataset using the training parameters (dataset config, timestep options, ...).

usage: python bin/inference.py [-h] [--model_path MODEL_PATH] [--dataset DATASET] [--dataset_conf DATASET_CONF]
                               [--infer_steps INFER_STEPS] [--date DATE] [--precision PRECISION] [--grib BOOL]
                               [--saving_conf SAVING_CONF]

options:
  -h, --help            show this help message and exit
  --model_path MODEL_PATH
                        Path to the model checkpoint
  --date DATE           Date of the sample to infer on. Format: YYYYMMDDHH
  --dataset DATASET     Name of the dataset to use (typically the same as was used for training)
  --dataset_conf DATASET_CONF
                        Name of the dataset config file (json, to change e.g. dates, leadtimes, etc.)
  --infer_steps INFER_STEPS
                        Number of auto-regressive steps/prediction steps during the inference
  --precision PRECISION
                        Floating point precision for the inference (default: 32)
  --grib BOOL           Whether the outputs should be saved as grib; requires a saving conf.
  --saving_conf SAVING_CONF
                        Name of the config file for write settings (json)

A simple example of inference is shown below:

 runai exec_gpu python bin/inference.py --model_path /scratch/shared/py4cast/logs/camp0/poesy/halfunet/sezn_run_dev_12 --date 2021061621 --dataset poesy_infer --infer_steps 2
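The --grib and --saving_conf options listed above work together when you want grib outputs; a hypothetical sketch (checkpoint path, dataset, date, and config file name are placeholders):

```sh
# Save inference outputs as grib files, driven by a write-settings json (hypothetical paths)
python bin/inference.py \
    --model_path $PY4CAST_ROOTDIR/logs/my_run/best.ckpt \
    --dataset titan \
    --date 2023061812 \
    --infer_steps 4 \
    --grib true \
    --saving_conf config/my_saving_conf.json
```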

Making animated plots comparing multiple models

You can compare multiple trained models on specific case studies and visualize the forecasts in animated plots with the bin/gif_comparison.py script. See the example GIFs at the beginning of this README.

Warnings:

Usage: gif_comparison.py [-h] --ckpt CKPT --date DATE [--num_pred_steps NUM_PRED_STEPS]

options:
  -h, --help            show this help message and exit
  --ckpt CKPT           Paths to the model checkpoint or AROME
  --date DATE           Date for inference. Format YYYYMMDDHH.
  --num_pred_steps NUM_PRED_STEPS
                        Number of auto-regressive steps/prediction steps.

example: python bin/gif_comparison.py --ckpt AROME --ckpt /.../logs/my_run/epoch=247.ckpt
                                      --date 2023061812 --num_pred_steps 10

Scoring and comparing models

The bin/test.py script will compute and save metrics on the validation set, for as many auto-regressive prediction steps as you want.

python bin/test.py PATH_TO_CHECKPOINT --num_pred_steps 24

Once you have executed the test.py script on all the models you want, you can compare them with bin/scores_comparison.py:

python bin/scores_comparison.py --ckpt PATH_TO_CKPT_0  --ckpt PATH_TO_CKPT_1

Warning: for now, bin/scores_comparison.py only works with models trained on the Titan dataset.

Adding features and contributing

This page explains how to:

- Add a neural network architecture
- Add a dataset
- Add plots

Design choices

The figure below illustrates the principal components of the Py4cast architecture.

[Figure: py4cast architecture]

Ideas for future improvements