PY4CAST

Weather forecasting with Deep Learning

This project, built using PyTorch and PyTorch-lightning, is designed to train a variety of Neural Network architectures (GNNs, CNNs, Vision Transformers, ...) on various weather forecasting datasets. This is a Work in Progress, intended to share ideas and design concepts with partners.

Developed at Météo-France by DSM/AI Lab and CNRM/GMAP/PREV.

Contributions are welcome (Issues, Pull Requests, ...).

This project is licensed under the APACHE 2.0 license.

[Animated forecasts: humidity and precipitation]

Acknowledgements

This project started as a fork of neural-lam, a project by Joel Oskarsson, see here. Many thanks to Joel for his work!

Table of contents

  1. Overview
  2. Features
    1. Neural network architectures
    2. Datasets
    3. Losses
    4. Plots
    5. Training strategies
    6. NamedTensors
  3. Installation
  4. Usage
    1. Docker and runai (MF)
    2. Conda or Micromamba
    3. Specifying your sbatch card
    4. Dataset configuration & simple training
    5. Training options
    6. Experiment tracking
    7. Inference
    8. Making animated plots comparing multiple models
  5. Contributing new features
    1. Adding a neural network architecture
    2. Adding a dataset
    3. Adding plots
  6. Design choices
  7. Unit tests
  8. Continuous Integration

Overview

See here for details on the available datasets, neural networks, training strategies, losses, and explanation of our NamedTensor.

Installation

Start by cloning the repository:

git clone https://github.com/meteofrance/py4cast.git
cd py4cast

Setting environment variables

In order to run the code on different machines, some environment variables can be set. You may add them to your .bashrc or export them just before launching an experiment:

export PY4CAST_ROOTDIR="/my/dir/"

You MUST export PY4CAST_ROOTDIR for py4cast to work. You can, for instance, build on the existing SCRATCH env var:

export PY4CAST_ROOTDIR=$SCRATCH/py4cast

If PY4CAST_ROOTDIR is not exported, py4cast defaults to using /scratch/shared/py4cast as its root directory, which leads to exceptions if this directory does not exist or is not writable.
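To avoid surprises, you can check the root directory before launching anything; a minimal sketch, assuming PY4CAST_ROOTDIR is the variable you exported above:

```sh
# Create the root directory if it does not exist yet, then check it is writable
mkdir -p "$PY4CAST_ROOTDIR"
test -w "$PY4CAST_ROOTDIR" && echo "PY4CAST_ROOTDIR is ready" || echo "PY4CAST_ROOTDIR is not writable"
```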

At Météo-France

When working at Météo-France, you can use either runai + Docker or Conda/Micromamba to set up a working environment. We recommend runai on the AI Lab cluster and Conda on our HPC.

See the runai repository for installation instructions.

For HPC, see the related doc (doc/install/install_MF.md) to get the right installation settings.

Install with conda

You can install a conda environment, including py4cast in editable mode, using:

conda env create --file env.yaml

From an existing conda environment, you can instead install py4cast manually in development mode using

conda install conda-build -n py4cast
conda develop .

or

pip install --editable .

Install with micromamba

Please install the environment using:

micromamba create -f env.yaml

From an existing micromamba environment, you can instead install py4cast manually in editable mode using

pip install --editable .

Build docker image

To build the docker image, use the oci-image-build.sh script. Météo-France users should export the INJECT_MF_CERT variable to use the Météo-France certificate:

export INJECT_MF_CERT=1

Then, build with the following command:

bash ./oci-image-build.sh --runtime docker

By default, the CUDA and pytorch versions are extracted from the env.yaml reference file. Nevertheless, for testing purposes, you can set PY4CAST_CUDA_VERSION and PY4CAST_TORCH_VERSION to override the default versions.
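For instance, a test build overriding both variables could look like the sketch below; the version numbers are placeholders, not recommendations:

```sh
# Hypothetical versions, for testing only; the env.yaml defaults are used otherwise
export PY4CAST_CUDA_VERSION=12.1
export PY4CAST_TORCH_VERSION=2.2.0
bash ./oci-image-build.sh --runtime docker
```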

Build podman image

As an alternative to docker, you can use podman to build the image.

To build the podman image, use the `oci-image-build.sh` script:

```sh
bash ./oci-image-build.sh --runtime podman
```

By default, the `CUDA` and `pytorch` versions are extracted from the `env.yaml` reference file. Nevertheless, for testing purposes, you can set **PY4CAST_CUDA_VERSION** and **PY4CAST_TORCH_VERSION** to override the default versions.

Convert to Singularity image

A previously built docker or podman image can be converted to the Singularity format.

To convert the previously built image to a Singularity container, you first have to save the image as a `tar` file:

```sh
docker save py4cast:your_tag -o py4cast-your_tag.tar
```

or with podman:

```sh
podman save --format oci-archive py4cast:your_tag -o py4cast-your_tag.tar
```

Then, build the singularity image with:

```sh
singularity build py4cast-your_tag.sif docker-archive://py4cast-your_tag.tar
```

Please make sure you have enough free disk space to store the .tar and .sif files.

Usage

Docker

From your py4cast source directory, to run an experiment using the docker image you need to mount in the container:

- The dataset path
- The py4cast sources
- The PY4CAST_ROOTDIR path

Here is an example of a command to run a "dev_mode" training of the HiLam model with the TITAN dataset, using all the GPUs:

docker run \
    --name hilam-titan \
    --rm \
    --gpus all \
    -v ./${HOME} \
    -v <path-to-datasets>/TITAN:/dataset/TITAN \
    -v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    -e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
    -e PY4CAST_TITAN_PATH=/dataset/TITAN \
    py4cast:<your_tag> \
    bash -c " \
        pip install -e . &&  \
        python bin/train.py \
            --dataset titan \
            --model hilam \
            --dataset_conf config/datasets/titan_full.json \
            --dev_mode \
            --no_log \
            --num_pred_steps_val_test 1 \
            --num_input_steps 1 \
    "

Podman

From your `py4cast` source directory, to run an experiment using the podman image you need to mount in the container:

- The dataset path
- The py4cast sources
- The PY4CAST_ROOTDIR path

Here is an example of a command to run a "dev_mode" training of the HiLam model with the TITAN dataset, using all the GPUs:

```sh
podman run \
    --name hilam-titan \
    --rm \
    --device nvidia.com/gpu=all \
    --ipc=host \
    --network=host \
    -v ./${HOME} \
    -v <path-to-datasets>/TITAN:/dataset/TITAN \
    -v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    -e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
    -e PY4CAST_TITAN_PATH=/dataset/TITAN \
    py4cast:<your_tag> \
    bash -c " \
        pip install -e . && \
        python bin/train.py \
            --dataset titan \
            --model hilam \
            --dataset_conf config/datasets/titan_full.json \
            --dev_mode \
            --no_log \
            --num_pred_steps_val_test 1 \
            --num_input_steps 1 \
    "
```

Singularity

From your `py4cast` source directory, to run an experiment using a singularity container you need to mount in the container:

- The dataset path
- The PY4CAST_ROOTDIR path

Here is an example of a command to run a "dev_mode" training of the HiLam model with the TITAN dataset:

```sh
PY4CAST_TITAN_PATH=/dataset/TITAN \
PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
singularity exec \
    --nv \
    --bind <path-to-datasets>/TITAN:/dataset/TITAN \
    --bind <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    py4cast-<your_tag>.sif \
    bash -c " \
        pip install -e . && \
        python bin/train.py \
            --dataset titan \
            --model hilam \
            --dataset_conf config/datasets/titan_full.json \
            --dev_mode \
            --no_log \
            --num_pred_steps_val_test 1 \
            --num_input_steps 1 \
    "
```

runai

For now this works only for internal Météo-France users.

`runai` commands must be issued at the root directory of the `py4cast` project:

1. Run an interactive training session

```bash
runai gpu_play 4
runai build
runai exec_gpu python bin/train.py --dataset titan --model hilam
```

2. Train using sbatch, single node, multiple GPUs

```bash
export RUNAI_GRES="gpu:v100:4"
runai sbatch python bin/train.py --dataset titan --model hilam
```

3. Train using sbatch, multiple nodes, multiple GPUs

Here we use 2 nodes with 4 GPUs each.

```bash
export RUNAI_SLURM_NNODES=2
export RUNAI_GRES="gpu:v100:4"
runai sbatch_multi_node python bin/train.py --dataset titan --model hilam
```

For the rest of the documentation, you must prepend each python command with `runai exec_gpu`.

Conda or Micromamba

Once your conda or micromamba environment is set up, a very simple training can be launched (on your current node):

python bin/train.py --dataset dummy --model halfunet --epochs 2

Example of script to launch on GPU

To do so, you will need to create a small shell script:

#!/usr/bin/bash
#SBATCH --partition=ndl
#SBATCH --nodes=1 # Specify the number of GPU nodes you require
#SBATCH --gres=gpu:1 # Specify the number of GPUs required per node
#SBATCH --time=05:00:00 # Specify your experiment time limit
#SBATCH --ntasks-per-node=1 # Specify the number of tasks per node. This should match the number of GPUs required per node

# Note that other variables may need to be set (depending on your machine). For example, you may need to set the number of CPUs or the memory used by your experiment.
# On the MF HPC, these are proportional to the number of GPUs required per node. This is not the case on other machines (e.g. the Météo-France AI Lab machine).

source ~/.bashrc  # Make sure all your environment variables are set
conda activate py4cast # Activate your environment (installed by micromamba or conda)
cd $PY4CAST_PATH # Go to py4cast (you can either add an environment variable or hard code the path here).
# Launch your favorite command.
srun python bin/train.py --model halfunet --dataset dummy --epochs 2

Then just launch this script using

sbatch my_tiny_script.sh

NB: you may have some trouble with SSL certificates (for cartopy). You may need to explicitly export the certificate:

 export SSL_CERT_FILE="/opt/softs/certificats/proxy1.pem"

with the proxy path depending on your machine.

Dataset configuration & simple training

As in neural-lam, before training you must first compute the mean and std of each feature.

To compute the stats of the Titan dataset:

python py4cast/datasets/titan/__init__.py

To train on a dataset with its default settings, just pass the name of the dataset (all lowercase):

python bin/train.py --dataset titan --model halfunet

You can override the dataset default configuration file:

python bin/train.py --dataset smeagol --model halfunet --dataset_conf config/smeagoldev.json

Details on available datasets.

Training options

  1. Configuring the neural network

To train on a dataset using a network with its default settings just pass the name of the architecture (all lowercase) as shown below:

python bin/train.py --dataset smeagol --model hilam

python bin/train.py --dataset smeagol --model halfunet

You can override some settings of the model using a json config file (here we increase the number of filters to 128 and use ghost modules):

python bin/train.py --dataset smeagol --model halfunet --model_conf config/halfunet128_ghost.json

Details on available neural networks.

  2. Changing the training strategy

You can choose a training strategy using the --strategy STRATEGY_NAME cli argument:

python bin/train.py --dataset smeagol --model halfunet --strategy diff_ar

Details on available training strategies.

  3. Other training options:

You can find more details about all the num_X_steps options here.
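As a quick illustration, two of these flags already appear in the container examples above; the step counts below are arbitrary:

```sh
# Use 2 past states as input and score 3 auto-regressive steps at validation/test time
python bin/train.py --dataset titan --model halfunet --num_input_steps 2 --num_pred_steps_val_test 3
```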

Experiment tracking

Tensorboard

We use Tensorboard to track the experiments. You can launch a Tensorboard server using the following command:

At Météo-France:

runai will handle port forwarding for you.

runai tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Elsewhere

tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Then you can access the tensorboard server at the following address: http://YOUR_SERVER_IP:YOUR_PORT/
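If you need to pick the port or reach the server from another machine, the standard Tensorboard flags apply; pointing --logdir at PY4CAST_ROOTDIR is an assumption based on the default log location:

```sh
# Serve on port 6006 and listen on all interfaces so the UI is reachable remotely
tensorboard --logdir $PY4CAST_ROOTDIR --port 6006 --bind_all
```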

MLFlow

Optionally, you can use MLFlow, in addition to Tensorboard, to track your experiment and log your model. To activate the MLFlow logger, simply add the --mlflow_log option to the bin/train.py command line.

Local usage

Without an MLFlow server, the logs are stored in your root path, at PY4CAST_ROOTDIR/logs/mlflow.
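To browse these local logs, one option (a suggestion, not a py4cast command) is to point the standard MLFlow UI at that directory:

```sh
# Serve the locally stored runs; the path mirrors the default location mentioned above
mlflow ui --backend-store-uri $PY4CAST_ROOTDIR/logs/mlflow
```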

With a MLFlow server

If you have an MLFlow server, you can configure your training environment to push the logs to the remote server. A set of environment variables is available to do that.

For example, you can export the following variables in your training environment:

export MLFLOW_TRACKING_URI=https://my.mlflow.server.com/
export MLFLOW_TRACKING_USERNAME=<your-mlflow-user>
export MLFLOW_TRACKING_PASSWORD=<your-mlflow-pwd>
export MLFLOW_EXPERIMENT_NAME=py4cast/unetrpp
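With those variables exported, a tracked run is then simply the usual training command plus the --mlflow_log flag described above:

```sh
# Logs and the model are pushed to the configured remote MLFlow server
python bin/train.py --dataset titan --model hilam --mlflow_log
```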

Inference

Inference is done by running the bin/inference.py script. This script will load a model and run it on a dataset using the training parameters (dataset config, timestep options, ...).

usage: python bin/inference.py [-h] [--model_path MODEL_PATH] [--dataset DATASET] [--dataset_conf DATASET_CONF]
                               [--infer_steps INFER_STEPS] [--date DATE] [--precision PRECISION] [--grib BOOL]
                               [--saving_conf SAVING_CONF]

options:
  -h, --help            show this help message and exit
  --model_path MODEL_PATH
                        Path to the model checkpoint
  --date DATE           Date of the sample to infer on. Format: YYYYMMDDHH
  --dataset DATASET     Name of the dataset to use (typically the same as was used for training)
  --dataset_conf DATASET_CONF
                        Name of the dataset config file (json, to change e.g. dates, leadtimes, etc.)
  --infer_steps INFER_STEPS
                        Number of auto-regressive steps/prediction steps during the inference
  --precision PRECISION
                        Floating point precision for the inference (default: 32)
  --grib BOOL           Whether the outputs should be saved as grib; requires a saving conf.
  --saving_conf SAVING_CONF
                        Name of the config file for write settings (json)

A simple example of inference is shown below:

 runai exec_gpu python bin/inference.py --model_path /scratch/shared/py4cast/logs/camp0/poesy/halfunet/sezn_run_dev_12 --date 2021061621 --dataset poesy_infer --infer_steps 2
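The --grib and --saving_conf options listed above work together when you want grib outputs; a hypothetical sketch (checkpoint path, dataset, date, and config file name are placeholders):

```sh
# Save inference outputs as grib files, driven by a write-settings json (hypothetical paths)
python bin/inference.py \
    --model_path $PY4CAST_ROOTDIR/logs/my_run/best.ckpt \
    --dataset titan \
    --date 2023061812 \
    --infer_steps 4 \
    --grib true \
    --saving_conf config/my_saving_conf.json
```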

Making animated plots comparing multiple models

You can compare multiple trained models on specific case studies and visualize the forecasts in animated plots with the bin/gif_comparison.py script. See the example GIFs at the beginning of this README.

Warnings:

Usage: gif_comparison.py [-h] --ckpt CKPT --date DATE [--num_pred_steps NUM_PRED_STEPS]

options:
  -h, --help            show this help message and exit
  --ckpt CKPT           Paths to the model checkpoint or AROME
  --date DATE           Date for inference. Format YYYYMMDDHH.
  --num_pred_steps NUM_PRED_STEPS
                        Number of auto-regressive steps/prediction steps.

example: python bin/gif_comparison.py --ckpt AROME --ckpt /.../logs/my_run/epoch=247.ckpt
                                      --date 2023061812 --num_pred_steps 10

Scoring and comparing models

The bin/test.py script will compute and save metrics on the validation set, for as many auto-regressive prediction steps as you want.

python bin/test.py PATH_TO_CHECKPOINT --num_pred_steps 24

Once you have executed the test.py script on all the models you want, you can compare them with bin/scores_comparison.py:

python bin/scores_comparison.py --ckpt PATH_TO_CKPT_0  --ckpt PATH_TO_CKPT_1

Warning: for now, bin/scores_comparison.py only works with models trained on the Titan dataset.

Adding features and contributing

This page explains how to:

- Add a neural network architecture
- Add a dataset
- Add plots

Design choices

The figure below illustrates the principal components of the Py4cast architecture.

[Figure: py4cast architecture]

Ideas for future improvements