Weather forecasting with Deep Learning
PY4CAST

This project, built using PyTorch and PyTorch Lightning, is designed to train a variety of neural network architectures (GNNs, CNNs, Vision Transformers, ...) on various weather forecasting datasets. It is a work in progress, intended to share ideas and design concepts with partners.

Developed at Météo-France by DSM/AI Lab and CNRM/GMAP/PREV.

Contributions are welcome (Issues, Pull Requests, ...).

This project is licensed under the Apache 2.0 license.

[Animated GIFs: forecast humidity, forecast precipitation]

Acknowledgements

This project started as a fork of neural-lam, a project by Joel Oskarsson. Many thanks to Joel for his work!

Table of contents

  1. Overview
  2. Features
    1. Neural network architectures
    2. Datasets
    3. Losses
    4. Plots
    5. Training strategies
    6. NamedTensors
  3. Installation
  4. Usage
    1. Docker and runai (MF)
    2. Conda or Micromamba
    3. Specifying your sbatch card
    4. Dataset configuration & simple training
    5. Training options
    6. Experiment tracking
    7. Inference
    8. Making animated plots comparing multiple models
  5. Contributing new features
    1. Adding a neural network architecture
    2. Adding a dataset
    3. Adding plots
  6. Design choices
  7. Unit tests
  8. Continuous Integration

Overview

See here for details on the available datasets, neural networks, training strategies, losses, and explanation of our NamedTensor.

Installation

Start by cloning the repository:

git clone https://github.com/meteofrance/py4cast.git
cd py4cast

Setting environment variables

In order to be able to run the code on different machines, some environment variables can be set. You may add them in your .bashrc or modify them just before launching an experiment.

Set them with export, for example:

export PY4CAST_ROOTDIR="/my/dir/"

You MUST export PY4CAST_ROOTDIR to make py4cast work. For instance, you can reuse the existing SCRATCH environment variable:

export PY4CAST_ROOTDIR=$SCRATCH/py4cast

If PY4CAST_ROOTDIR is not exported, py4cast defaults to /scratch/shared/py4cast as its root directory, which leads to exceptions if this directory does not exist or is not writable.
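The fallback behaviour described above can be illustrated with a small sketch. This is not the py4cast implementation: the helper names `resolve_rootdir` and `check_rootdir` are hypothetical, and only the variable name and the default path come from the text above. It may be useful as a pre-flight check before launching a long job.

```python
import os
from pathlib import Path

# Default path quoted in this README; the helpers below are illustrative only.
DEFAULT_ROOTDIR = "/scratch/shared/py4cast"


def resolve_rootdir(env=None):
    """Return the directory named by PY4CAST_ROOTDIR, or the default."""
    env = os.environ if env is None else env
    return Path(env.get("PY4CAST_ROOTDIR", DEFAULT_ROOTDIR))


def check_rootdir(path):
    """Fail early, with an explicit message, if the directory is unusable."""
    if not path.is_dir():
        raise FileNotFoundError(f"{path} does not exist; export PY4CAST_ROOTDIR")
    if not os.access(path, os.W_OK):
        raise PermissionError(f"{path} is not writable; export PY4CAST_ROOTDIR")
    return path
```

Calling `check_rootdir(resolve_rootdir())` before training would surface a missing or read-only directory immediately instead of partway through an experiment.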

At Météo-France

When working at Météo-France, you can use either runai + Docker or Conda/Micromamba to set up a working environment. We recommend runai on the AI Lab cluster and Conda on our HPC.

See the runai repository for installation instructions.

Install with conda

You can install a conda environment, including py4cast in editable mode, using

conda env create --file env_conda.yaml

From an existing conda environment, you can manually install py4cast in development mode using

conda install conda-build -n py4cast
conda develop .

or

pip install --editable .

Install with micromamba

Please install the environment using:

micromamba create -f env.yaml

From an existing micromamba environment, you can manually install py4cast in editable mode using

pip install --editable .

Usage

Docker and runai

For now this works only for internal Météo-France users.

`runai` commands must be issued at the root directory of the `py4cast` project:

1. Run an interactive training session

```bash
runai gpu_play 4
runai build
runai exec_gpu python bin/train.py --dataset titan --model hilam
```

2. Train using sbatch single node multi-GPUs

```bash
export RUNAI_GRES="gpu:v100:4"
runai sbatch python bin/train.py --dataset titan --model hilam
```

3. Train using sbatch multi nodes multi GPUs

Here we use 2 nodes with 4 GPUs each.

```bash
export RUNAI_SLURM_NNODES=2
export RUNAI_GRES="gpu:v100:4"
runai sbatch_multi_node python bin/train.py --dataset titan --model hilam
```

For the rest of the documentation, you must prepend each python command with `runai exec_gpu`.

Conda or Micromamba

Once your micromamba environment is set up, you can launch a very simple training run on your current node:

python bin/train.py  --dataset dummy --model halfunet --epochs 2

Example of a script to launch on GPU

To do so, you will need to create a small shell script.

#!/usr/bin/bash
#SBATCH --partition=ndl
#SBATCH --nodes=1 # Number of GPU nodes required
#SBATCH --gres=gpu:1 # Number of GPUs required per node
#SBATCH --time=05:00:00 # Time limit for your experiment
#SBATCH --ntasks-per-node=1 # Number of tasks per node. This should match the number of GPUs required per node

# Other variables may need to be set, depending on your machine, for example the number of CPUs or the memory used by your experiment.
# On the MF HPC, these are proportional to the number of GPUs required per node. This is not the case on other machines (e.g. the Météo-France AI Lab machine).

source ~/.bashrc  # Make sure all your environment variables are set
conda activate py4cast # Activate your environment (installed by micromamba or conda)
cd "$PY4CAST_PATH" # Go to the py4cast directory (either define this environment variable or hard-code the path here)
# Launch your favorite command.
srun python bin/train.py --model halfunet --dataset dummy --epochs 2

Then just launch this script using

sbatch my_tiny_script.sh

NB: You may have some trouble with SSL certificates (for cartopy). You may need to explicitly export the certificate as:

export SSL_CERT_FILE="/opt/softs/certificats/proxy1.pem"

with the proxy path depending on your machine.

Dataset configuration & simple training

As in neural-lam, before training you must first compute the mean and std of each feature.

To compute the stats of the Titan dataset:

python py4cast/datasets/titan/__init__.py
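To make the statistics step concrete, here is a minimal sketch of what "mean and std of each feature" means and how such statistics are typically used to standardize inputs. It is not the py4cast code: the function names are hypothetical, samples are plain lists rather than tensors, and the population standard deviation is an assumption.

```python
import math


def feature_stats(samples):
    """Per-feature mean and population std over a list of equally sized samples."""
    n = len(samples)
    n_features = len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(n_features)]
    stds = [
        math.sqrt(sum((s[j] - means[j]) ** 2 for s in samples) / n)
        for j in range(n_features)
    ]
    return means, stds


def standardize(sample, means, stds, eps=1e-8):
    """Shift and scale each feature; eps guards against a zero std."""
    return [(x - m) / (s + eps) for x, m, s in zip(sample, means, stds)]


# Two samples with two features each (e.g. temperature and humidity).
means, stds = feature_stats([[1.0, 10.0], [3.0, 30.0]])
normalized = standardize([3.0, 30.0], means, stds)
```

Computing the statistics once, ahead of training, lets every feature reach the network on a comparable scale regardless of its physical unit.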

To train on a dataset with its default settings, just pass the name of the dataset (all lowercase):

python bin/train.py --dataset titan --model halfunet

You can override the dataset default configuration file:

python bin/train.py --dataset smeagol --model halfunet --dataset_conf config/smeagoldev.json

Details on available datasets.

Training options

  1. Configuring the neural network

To train on a dataset using a network with its default settings just pass the name of the architecture (all lowercase) as shown below:

python bin/train.py --dataset smeagol --model hilam

python bin/train.py --dataset smeagol --model halfunet

You can override some settings of the model using a JSON config file (here we increase the number of filters to 128 and use ghost modules):

python bin/train.py --dataset smeagol --model halfunet --model_conf config/halfunet128_ghost.json
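The override mechanism can be pictured as default settings merged with the keys found in the JSON file. The sketch below illustrates that pattern only: the class `HalfUNetSettings` and the field names `num_filters` and `use_ghost` are hypothetical and may not match the keys used in `config/halfunet128_ghost.json`.

```python
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HalfUNetSettings:
    # Hypothetical field names and defaults, for illustration only.
    num_filters: int = 64
    use_ghost: bool = False


def load_settings(json_text):
    """Start from the defaults and apply only the keys present in the JSON."""
    overrides = json.loads(json_text)
    unknown = set(overrides) - set(HalfUNetSettings.__dataclass_fields__)
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    return replace(HalfUNetSettings(), **overrides)


settings = load_settings('{"num_filters": 128, "use_ghost": true}')
```

Rejecting unknown keys is a design choice in this sketch: a typo in the config file fails loudly instead of silently training with the defaults.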

Details on available neural networks.

  2. Changing the training strategy

You can choose a training strategy using the --strategy STRATEGY_NAME cli argument:

python bin/train.py --dataset smeagol --model halfunet --strategy diff_ar

Details on available training strategies.

Tracking experiment

We use TensorBoard to track the experiments. You can launch a TensorBoard server using the following command:

At Météo-France:

runai will handle port forwarding for you.

runai tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Elsewhere

tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Then you can access the tensorboard server at the following address: http://YOUR_SERVER_IP:YOUR_PORT/

  3. Other training options:

You can find more details about all the num_X_steps options here.

Inference

Inference is done by running the bin/inference.py script. This script will load a model and run it on a dataset using the training parameters (dataset config, timestep options, ...).

usage: py4cast Inference script [-h] [--model_path MODEL_PATH] [--dataset DATASET] [--infer_steps INFER_STEPS] [--date DATE]

options:
  -h, --help            show this help message and exit
  --model_path MODEL_PATH
                        Path to the model checkpoint
  --date DATE
                        Date of the sample to infer on. Format:YYYYMMDDHH
  --dataset DATASET
                        Name of the dataset config file to use
  --infer_steps INFER_STEPS
                        Number of auto-regressive steps/prediction steps during the inference
  --precision PRECISION
                        floating point precision for the inference (default: 32)
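The `--date` option expects a compact YYYYMMDDHH string. As a sketch of how such a value can be validated and converted with the standard library (the helper name `parse_run_date` is hypothetical and not part of py4cast):

```python
from datetime import datetime


def parse_run_date(text):
    """Convert a YYYYMMDDHH string into a datetime, rejecting other shapes."""
    if len(text) != 10 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"Expected YYYYMMDDHH, got {text!r}")
    return datetime.strptime(text, "%Y%m%d%H")


run_date = parse_run_date("2021061621")
```

The explicit length check matters because `strptime` alone would accept some shorter strings, such as a date with a single-digit hour.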

A simple example of inference is shown below:

 runai exec_gpu python bin/inference.py --model_path /scratch/shared/py4cast/logs/camp0/poesy/halfunet/sezn_run_dev_30 --date 2021061621 --dataset poesy_infer --infer_steps 2
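In the command above, `--infer_steps 2` requests two auto-regressive steps: each prediction is fed back as the input of the next step. The toy sketch below shows that loop in isolation; `step_fn` is a hypothetical stand-in for a trained model, and the real script also handles data loading, forcings, and normalization.

```python
def rollout(step_fn, initial_state, num_steps):
    """Apply step_fn repeatedly, feeding each output back as the next input."""
    predictions = []
    state = initial_state
    for _ in range(num_steps):
        state = step_fn(state)
        predictions.append(state)
    return predictions


# Toy "model" that doubles its input at every step.
forecast = rollout(lambda x: 2 * x, 1, 3)
```

Because each step consumes the previous prediction, errors can accumulate over long rollouts, which is why the number of steps is an explicit option.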

Making animated plots comparing multiple models

You can compare multiple trained models on specific case studies and visualize the forecasts on animated plots with the bin/gif_comparison.py script. See the example GIFs at the beginning of the README.

Warnings:

Usage: gif_comparison.py [-h] --ckpt CKPT --date DATE [--num_pred_steps NUM_PRED_STEPS]

options:
  -h, --help            show this help message and exit
  --ckpt CKPT           Paths to the model checkpoint or AROME
  --date DATE           Date for inference. Format YYYYMMDDHH.
  --num_pred_steps NUM_PRED_STEPS
                        Number of auto-regressive steps/prediction steps.

example: python bin/gif_comparison.py --ckpt AROME --ckpt /.../logs/my_run/epoch=247.ckpt
                                      --date 2023061812 --num_pred_steps 10
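The example passes `--ckpt` twice, once per model to compare. One common way to declare such a repeatable flag with `argparse` is `action="append"`. The sketch below is an assumption about how this could be done, not the actual parser of bin/gif_comparison.py, and the default for `--num_pred_steps` is hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="gif_comparison.py")
    # action="append" collects every occurrence of --ckpt into a list.
    parser.add_argument("--ckpt", action="append", required=True,
                        help="Paths to the model checkpoint or AROME")
    parser.add_argument("--date", required=True,
                        help="Date for inference. Format YYYYMMDDHH.")
    # The default value below is a placeholder chosen for this sketch.
    parser.add_argument("--num_pred_steps", type=int, default=1,
                        help="Number of auto-regressive steps/prediction steps.")
    return parser


args = build_parser().parse_args(
    ["--ckpt", "AROME", "--ckpt", "/tmp/my_run.ckpt",
     "--date", "2023061812", "--num_pred_steps", "10"]
)
```

With this declaration, `args.ckpt` holds the models in the order they were given on the command line.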

Adding features and contributing

This page explains how to:

Design choices

The figure below illustrates the principal components of the Py4cast architecture.

[Figure: py4cast architecture diagram]

Ideas for future improvements