mir-group / allegro

Allegro is an open-source code for building highly scalable and accurate equivariant deep learning interatomic potentials
https://www.nature.com/articles/s41467-023-36329-y
MIT License

Very Slow Update After the 2nd Batch 1st Epoch #64

potus28 commented 11 months ago

Hello MIR group,

I'm using Allegro, as well as NequIP and FLARE, to build MLIPs for modeling condensed-phase systems and heterogeneous catalysis, and I'm running into some difficulty with Allegro. On my laptop I can build smaller Allegro models and training proceeds as expected. However, for larger models that I train on Perlmutter, after the second batch of the first epoch the third batch takes a long time to process, and the message copied below is printed. Once the message appears, training continues as expected. Have you seen this issue before, and if so, is there a way to avoid the long delay at the start of training? I've copied the warning message, my Allegro config file, and my SLURM script for Perlmutter below. The SLURM script and config file are part of a hyperparameter scan, and every hyperparameter combination I have tried so far shows this issue. Any help would be much appreciated. Thanks!

Sincerely, Woody

Message that appears during training:

# Epoch batch         loss       loss_f  loss_stress       loss_e        f_mae       f_rmse     Ar_f_mae  psavg_f_mae    Ar_f_rmse psavg_f_rmse        e_mae      e/N_mae   stress_mae  stress_rmse
      0     1        0.951        0.949     1.31e-05      0.00122        0.106        0.203        0.106        0.106        0.203        0.203         1.39      0.00546     0.000341     0.000754
      0     2          0.9        0.899     4.69e-06     0.000544        0.101        0.197        0.101        0.101        0.197        0.197        0.414      0.00385     0.000281     0.000451
/global/homes/w/wnw36/.conda/envs/nequip/lib/python3.10/site-packages/torch/autograd/__init__.py:276: UserWarning: operator() profile_node %884 : int[] = prim::profile_ivalue(%882)
 does not have profile information (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      0     3         1.21         1.21     1.24e-05     0.000595        0.114        0.229        0.114        0.114        0.229        0.229        0.652      0.00382     0.000324     0.000732

SLURM script on Perlmutter:

#!/bin/bash
#SBATCH --job-name=nequip
#SBATCH --output=nequip.o%j
#SBATCH --error=nequip.e%j
#SBATCH --nodes=1
#SBATCH --time=24:00:00
#SBATCH --constraint=gpu
#SBATCH --qos=regular
#SBATCH --exclusive

module load python
conda activate nequip

mkdir -p outputs
for rcut in 4.0 6.0; do
    for learning in 0.001 0.005; do
        for lmax in 4 5; do 
            for nfeatures in 32 64; do
                for nlayers in 4; do
                    file=gridsearch-$rcut-$learning-$lmax-$nfeatures-$nlayers.yaml
                    sed -e "s/rcrcrc/$rcut/g" -e "s/lmaxlmaxlmax/$lmax/g" -e "s/lratelratelrate/$learning/g" -e "s/nfeatnfeatnfeat/$nfeatures/g" -e "s/nlayernlayernlayer/$nlayers/g" template.yaml > $file
                    nequip-train $file > outputs/$rcut-$learning-$lmax-$nfeatures-$nlayers.log
                done
            done
        done
    done
done

Allegro config file:

run_name: 4.0-0.001-4-32-4-4

seed: 123456

dataset_seed: 123456

append: true

default_dtype: float32

model_builders:
 - allegro.model.Allegro
 - PerSpeciesRescale
 - StressForceOutput
 - RescaleEnergyEtc

r_max: 4.0

avg_num_neighbors: auto

BesselBasis_trainable: true

PolynomialCutoff_p: 6  

l_max: 4

parity: o3_full  

num_layers: 4

env_embed_multiplicity: 32

embed_initial_edge: true

two_body_latent_mlp_latent_dimensions: [128, 256, 512, 1024]
two_body_latent_mlp_nonlinearity: silu
two_body_latent_mlp_initialization: uniform

latent_mlp_latent_dimensions: [1024, 1024, 1024]

latent_mlp_nonlinearity: silu
latent_mlp_initialization: uniform
latent_resnet: true

env_embed_mlp_latent_dimensions: []

env_embed_mlp_nonlinearity: null

env_embed_mlp_initialization: uniform

edge_eng_mlp_latent_dimensions: [128]

edge_eng_mlp_nonlinearity: null

edge_eng_mlp_initialization: uniform

dataset: ase
dataset_file_name: ../trajectory/traj.xyz
ase_args:
  format: extxyz
chemical_symbol_to_type:
  Ar: 0
wandb: false
wandb_project: Ar
verbose: debug
n_train: 1300

n_val: 100

batch_size: 5
validation_batch_size: 10

max_epochs: 10
learning_rate: 0.001

train_val_split: random

shuffle: true

metrics_key: validation_loss
use_ema: true
ema_decay: 0.99

ema_use_num_updates: true

loss_coeffs:
  forces: 1.
  stress: 1.
  total_energy:          
    - 1.
    - PerAtomMSELoss

optimizer_name: Adam
optimizer_params:
  amsgrad: false
  betas: !!python/tuple
  - 0.9
  - 0.999
  eps: 1.0e-08
  weight_decay: 0.

lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 50
lr_scheduler_factor: 0.5
early_stopping_upper_bounds:
  cumulative_wall: 604800.

early_stopping_lower_bounds:
  LR: 1.0e-5

early_stopping_patiences:
  validation_loss: 100

metrics_components:
  - - forces                              
    - mae                                
  - - forces
    - rmse
  - - forces
    - mae
    - PerSpecies: True                    
      report_per_component: False          
  - - forces
    - rmse
    - PerSpecies: True
      report_per_component: False
  - - total_energy
    - mae
  - - total_energy
    - mae
    - PerAtom: True                  
  - - stress
    - mae
  - - stress
    - rmse

Linux-cpp-lisp commented 11 months ago

Hi @potus28 ,

Thanks for your interest in our code!

If training continues at a reasonable speed afterwards, this is expected behavior: the TorchScript JIT profiles the first few calls (3 warmup calls) and then compiles optimized kernels, which causes a one-time pause. On the other hand, if you observe sustained degradation in performance, please see the discussion at https://github.com/mir-group/nequip/discussions/311 and report the relevant details there.
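
For readers unfamiliar with this warmup behavior, here is a minimal, self-contained sketch (plain PyTorch with TorchScript, not part of the nequip or allegro codebases; the function name toy_model is invented for illustration) showing that the first few calls to a scripted function run under the profiling executor, the call on which the graph is optimized can be noticeably slower, and steady-state calls afterwards are fast:

# Illustrative sketch only (assumes a recent PyTorch with the TorchScript
# profiling executor); not the Allegro model, just a few fusible elementwise ops.
import time
import torch

@torch.jit.script
def toy_model(x: torch.Tensor) -> torch.Tensor:
    # Elementwise ops that the JIT can profile and then fuse after warmup.
    return torch.sin(x) * torch.sigmoid(x) + x.pow(2)

x = torch.randn(4096, 128)

for step in range(1, 7):
    t0 = time.perf_counter()
    toy_model(x)
    print(f"call {step}: {time.perf_counter() - t0:.4f} s")
# Typical pattern: the first couple of calls are profiled, the call where the
# graph is optimized takes longer, and later calls are fast -- analogous to
# the pause seen around batch 3 of the first epoch here.

The exact number of warmup calls and the size of the one-time pause depend on the PyTorch version and fuser backend, so the timings from this sketch are only indicative.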

potus28 commented 9 months ago

Hi @Linux-cpp-lisp, thanks! Yes, training continues at a reasonable speed after this point. Thanks again, and have a great day!