[Open] samueldyoung29ctr opened this issue 3 weeks ago
Hi @samueldyoung29ctr ,
Hm, odd. If you start a simulation in LAMMPS from exactly a frame that shows up in your training data, or one that shows a stress tensor in nequip-evaluate, do you get the expected result at least on that first frame?
All of the starting geometries I launch from LAMMPS appear to show -nan pressure, even on the first step. I tried printing the virial tensor elements directly as they come from the torch model output, just after they are passed to LAMMPS. All virial components are -nan in the Kokkos-accelerated version, leading to the -nan pressure.
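(Editor's aside, not from the original thread: once the per-atom virial components are dumped to an array, a quick numpy check pinpoints whether, and where, NaNs first appear before LAMMPS ever accumulates a pressure. A minimal sketch; the function name and the (N, 6) layout are my own illustration:)

```python
import numpy as np

def first_nan_row(virials: np.ndarray):
    """Return the index of the first row containing a NaN, or None.

    `virials` is an (N, 6) array of per-atom virial components
    (xx, yy, zz, xy, xz, yz), e.g. parsed from a debug dump.
    """
    bad = np.isnan(virials).any(axis=1)
    return int(np.argmax(bad)) if bad.any() else None

# Example with a deliberately corrupted second row:
v = np.zeros((3, 6))
v[1, 3] = np.nan
print(first_nan_row(v))                # -> 1
print(first_nan_row(np.ones((2, 6))))  # -> None
```

If every row is NaN from step 0, the problem is upstream of the dynamics (model I/O or the pair style's unpacking), not an unstable trajectory.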
Looks like the workaround is to use the non-Kokkos pair_allegro for now. I am building and linking against cxx11 libtorch 2.0.0-cpuonly, as this is what NERSC's build uses. My build of LAMMPS is CPU-only due to resource constraints on the cluster I use. I do see in pair_allegro.cpp that there is an additional compute_custom_tensor packed into the model input that doesn't exist in the Kokkos version; perhaps this has something to do with enabling virial model outputs? Happy to do some more testing if you'd like.
My build of LAMMPS is CPU-only due to resource constraints on the cluster I use.
Aha. I don't think we've ever actually tested Kokkos pair_allegro on CPU, nor am I sure we'd expect it to have any benefits over the plain OpenMP pair_allegro on CPU. @anjohan, thoughts?
Still, I guess we would have expected it to work...
compute_custom_tensor should not be relevant for virials.
Also, just to clarify, we should be training Allegro models on stresses in units of energy / length^3, right? E.g., for LAMMPS metal units, we should train Allegro on stresses in units of eV/ang^3, not units of bar?
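(Editor's note, not from the maintainers' reply: stress is dimensionally energy per volume, so eV/Å^3 is the natural match for LAMMPS metal units, and the conversion to bar is a fixed factor. A quick check, using the CODATA value 1 eV = 1.602176634e-19 J:)

```python
# Conversion between eV/A^3 and bar.
EV = 1.602176634e-19   # J per eV
ANGSTROM3 = 1e-30      # m^3 per A^3
PA_PER_BAR = 1e5

EV_PER_A3_IN_BAR = EV / ANGSTROM3 / PA_PER_BAR  # ~1.602e6 bar

# Rough sanity check on a dataset: stresses with magnitudes around
# 1e5-1e6 are probably already in bar (or kbar); values of order
# 0.01-0.1 are consistent with eV/A^3.
print(f"1 eV/A^3 = {EV_PER_A3_IN_BAR:.6e} bar")
```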
Update: I got around to compiling LAMMPS with CUDA support, but am still seeing this issue when using Kokkos to utilize the GPUs (NVIDIA A100-SXM4-40GB GPUs, CUDA 12.4 drivers installed).
My CMake config looks like this:
The Allegro model I am using was trained using NequIP 0.6.1, mir-allegro 0.2.0, and PyTorch 1.11.0 (py3.10_cuda11.3_cudnn8.2.0_0). It was also trained on a NVIDIA A100-SXM4-40GB GPU.
After deploying the best model to TorchScript format, I attempted to use it in a LAMMPS NPT simulation. The input geometry is a water-only system.
There are no atom overlaps in this geometry, and the LAMMPS input script attempts to do NPT.
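(Editor's aside: the "no atom overlaps" claim is easy to verify programmatically. A brute-force numpy sketch, my own illustration; it ignores periodic images, so for a periodic box it only bounds distances within the cell:)

```python
import numpy as np

def min_pair_distance(positions: np.ndarray) -> float:
    """Smallest interatomic distance in an (N, 3) array of Cartesian
    coordinates. O(N^2); does NOT account for periodic boundary images."""
    diff = positions[:, None, :] - positions[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-distances
    return float(d.min())

# Two atoms 0.95 A apart (an O-H bond length) plus one far away:
pos = np.array([[0.0, 0.0, 0.0],
                [0.95, 0.0, 0.0],
                [5.0, 5.0, 5.0]])
print(min_pair_distance(pos))  # -> 0.95
```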
I invoke LAMMPS with Kokkos:
srun --cpu-bind=cores --gpu-bind=none lmp -k on g 4 -sf kk -pk kokkos neigh full newton on -in input.lammps
This results in the following error:
If I instead change to NVT like this:
fix mynose all nvt &
temp 298.15 298.15 0.011 &
tchain 3 &
# iso 1.01325 1.01325 0.03
and again run with Kokkos, the run starts, but with nan as the pressure:
If I take this same job, still using the GPU build of LAMMPS, but run it CPU-only without Kokkos:
srun --cpu-bind=cores --gpu-bind=none lmp -in input.lammps
then pressures are calculated (although they are quite high):
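(Editor's aside on the "quite high" pressures: one convention mismatch worth ruling out is sign and units. ASE reports stress in eV/Å^3 with tension positive, while LAMMPS metal units report pressure in bar with compression positive, so the scalar pressure is P = -tr(sigma)/3 times a unit conversion. A sketch of that comparison, my own illustration and not code from pair_allegro:)

```python
import numpy as np

EV_PER_A3_TO_BAR = 1.602176634e6  # 1 eV/A^3 expressed in bar

def pressure_bar_from_ase_stress(stress: np.ndarray) -> float:
    """Scalar pressure in bar from a 3x3 ASE-convention stress tensor
    (eV/A^3, tension positive). LAMMPS reports compression as positive."""
    return float(-np.trace(stress) / 3.0 * EV_PER_A3_TO_BAR)

# A mildly compressed system: sigma = -0.001 eV/A^3 on the diagonal.
sigma = np.diag([-0.001, -0.001, -0.001])
print(pressure_bar_from_ase_stress(sigma))  # ~1602.2 bar
```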
Any advice on what to try? I have been using LAMMPS 02Aug23 since the folks at NERSC have used that version for their LAMMPS+pair_allegro installation, but is there a different LAMMPS release you recommend using?
The admins on my cluster are also going to fix CUDA 12.4, so I should be able to build against more recent CUDA and related libraries in the next few weeks.
Thanks!
Update 13 Sep 2024: The problem and nan pressures persist even when compiling with the latest LAMMPS development branch (i.e., Git commit 2995cb7 from doing git clone --depth=1 https://github.com/lammps/lammps).
I'm trying to do NPT calculations in LAMMPS using pair_allegro, but the Allegro model I trained is predicting -nan for the system pressure, so the NPT run fails. If I run under NVE or NVT, including the press property in the LAMMPS thermo logging, I see output like this:

If I attempt an NPT calculation, like this:

I immediately get this error:
I am training on an ASE dataset with stress information stored in units of energy / length^3:
and I can confirm that when using nequip-evaluate for inference on this same deployed model, I do get predicted stresses in the output file.

This problem happens no matter whether I use default_dtype: float32 in the config (and pair_style allegro3232 in the LAMMPS script) or default_dtype: float64 in the config (and pair_style allegro in the LAMMPS script). I am using NequIP 0.6.1, mir-allegro 0.2.0, and PyTorch 1.11.0 (CUDA 11.3, cuDNN 8.2.0). I compiled LAMMPS 02Aug23 using pair_allegro commit 20538c9, which is the commit introducing support for stress. Details of my compilation of LAMMPS, an example of my training config (except the default_dtype setting), and an example NVT LAMMPS input script are here.

I am forcing deletion of any overlapping atoms prior to the NPT run, and I do not see any indication when running under NVT or NVE that atoms are too close together, have very high forces, or are otherwise causing the simulation to go unstable. If I switch my LAMMPS input to use pair_style lj/cut, I am able to observe pressures in the thermo output.

Is there something obvious I'm missing about how to get pair_allegro to pass the stress predictions from my Allegro models to LAMMPS?

Thanks!