Open nec4 opened 4 months ago
Hello all,
Very happy with the tools - thank you for maintaining them and integrating with LAMMPS. I am running some vacuum simulations, and while increasing the number of GPUs (and MPI ranks) I ran into the following issue: it seems to be related to the way LAMMPS partitions the simulation box with an increasing number of processes (https://docs.lammps.org/Developer_par_part.html). If the rank/partition running the model contains no atoms, the network potential cannot accept the zero-size tensor. Is this the intended behaviour of Allegro models in LAMMPS?
Happy to run some more tests/provide more information if needed.
Hi,
This is certainly not the intended behavior and a bug that we'll fix. I think it should work already if using Kokkos.
You may, however, also want to avoid having completely empty domains in your simulations. LAMMPS has functionality for resizing domains using `balance` (statically) or `fix balance` (periodically); see the LAMMPS documentation. If you're brave, you can try this in combination with `comm_style tiled`.
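Roughly, that combination looks like this in an input script (the threshold and interval values here are just illustrative):

```
# recursive-bisection balancing requires the tiled communication style
comm_style tiled

# one-time (static) rebalance when this command is read
balance 1.1 rcb

# or rebalance every 1000 steps during the run
fix lb all balance 1000 1.1 rcb
```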
Thanks for the response! Of course, it was just something I noticed that happened rarely in a vacuum simulation :). I will check out those LAMMPS options.
I'm experiencing this behavior with the latest `multicut` branch of pair_allegro, on CPU LAMMPS 02 August 2023, update 3. I did a couple of epochs of training on the aspirin dataset (https://github.com/mir-group/allegro/blob/main/configs/example.yaml) and was trying to test that force field on a geometry taken from the first frame of the aspirin dataset.
Even when running with a single MPI process (which I would think would prevent empty domains), I'm still getting this "cannot reshape tensor" error. Any ideas on what might be happening?
Hi @samueldyoung29ctr,
This should be fixed now, but on `main`: we've since cleaned up and merged everything down. If you can confirm, I will close this issue. Thanks!
@Linux-cpp-lisp, thanks for the tip. I think I actually had a bad timestep and/or starting geometry. After fixing things, my simple Allegro model appears to evaluate correctly with `pair_allegro` compiled from both the `multicut` and `main` branches.
Great! Glad it's resolved. (You can probably stay on `main` then, since it is the latest.)
Hi @Linux-cpp-lisp, I'm running into this error again and am looking for some more guidance. This time, I tried a simpler system of water in a box. I trained an Allegro model on this dataset (`dataset_1593.xyz`, which I converted to ASE `.traj` format) from Cheng et al., "Ab initio thermodynamics of liquid and solid water". I used a ~90/10 train/validation split, lmax = 2, and altered some of the default numbers of MLP layers. This dataset has some water molecules outside of the simulation box, so I trained both with and without first wrapping the atoms into the box.
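For reference, the conversion and the optional wrapping step look roughly like this with ASE (output file name is a placeholder):

```python
from ase.io import read, write

# read every frame of the extended-XYZ dataset
frames = read("dataset_1593.xyz", index=":")

# optionally wrap atoms back into the periodic box before training
for atoms in frames:
    atoms.wrap()

write("dataset_1593.traj", frames)
```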
The Conda environment I am using to train has NequIP 0.6.0 and Allegro symlinked as a development package at commit 22f673c.
Training was done on Nvidia A100 GPUs. The training converges quickly since this dataset doesn't have very large forces on the atoms. I then deployed the models to standalone format:
nequip-deploy build --train-dir "<path to train dir of model for unwrapped data>" model-unwrapped_08-aug-2024.pth
nequip-deploy build --train-dir "<path to train dir of model for wrapped data>" model-wrapped_08-aug-2024.pth
I then used the standalone model files to run LAMMPS jobs on a different cluster where I have more compute time. I compiled LAMMPS for CPU with pair_allegro and Kokkos (a combination which apparently is not yet available on NERSC). I used the 02Aug2023 version of LAMMPS and, based on your previous advice, patched it with pair_allegro commit 20538c9, which is the current `main` and one commit later than the patch (89e3ce1) that was supposed to fix empty domains in non-Kokkos simulations. I compiled LAMMPS with GCC 10 compilers, FFTW3 from AMD AOCL 4.0 (compiled with GCC), GNU MPICH 3.3.2, and also exposed Intel oneAPI 2024.1 MKL libraries to satisfy the CMake preprocessing step. I linked against libtorch 2.0.0 (CPU, CXX11 ABI, available here) based on the advice of NERSC consultants.
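A sketch of the kind of CMake configuration this implies (paths are placeholders; flags follow the pair_allegro build instructions):

```bash
cmake ../cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DPKG_KOKKOS=ON -DKokkos_ENABLE_OPENMP=ON \
    -DCMAKE_PREFIX_PATH=/path/to/libtorch \
    -DMKL_INCLUDE_DIR=/path/to/mkl/include
```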
I set up several LAMMPS jobs using as initial geometries four randomly selected frames of that `dataset_1593.xyz` dataset, together with my two trained Allegro models (`model-wrapped_08-aug-2024.pth` and `model-unwrapped_08-aug-2024.pth`).
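In the LAMMPS inputs, the deployed models are referenced the usual pair_allegro way, roughly as follows (the type-name ordering here is an assumption):

```
pair_style  allegro
pair_coeff  * * model-wrapped_08-aug-2024.pth H O
```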
I ran these 8 jobs both with and without Kokkos, like this:
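A sketch of the two launch styles (the Kokkos flags follow the pair_allegro docs; the launcher, task count, and input file name are placeholders):

```bash
# plain MPI, no Kokkos
mpirun -np 128 lmp -in in.water

# Kokkos CPU backend
mpirun -np 128 lmp -k on -sf kk -pk kokkos newton on neigh full -in in.water
```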
Nearly all cases result in the Torch reshape error after the simulation has proceeded for some number of steps. (The one case that did not instead ended in a segfault.)
| Job | Last MD or minimization step | Torch reshape error? |
|---|---|---|
| model-wrapped_08-aug-2024.pth-frame607-kokkos | 321 | True |
| model-wrapped_08-aug-2024.pth-frame607 | 1741 | True |
| model-wrapped_08-aug-2024.pth-frame596-kokkos | 31 | True |
| model-wrapped_08-aug-2024.pth-frame596 | 454 | True |
| model-wrapped_08-aug-2024.pth-frame1351-kokkos | 270 | True |
| model-wrapped_08-aug-2024.pth-frame1351 | 616 | True |
| model-wrapped_08-aug-2024.pth-frame1252-kokkos | 179 | True |
| model-wrapped_08-aug-2024.pth-frame1252 | 1497 | True |
| model-unwrapped_08-aug-2024.pth-frame607-kokkos | 330 | True |
| model-unwrapped_08-aug-2024.pth-frame607 | 271 | True |
| model-unwrapped_08-aug-2024.pth-frame596-kokkos | 31 | True |
| model-unwrapped_08-aug-2024.pth-frame596 | 1744 | True |
| model-unwrapped_08-aug-2024.pth-frame1351-kokkos | 290 | False |
| model-unwrapped_08-aug-2024.pth-frame1351 | 754 | True |
| model-unwrapped_08-aug-2024.pth-frame1252-kokkos | 182 | True |
| model-unwrapped_08-aug-2024.pth-frame1252 | 1734 | True |
Additionally, I examined the domain decomposition chosen for one typical job, using the x, y, and z cut locations printed to the screen by the LAMMPS `balance` command to count how many atoms were in each domain at each frame of the simulation. While the limited precision of the printed cut points undoubtedly introduces some rounding error, I was surprised to find that 32 of the 128 domains were already empty at the first frame of the MD simulation. So it may be more complex than a domain simply going empty at a later point in the simulation.
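The counting itself is straightforward; a hypothetical helper, assuming a brick decomposition and the interior cut fractions read off the `balance` output:

```python
import numpy as np

def count_atoms_per_domain(frac_coords, xcuts, ycuts, zcuts):
    """Count atoms in each domain of a brick decomposition.

    frac_coords: (N, 3) atom positions scaled to [0, 1);
    xcuts/ycuts/zcuts: sorted interior cut fractions in each dimension.
    """
    # bin each atom by which slab it falls into along each axis
    ix = np.digitize(frac_coords[:, 0], xcuts)
    iy = np.digitize(frac_coords[:, 1], ycuts)
    iz = np.digitize(frac_coords[:, 2], zcuts)
    counts = np.zeros((len(xcuts) + 1, len(ycuts) + 1, len(zcuts) + 1), dtype=int)
    np.add.at(counts, (ix, iy, iz), 1)
    return counts  # entries equal to 0 are empty domains
```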
However, my limited testing does support the idea that empty domains make the simulation more likely to crash with a Torch reshape error. On another system I'm researching, with ~500 atoms (water-solvated transition-metal atoms), the number of MD steps completed before a Torch reshape crash is roughly inversely proportional to the number of domains.
This holds even though using fewer domains, for some reason, tends to produce different results in the pre-MD conjugate-gradient minimization: 16 MPI tasks yield a minimized geometry with the atoms concentrated in one half of the square box, while larger numbers of MPI tasks yield a more uniformly distributed geometry. The fact that the 16-task job does not hit a Torch reshape error even at O(5000) steps makes empty domains seem the more likely cause.
I'm not sure what else to try. I've tried forcing a domain rebalance after each MD step and increasing the neighbor-list skin and ghost-atom communication cutoffs (along the lines shown below), but I'm still encountering Torch reshape errors for all but the smallest numbers of domains.
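Concretely, the rebalancing and cutoff changes were of this sort (values illustrative):

```
# rebalance after every MD step
fix lb all balance 1 1.1 shift xyz 10 1.1

# larger neighbor skin and an extended ghost-atom communication cutoff
neighbor     2.0 bin
comm_modify  cutoff 12.0
```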
Do you have any guidance on what to try, or tests you'd like me to run?
Thanks!
Edit: running larger systems with the same domain decomposition seems to work. The water system above is 64 waters in a box. If I instead replicate it 3 times in each dimension (1728 waters), I can run 20k steps on 128 MPI tasks with no Torch reshape error, both with and without Kokkos.
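For reference, the replication can be done directly in the LAMMPS input after reading the 64-water cell:

```
# 3x3x3 replication of the 64-water cell -> 1728 waters
replicate 3 3 3
```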