Open nec4 opened 4 months ago
Hello all,
Very happy with the tools - thank you for maintaining them and integrating with LAMMPS. I am running some vacuum simulations, and while increasing the number of GPUs (and MPI ranks) I ran into the following issue: it seems to be related to the way LAMMPS partitions the simulation box with an increasing number of processes (https://docs.lammps.org/Developer_par_part.html). If the rank/partition running the model contains no atoms, the network potential cannot accept the zero-size tensor. Is this the intended behaviour of Allegro models in LAMMPS?
Happy to run some more tests/provide more information if needed.
Hi,
This is certainly not the intended behavior and a bug that we'll fix. I think it should work already if using Kokkos.
You may, however, also want to avoid having completely empty domains in your simulations. LAMMPS has functionality for resizing domains using `balance` (statically) or `fix balance` (periodically); see the LAMMPS documentation. If you're brave, you can try this in combination with `comm_style tiled`.
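Roughly, that combination looks like this in an input script (the threshold and interval values here are just illustrative):

```
# recursive-bisection balancing requires the tiled communication style
comm_style tiled

# one-time (static) rebalance when this command is read
balance 1.1 rcb

# or rebalance every 1000 steps during the run
fix lb all balance 1000 1.1 rcb
```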
Thanks for the response! Of course, it was just something I noticed that happened rarely in a vacuum simulation :). I will check out those LAMMPS options.
I'm experiencing this behavior with the latest `multicut` branch of pair_allegro, on CPU LAMMPS 02 August 2023, update 3. I did a couple of epochs of training on the aspirin dataset (https://github.com/mir-group/allegro/blob/main/configs/example.yaml) and was trying to test that force field on a geometry taken from the first frame of the aspirin dataset.
Even when running with a single MPI process (which I would think would prevent empty domains), I'm still getting this "cannot reshape tensor" error. Any ideas on what might be happening?
Hi @samueldyoung29ctr,
This should be fixed now, but on `main`: we've since cleaned up and merged everything down. If you can confirm, I will close this issue. Thanks!
@Linux-cpp-lisp, thanks for the tip. I think I actually had a bad timestep and/or starting geometry. After fixing things, my simple Allegro model appears to evaluate correctly with `pair_allegro` compiled from both the `multicut` and `main` branches.
Great! Glad it's resolved. (You can probably stay on `main` then, since it is the latest.)
Hi @Linux-cpp-lisp, I'm running into this error again and am looking for some more guidance. This time, I tried a simpler system of water in a box. I trained an Allegro model on this dataset (`dataset_1593.xyz`, which I converted to ASE `.traj` format) from Cheng et al., "Ab initio thermodynamics of liquid and solid water". I used a ~90/10 train/validation split, lmax = 2, and altered some of the default numbers of MLP layers. This dataset has some water molecules outside of the simulation box, so I trained both with and without first wrapping the atoms into the box.
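For reference, the conversion and the optional wrapping step look roughly like this with ASE (output file name is a placeholder):

```python
from ase.io import read, write

# read every frame of the extended-XYZ dataset
frames = read("dataset_1593.xyz", index=":")

# optionally wrap atoms back into the periodic box before training
for atoms in frames:
    atoms.wrap()

write("dataset_1593.traj", frames)
```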
The Conda environment I am using to train has NequIP 0.6.0 and Allegro symlinked as a development package at commit 22f673c.
Training was done on Nvidia A100 GPUs. The training converges quickly since this dataset doesn't have very large forces on the atoms. I then deployed the models to standalone format:
nequip-deploy build --train-dir "<path to train dir of model for unwrapped data>" model-unwrapped_08-aug-2024.pth
nequip-deploy build --train-dir "<path to train dir of model for wrapped data>" model-wrapped_08-aug-2024.pth
I then used the standalone model files to run LAMMPS jobs on a different cluster where I have more compute time. I compiled LAMMPS for CPU with pair_allegro and Kokkos (a combination which apparently is not yet available on NERSC). I used the 02Aug2023 version of LAMMPS and, based on your previous advice, patched it with pair_allegro commit 20538c9, which is the current `main` and one commit later than the patch (89e3ce1) that was supposed to fix empty domains in non-Kokkos simulations. I compiled LAMMPS with GCC 10 compilers, FFTW3 from AMD AOCL 4.0 (compiled with GCC), GNU MPICH 3.3.2, and also exposed Intel oneAPI 2024.1 MKL libraries to satisfy the CMake preprocessing step. I linked against libtorch 2.0.0 (CPU, CXX11 ABI, available here) based on the advice of NERSC consultants.
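A sketch of the kind of CMake configuration this implies (paths are placeholders; flags follow the pair_allegro build instructions):

```bash
cmake ../cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DPKG_KOKKOS=ON -DKokkos_ENABLE_OPENMP=ON \
    -DCMAKE_PREFIX_PATH=/path/to/libtorch \
    -DMKL_INCLUDE_DIR=/path/to/mkl/include
```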
I set up several LAMMPS jobs using as initial geometries four randomly selected frames of that `dataset_1593.xyz` dataset, together with my two trained Allegro models (`model-wrapped_08-aug-2024.pth` and `model-unwrapped_08-aug-2024.pth`).
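In the LAMMPS inputs, the deployed models are referenced the usual pair_allegro way, roughly as follows (the type-name ordering here is an assumption):

```
pair_style  allegro
pair_coeff  * * model-wrapped_08-aug-2024.pth H O
```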
I ran these 8 jobs both with and without Kokkos, like this:
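A sketch of the two launch styles (the Kokkos flags follow the pair_allegro docs; the launcher, task count, and input file name are placeholders):

```bash
# plain MPI, no Kokkos
mpirun -np 128 lmp -in in.water

# Kokkos CPU backend
mpirun -np 128 lmp -k on -sf kk -pk kokkos newton on neigh full -in in.water
```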
Nearly all cases result in the Torch reshape error after the simulation has proceeded for some number of steps. (The one case that did not instead ended in a segfault.)
| Job | Last MD or minimization step | Torch reshape error? |
|---|---|---|
| model-wrapped_08-aug-2024.pth-frame607-kokkos | 321 | True |
| model-wrapped_08-aug-2024.pth-frame607 | 1741 | True |
| model-wrapped_08-aug-2024.pth-frame596-kokkos | 31 | True |
| model-wrapped_08-aug-2024.pth-frame596 | 454 | True |
| model-wrapped_08-aug-2024.pth-frame1351-kokkos | 270 | True |
| model-wrapped_08-aug-2024.pth-frame1351 | 616 | True |
| model-wrapped_08-aug-2024.pth-frame1252-kokkos | 179 | True |
| model-wrapped_08-aug-2024.pth-frame1252 | 1497 | True |
| model-unwrapped_08-aug-2024.pth-frame607-kokkos | 330 | True |
| model-unwrapped_08-aug-2024.pth-frame607 | 271 | True |
| model-unwrapped_08-aug-2024.pth-frame596-kokkos | 31 | True |
| model-unwrapped_08-aug-2024.pth-frame596 | 1744 | True |
| model-unwrapped_08-aug-2024.pth-frame1351-kokkos | 290 | False |
| model-unwrapped_08-aug-2024.pth-frame1351 | 754 | True |
| model-unwrapped_08-aug-2024.pth-frame1252-kokkos | 182 | True |
| model-unwrapped_08-aug-2024.pth-frame1252 | 1734 | True |
Additionally, I examined the domain decomposition chosen for one typical job, using the x, y, and z cut locations printed to the screen by the LAMMPS `balance` command to count how many atoms were in each domain at each frame of the simulation. While the limited precision of the printed cut points undoubtedly introduces some rounding error, I was surprised to find that 32 of the 128 domains were already empty at the first frame of the MD simulation. So it may be more complex than a domain simply going empty at a later point in the simulation.
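The counting itself is straightforward; a hypothetical helper, assuming a brick decomposition and the interior cut fractions read off the `balance` output:

```python
import numpy as np

def count_atoms_per_domain(frac_coords, xcuts, ycuts, zcuts):
    """Count atoms in each domain of a brick decomposition.

    frac_coords: (N, 3) atom positions scaled to [0, 1);
    xcuts/ycuts/zcuts: sorted interior cut fractions in each dimension.
    """
    # bin each atom by which slab it falls into along each axis
    ix = np.digitize(frac_coords[:, 0], xcuts)
    iy = np.digitize(frac_coords[:, 1], ycuts)
    iz = np.digitize(frac_coords[:, 2], zcuts)
    counts = np.zeros((len(xcuts) + 1, len(ycuts) + 1, len(zcuts) + 1), dtype=int)
    np.add.at(counts, (ix, iy, iz), 1)
    return counts  # entries equal to 0 are empty domains
```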
However, my limited testing does support the idea that empty domains make the simulation more likely to crash with a Torch reshape error. On another system I'm researching, with ~500 atoms (water-solvated transition-metal atoms), the number of MD steps completed before a Torch reshape crash is roughly inversely proportional to the number of domains.
This holds even though using fewer domains, for some reason, tends to produce different results in the pre-MD conjugate-gradient minimization: 16 MPI tasks yield a minimized geometry with the atoms concentrated in one half of the square box, while larger numbers of MPI tasks yield a more uniformly distributed geometry. The fact that the 16-task job does not hit a Torch reshape error even at O(5000) steps makes empty domains seem the more likely cause.
I'm not sure what else to try. I've tried forcing a domain rebalance after each MD step and increasing the neighbor-list skin and ghost-atom communication cutoffs (along the lines shown below), but I'm still encountering Torch reshape errors for all but the smallest numbers of domains.
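Concretely, the rebalancing and cutoff changes were of this sort (values illustrative):

```
# rebalance after every MD step
fix lb all balance 1 1.1 shift xyz 10 1.1

# larger neighbor skin and an extended ghost-atom communication cutoff
neighbor     2.0 bin
comm_modify  cutoff 12.0
```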
Do you have any guidance on what to try, or tests you'd like me to run?
Thanks!
Edit: running larger systems with the same domain decomposition seems to work. The water system above is 64 waters in a box. If I instead replicate it 3 times in each dimension (1728 waters), I can run 20k steps on 128 MPI tasks with no Torch reshape error, both with and without Kokkos.
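For reference, the replication can be done directly in the LAMMPS input after reading the 64-water cell:

```
# 3x3x3 replication of the 64-water cell -> 1728 waters
replicate 3 3 3
```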