[QUESTION] Error while using potential in lammps.

gshs12051 commented 1 year ago

I trained Allegro model in GeSe system and deployed the model to lammps pair potential following the steps. and I tried to do MD simulation using lammps input below.

echo both
boundary p p p
processors * * * grid numa
units metal
newton on
read_data coo
pair_style allegro
pair_coeff * * ../deployed.pth Ge Se

mass 1 72.61
mass 2 78.96

thermo 10
compute ppa all pe/atom
thermo_style custom step temp etotal pe press vol spcpu cpuremain
timestep 0.002

dump traj all custom 10 melt.dump id type x y z ix iy iz c_ppa fx fy fz
dump_modify traj sort id
velocity all create 1000 20201021
fix int all nve
run 2

Lammps works well in the case of run1 but in the case of run more than 1 (ex run 2 or more) lammps terminate with the errors below.

LAMMPS (29 Sep 2021 - Update 3)
  using 1 OpenMP thread(s) per MPI task
boundary p p p
processors * * * grid numa
units metal
newton on
read_data coo
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (15.141223 15.141223 15.141223) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  1 by 1 by 1 core grid within node
  reading atoms ...
  120 atoms
  read_data CPU = 0.001 seconds
#replicate 3 3 3
pair_style allegro
Allegro is using device cpu
pair_coeff * * ../deployed.pth Ge Se
Allegro: Loading model from ../deployed.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | Ge | 1 | Ge
1 | Se | 2 | Se

mass 1 72.61
mass 2 78.96

thermo 10
compute ppa all pe/atom
thermo_style custom step temp etotal pe press vol spcpu cpuremain
timestep 0.002

dump traj all custom 10 melt.dump id type x y z ix iy iz c_ppa fx fy fz
dump_modify traj sort id
velocity all create 1000 20201021
fix int all nve
run 2
Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro, perpetual
      attributes: full, newton on, ghost
      pair build: full/bin/ghost
      stencil: full/ghost/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 4.612 | 4.612 | 4.612 Mbytes
Step Temp TotEng PotEng Press Volume S/CPU CPULeft 
       0         1000   -463.34487   -478.72682    4733.1234    3471.2258            0            0 
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Unsupported value kind: Tensor

Aborted

in the case of run 1, lammps ended successfully with the output below

LAMMPS (29 Sep 2021 - Update 3)
  using 1 OpenMP thread(s) per MPI task
boundary p p p
processors * * * grid numa
units metal
newton on
read_data coo
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (15.141223 15.141223 15.141223) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  1 by 1 by 1 core grid within node
  reading atoms ...
  120 atoms
  read_data CPU = 0.001 seconds
#replicate 3 3 3
pair_style allegro
Allegro is using device cpu
pair_coeff * * ../deployed.pth Ge Se
Allegro: Loading model from ../deployed.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | Ge | 1 | Ge
1 | Se | 2 | Se

mass 1 72.61
mass 2 78.96

thermo 10
compute ppa all pe/atom
thermo_style custom step temp etotal pe press vol spcpu cpuremain
timestep 0.002

dump traj all custom 10 melt.dump id type x y z ix iy iz c_ppa fx fy fz
dump_modify traj sort id
velocity all create 1000 20201021
fix int all nve
run 1
Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro, perpetual
      attributes: full, newton on, ghost
      pair build: full/bin/ghost
      stencil: full/ghost/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 4.612 | 4.612 | 4.612 Mbytes
Step Temp TotEng PotEng Press Volume S/CPU CPULeft 
       0         1000   -463.34487   -478.72682    4733.1234    3471.2258            0            0 
       1    994.88372   -463.34483   -478.64809    4708.9075    3471.2258    1.7103956            0 
Loop time of 0.584683 on 1 procs for 1 steps with 120 atoms

Performance: 0.296 ns/day, 81.206 hours/ns, 1.710 timesteps/s
94.0% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.58463    | 0.58463    | 0.58463    |   0.0 | 99.99
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 6.9141e-06 | 6.9141e-06 | 6.9141e-06 |   0.0 |  0.00
Output  | 3.5048e-05 | 3.5048e-05 | 3.5048e-05 |   0.0 |  0.01
Modify  | 2.861e-06  | 2.861e-06  | 2.861e-06  |   0.0 |  0.00
Other   |            | 5.484e-06  |            |       |  0.00

Nlocal:        120.000 ave         120 max         120 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:        936.000 ave         936 max         936 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:         0.00000 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:      8850.00 ave        8850 max        8850 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 8850
Ave neighs/atom = 73.750000
Neighbor list builds = 0
Dangerous builds = 0
write_data liq.dat
System init for write_data ...

Total wall time: 0:00:01

Linux-cpp-lisp commented 1 year ago

Hi @gshs12051 ,

Thanks for your interest in our work!

What is your PyTorch version? This looks like a familiar bug from PyTorch that should be resolved by updating to the latest support stable version (1.11).

gshs12051 commented 1 year ago

Thanks. I was using Pytorch 1.10 version and after updating to 1.11 version the problem solved. I have two more questions. First question is in the case of MD simulation with MPI. LAMMPS didn't proceed after this stage. mpirun -np 8 lmp -sf omp -pk omp 4 -in in.lammps

run 10
No /omp style for force computation currently active

While it works well in the case of mpirun -np 4 lmp -sf omp -pk omp 8 -in in.lammps like below. I am wondering if there is a specific limit in MPI processor grid size. and sometime MD simluation ends with error below

  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
Per MPI rank memory allocation (min/avg/max) = 11.64 | 11.64 | 11.64 Mbytes
Step Temp TotEng PotEng Press Volume S/CPU CPULeft 
       0         1000   -502.17886   -517.56082    4733.1234    3471.2258            0            0 
      10    1037.6838   -502.17892   -518.14053    4911.4855    3471.2258   0.62056358    96670.196 
      20    1149.0517   -502.18269   -519.85735    5438.6038    3471.2258    3.3797967    57200.356 
      30    1366.1239   -502.20265   -523.21631    6466.0332    3471.2258    3.3869016    44029.363 
      40    1706.0198   -502.27363   -528.51555    8074.8025    3471.2258    3.3691646     37465.69 
      50      2092.37   -502.46846    -534.6532    9903.4456    3471.2258      0.80146    44927.751 
      60    2388.6437   -502.88855   -539.63056    11305.746    3471.2258   0.49348786    57677.206 
      70    2591.1369   -503.60771   -543.46446    12264.171    3471.2258   0.49194526    66832.571 
      80    2867.5918   -504.70262   -548.81179    13572.666    3471.2258   0.54046821    72327.097 
      90    3162.0488   -506.21135   -554.84985    14966.367    3471.2258   0.47370461    78332.381 
     100    3463.3768   -508.07882   -561.35234     16392.59    3471.2258   0.43357856    84302.633 
     110    3783.0973   -510.20537   -568.39681    17905.867    3471.2258    0.4624285    88399.774 
     120    4040.5194   -512.46371   -574.61481    19124.277    3471.2258    0.4751138    91522.343 
     130    3916.8145   -468.80556   -529.05384    18538.766    3471.2258   0.56733556    92585.622 
     140    4160.4834   -471.28922    -535.2856    19692.081    3471.2258   0.48887372    94704.054 
     150    5138.8348   -472.80773   -551.85307    24322.739    3471.2258   0.47169693    96834.505 
     160    5735.4544   -477.15941   -565.38192    27146.614    3471.2258   0.47489075    98642.676 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 39872 RUNNING AT n020
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

And next question is during the training, I tried to use the train set of multiple cell size. (for example some training set of 120 atoms and some training set of 60 atoms) Then the training ended with the errors below.

instantiate NpzDataset
   optional_args :                                         key_mapping
   optional_args :                                npz_fixed_field_keys
   optional_args :                                                root
   optional_args :                                  extra_fixed_fields <-                         dataset_extra_fixed_fields
   optional_args :                                           file_name <-                                  dataset_file_name
...NpzDataset_param = dict(
...   optional_args = {'key_mapping': {'z': 'atomic_numbers', 'E': 'total_energy', 'F': 'forces', 'R': 'pos'}, 'include_keys': [], 'npz_fixed_field_keys': ['atomic_numbers'], 'file_name': './train_set.npz', 'url': None, 'force_fixed_keys': [], 'extra_fixed_fields': {'r_max': 4.0}, 'include_frames': None, 'root': 'results/GeSe2'},
...   positional_args = {'type_mapper': <nequip.data.transforms.TypeMapper object at 0x2b9f505d7490>})
Traceback (most recent call last):
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 681, in __init__
    super().__init__(
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 123, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 90, in __init__
    self._process()
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 175, in _process
    self.process()
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 269, in process
    data_list = [
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 270, in <listcomp>
    constructor(**{**{f: v[i] for f, v in fields.items()}, **fixed_fields})
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 326, in from_points
    return cls(edge_index=edge_index, pos=torch.as_tensor(pos), **kwargs)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 221, in __init__
    _process_dict(kwargs)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 163, in _process_dict
    raise ValueError(
ValueError: atomic_numbers is a node field but has the wrong dimension torch.Size([72, 1])

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gshs12051/anaconda3/envs/pytorch/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 74, in main
    trainer = fresh_start(config)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 177, in fresh_start
    dataset = dataset_from_config(config, prefix="dataset")
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/_build.py", line 78, in dataset_from_config
    instance, _ = instantiate(
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 234, in instantiate
    raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`

Linux-cpp-lisp commented 1 year ago

Hi @gshs12051 ,

Great, glad it resolved your issue!

Could you please open a new issue on pair_allegro (this repo) for the MPI question, and a separate issue on the nequip repo for the training issue? This helps keep information searchable and organized for future users.

Thanks!

mir-group / pair_allegro

[QUESTION] Error while using potential in lammps. #5