mir-group / pair_allegro

LAMMPS pair style for Allegro deep learning interatomic potentials with parallelization support
https://www.nature.com/articles/s41467-023-36329-y
MIT License

Error with the new pair_allegro-stress branch #12

Closed · nukenadal closed this issue 1 year ago

nukenadal commented 1 year ago

Hi Allegro developers,

Thank you for the update to the stress branch of pair_allegro. I am trying to install this new branch to enable stress prediction in my MD production runs. Previously I used the main branch and made several successful NVT MD runs, but with the stress branch some errors occurred when producing NpT MD trajectories.

I used the following package versions:

  torch 1.11.0
  libtorch 1.13.0+cu116 (since the documentation says to avoid 1.12)
  CUDA 11.4.2
  cuDNN 8.5.0.96

LAMMPS compiled successfully according to the documentation, and the FF was trained with stresses included. In the subsequent MD run, I got the following LAMMPS output:

LAMMPS (29 Sep 2021)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (36.000000 36.000000 36.000000) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  2592 atoms
  read_data CPU = 0.023 seconds
Allegro is using input precision f and output precision d
Allegro is using device cuda
Allegro: Loading model from ./ff_m3.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | H | 1 | H
1 | Pb | 2 | Pb
2 | C | 3 | C
3 | Br | 4 | Br
4 | N | 5 | N
Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 7
  ghost atom cutoff = 7
  binsize = 3.5, bins = 11 11 11
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro, perpetual
      attributes: full, newton on, ghost
      pair build: full/bin/ghost
      stencil: full/ghost/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.001

It stopped with this error:

terminate called after throwing an instance of 'c10::Error'
  what():  expected scalar type Double but found Float
Exception raised from data_ptr<double> at aten/src/ATen/core/TensorMethods.cpp:20 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x147bab53f86e in /home/libtorch-new/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x147bab50a3a8 in /home/libtorch-new/lib/libc10.so)
frame #2: double* at::TensorBase::data_ptr<double>() const + 0x116 (0x147b8207d796 in /home/libtorch-new/lib/libtorch_cpu.so)
frame #3: at::TensorAccessor<double, 2ul, at::DefaultPtrTraits, long> at::TensorBase::accessor<double, 2ul>() const & + 0x4b (0x7198eb in /home/lammps-stable_29Sep2021/build/lmp)
frame #4: /home/lammps-stable_29Sep2021/build/lmp() [0x722dd2]
frame #5: /home/lammps-stable_29Sep2021/build/lmp() [0x55bd22]
frame #6: /home/lammps-stable_29Sep2021/build/lmp() [0x5163c9]
frame #7: /home/lammps-stable_29Sep2021/build/lmp() [0x468e4a]
frame #8: /home/lammps-stable_29Sep2021/build/lmp() [0x469553]
frame #9: /home/lammps-stable_29Sep2021/build/lmp() [0x444b28]
frame #10: __libc_start_main + 0xf3 (0x147b51b9f493 in /lib64/libc.so.6)
frame #11: /home/lammps-stable_29Sep2021/build/lmp() [0x44502e]

/var/spool/PBS/mom_priv/jobs/6986068.pbs.SC: line 11: 2239752 Aborted                 /home/lammps-stable_29Sep2021/build/lmp -in ./npt.in > logfile_$PBS_JOBID

I wonder if this is because I used a combination of package versions that the new stress branch is not compatible with?

Thank you so much.

Linux-cpp-lisp commented 1 year ago

Hi @nukenadal .

Thanks for your interest in our code and for trying this new feature!

The defaults on the stress branch are built to be compatible with the current develop branch of nequip, which will soon be released as 0.6.0. As a result, if your model is from an earlier version of nequip, you'll need to specify pair_style allegro3232 on the stress branch instead of pair_style allegro to get the compatible version.
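Concretely, the relevant lines of your LAMMPS input would look something like the sketch below (the element names and model filename are taken from your log and are only placeholders; adjust them to your actual setup):

    # pair style for models deployed with older nequip versions (32-bit input/output)
    pair_style   allegro3232
    # deployed model file, followed by one element name per LAMMPS atom type
    pair_coeff   * * ./ff_m3.pth H Pb C Br N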

Hopefully this resolves the issue, and please let me know if you observe any issues or suspicious results when using the stress branch!

nukenadal commented 1 year ago

Thank you so much for your rapid response!

The above problem is resolved with pair_style allegro3232. I tested an NpT MD run, but I found that the stresses are not predicted as I expected. I used the normal npt ensemble from LAMMPS, as if I were using a classical potential, with the command fix 1 all npt temp 200 200 $(100*dt) tri 1.0 1.0 $(50*dt) (see the input sketch at the end of this comment). The aim is to simulate the structure under ambient pressure (1 bar, since I use metal units). Below is an example of my training dataset header:

Lattice="12.002029185229819 0.004064393052959 -0.051196621492764 -0.000000000004469 11.711580380388035 -0.008099807301418 0.000000000001293 -0.000000000000898 12.039917140248585" 
Properties=species:S:1:pos:R:3:forces:R:3 energy=-1141.403858931822 
stress="0.004046574721405 -0.001809377097797 0.000241330359958 -0.001809377097797 0.000898277454365 -0.001171032253305 0.000241330359958 -0.001171032253305 0.000988749363509" 
free_energy=-1141.403858931822 pbc="T T T" 

Here the stresses are saved in units of eV per cubic angstrom (eV/Å³). In the MD production I then got a printout like this:

Step CPU Time Temp Press Volume TotEng 
  0       0.00         0.000          200.000           50652.238              47241.633                -29903.285 
 50      58.76         0.025          266.807            1806.511              62650.361                -30039.377 
100      97.48         0.050          421.660            5194.959             106511.009                -30309.355 
150     120.55         0.075          385.755            1111.530             156182.512                -30445.874 
200     136.21         0.100          316.264             851.025             203929.625                -30503.778 
250     148.06         0.125          259.946             552.863             267116.349                -30550.031 
300     157.38         0.150          217.007             369.768             342194.282                -30582.112 
350     164.71         0.175          183.242             217.801             420573.110                -30603.389 
400     170.68         0.200          167.577             125.765             497445.470                -30614.017 
450     175.88         0.225          165.960              75.794             573206.403                -30616.751 
500     180.61         0.250          172.086              64.869             652145.704                -30615.140 
550     185.00         0.275          178.895              58.633             741854.160                -30612.851 
600     189.05         0.300          184.556              49.415             842463.753                -30610.630 
650     192.79         0.325          190.739              40.879             958897.820                -30608.741 
700     196.31         0.350          191.660              37.834            1086485.734                -30607.029 
750     199.63         0.375          196.898              30.968            1229353.201                -30605.731 
800     202.77         0.400          199.433              25.713            1388276.031                -30604.834 
850     205.78         0.425          200.112              20.940            1559556.374                -30604.274 
900     208.66         0.450          200.412              19.134            1743025.867                -30603.748 
950     211.45         0.475          200.821              16.344            1946113.740                -30603.529 
1000     214.18         0.500          202.782              11.614            2161252.664                -30603.517 
1050     216.84         0.525          200.039              10.945            2394540.916                -30603.602 
1100     219.44         0.550          198.836               9.568            2640828.010                -30603.645 
1150     221.99         0.575          199.692               8.230            2906593.920                -30603.578 

The pressure was initially quite high and then decreased towards 1 bar. I was wondering whether any additional setting is required to reconcile the stresses from the FF with how LAMMPS handles the pressure of the cell? I used the main branches of nequip 0.5.6 and Allegro 0.2.0, if that is relevant.
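For reference, a minimal sketch of the relevant input lines, pieced together from the settings above (the model path and timestep are as in my earlier log; the thermo_style line is written to match the columns in the printout and may differ from my actual input):

    # metal units: pressure in bar, time in ps
    units        metal
    pair_style   allegro3232
    pair_coeff   * * ./ff_m3.pth H Pb C Br N
    timestep     0.001
    # thermo columns matching the printout above
    thermo       50
    thermo_style custom step cpu time temp press vol etotal
    # NpT at 200 K and 1 bar with a fully flexible triclinic cell
    fix          1 all npt temp 200 200 $(100*dt) tri 1.0 1.0 $(50*dt)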

Thank you!

Hongyu-yu commented 1 year ago

Hi @nukenadal, you can try the para_stress branch of nequip together with the stress branch of pair_allegro, and train with a config.yaml that uses ParaStressForceOutput instead of StressForceOutput for your network. Details of the changes can be seen in PR_pair_allegro and PR_nequip, with more discussion in this issue. It works well on my side; hope it helps.

Linux-cpp-lisp commented 1 year ago

@nukenadal : have you verified with a test set / validation set, such as with nequip-evaluate, that your model's stress predictions are indeed correct?

It's hard for us to say without more details on the system, training data, and LAMMPS input file...

nukenadal commented 1 year ago

Sorry, I found that the above MD calculation had a mismatched FF and input structure, which is probably why the structure disintegrated so rapidly. I retrained the FF with more data and epochs; it gives the following test results from nequip-evaluate. The stress MAE corresponds to approx. 384 bar (converting from eV/Å³ with 1 eV/Å³ ≈ 1.602 × 10^6 bar). Given that the stresses for this system are about 50 kbar, I wonder if this level of accuracy can be considered sufficiently good?

               f_mae =  0.010270           
              f_rmse =  0.014187           
          stress_mae =  0.000240           
         stress_rmse =  0.000424           
               e_mae =  0.032459           
             e/N_mae =  0.000338         

Moreover, I had to re-compile LAMMPS because I accidentally wiped out the build folder. With the same modules and package versions (and with allegro3232 enabled as before), the MD production now gives the following error:

[cx3-11-8:2563622:0:2563622] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:2563622) ====
 0 0x0000000000012c20 .annobin_sigaction.c()  sigaction.c:0
 1 0x0000000004c3b0c4 torch::jit::InterpreterStateImpl::callstack()  ???:0
 2 0x0000000004c3cbf1 torch::jit::InterpreterStateImpl::handleError()  ???:0
 3 0x0000000004c4ab85 torch::jit::InterpreterStateImpl::runImpl()  ???:0
 4 0x0000000004c3604f torch::jit::InterpreterState::run()  ???:0
 5 0x0000000004c264da torch::jit::GraphExecutorImplBase::run()  ???:0
 6 0x00000000048adf8e torch::jit::Method::operator()()  ???:0
 7 0x00000000007179ac torch::jit::Module::forward()  /home/libtorch-new/include/torch/csrc/jit/api/module.h:114
 8 0x0000000000720ee1 LAMMPS_NS::PairAllegro<(Precision)0>::compute()  /home/lammps-stable_29Sep2021/src/pair_allegro.cpp:442
 9 0x000000000055b366 LAMMPS_NS::Verlet::run()  /home/lammps-stable_29Sep2021/src/verlet.cpp:312
10 0x00000000005163fb LAMMPS_NS::Run::command()  /home/lammps-stable_29Sep2021/src/run.cpp:180
11 0x0000000000468e4a LAMMPS_NS::Input::execute_command()  /home/lammps-stable_29Sep2021/src/input.cpp:794
12 0x0000000000469553 LAMMPS_NS::Input::file()  /home/lammps-stable_29Sep2021/src/input.cpp:273
13 0x0000000000444b28 main()  /home/lammps-stable_29Sep2021/src/main.cpp:98
14 0x0000000000023493 __libc_start_main()  ???:0
15 0x000000000044502e _start()  ???:0
=================================
/var/spool/PBS/mom_priv/jobs/7031777.pbs.SC: line 11: 2563622 Segmentation fault      /home/lammps-stable_29Sep2021/build/lmp -in ./npt_org_perov.in > logfile_$PBS_JOBID

I tested multiple combinations of packages, but the issue remains. Do you have any suggestions for this type of error? Thanks so much!

Linux-cpp-lisp commented 1 year ago

Sorry, I found that the above MD calculation had a mismatched FF and input structure.

👍

The MAE of stresses corresponds to approx. 384 bar. Given that the stresses of this system size are about 50 kbar, I wonder if this level of accuracy can be considered as sufficiently good?

This is an error of less than 1%, which sounds excellent; in an absolute sense it also looks good to me... @simonbatzner ?

Regarding the segfault, we've generally seen this either when you run out of memory or when there is an empty simulation domain in a parallel LAMMPS simulation... does either of these sound applicable?

nukenadal commented 1 year ago

After a few tests, I found that this is indeed an out-of-memory error. I reduced the FF size (from 4 layers to 2), and the production run now completes without hitting the memory limit and gives meaningful trajectories. Thank you so much for the valuable advice!

Two extra points, though not quite relevant:

  1. The MD production with 1 GPU is actually faster than with 2 GPUs. I am not sure whether this is related to my network settings, making it not very scalable, or whether it is somehow a feature of this experimental version.
  2. I also tried this workflow on my local workstation. I got the following error when trying to load the dataset for training:

     Traceback (most recent call last):
       File "/home/.local/lib/python3.9/site-packages/torch/serialization.py", line 380, in save
         _save(obj, opened_zipfile, pickle_module, pickle_protocol)
       File "/home/.local/lib/python3.9/site-packages/torch/serialization.py", line 604, in _save
         zip_file.write_record(name, storage.data_ptr(), num_bytes)
     OSError: [Errno 28] No space left on device

I wonder if this is telling me that I don't have enough storage space for some temporary data, or whether it is a memory error like the one above?

Linux-cpp-lisp commented 1 year ago

Hi @nukenadal ,

With a few tests, I found that this is actually an out-of-memory error, I lowered the setting of FF training (4 layers to 2 layers) and the production can run without hitting the memory limit and also gave meaningful trajectories.

Great! This should also be possible to resolve, then, by running on more GPUs with the same model.

The MD production with 1 GPU is actually faster than using 2 GPUs. I am not sure if this is related to my network setting, making it not very scalable. Or this is somehow a feature of this experimental version?

You still have ~2500 atoms, like in your original post? And you mean two GPUs, not two GPU nodes? Are you measuring this through the LAMMPS performance summary printed at the end, and if so, how does the percentage of time spent in "Comm" change for you from one to two GPUs?

I also tried this workflow on my local workstation. I got the following error when trying to load the dataset for training: ... OSError

This happens after it prints "Processing..." but before "Loaded Dataset"? If so, then yes, it is failing to write out the preprocessed data file, which is in an efficient binary format but can be quite large because it includes the full neighborlist information, especially for a big cutoff in a dense system.

nukenadal commented 1 year ago

Hi,

You still have ~2500 atoms like your original post? And you mean two GPUs, and not two GPU nodes? You are measuring this through the LAMMPS performance summary printed at the end, and if so, how does the time percentage spent in "Comm" change for you from one to two GPUs?

I used about 2600 atoms again and compared 1 GPU on 1 node with 2 GPUs on 1 node, measuring the speed as the time taken for the same 2000 MD steps. I got the following results:

for 1 GPU:

Loop time of 460.862 on 1 procs for 2000 steps with 2592 atoms

Performance: 0.187 ns/day, 128.017 hours/ns, 4.340 timesteps/s
99.5% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 458.13     | 458.13     | 458.13     |   0.0 | 99.41
Neigh   | 2.0769     | 2.0769     | 2.0769     |   0.0 |  0.45
Comm    | 0.047307   | 0.047307   | 0.047307   |   0.0 |  0.01
Output  | 0.39902    | 0.39902    | 0.39902    |   0.0 |  0.09
Modify  | 0.19997    | 0.19997    | 0.19997    |   0.0 |  0.04
Other   |            | 0.01258    |            |       |  0.00

Nlocal:        2592.00 ave        2592 max        2592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:        3591.00 ave        3591 max        3591 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:         0.00000 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:      123132.0 ave      123132 max      123132 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 123132
Ave neighs/atom = 47.504630
Neighbor list builds = 174
Dangerous builds = 96
Total wall time: 0:07:58

And for 2 GPUs:

Loop time of 459.87 on 1 procs for 2000 steps with 2592 atoms

Performance: 0.188 ns/day, 127.742 hours/ns, 4.349 timesteps/s
20.3% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 457.31     | 457.31     | 457.31     |   0.0 | 99.44
Neigh   | 2.0862     | 2.0862     | 2.0862     |   0.0 |  0.45
Comm    | 0.051038   | 0.051038   | 0.051038   |   0.0 |  0.01
Output  | 0.19311    | 0.19311    | 0.19311    |   0.0 |  0.04
Modify  | 0.21184    | 0.21184    | 0.21184    |   0.0 |  0.05
Other   |            | 0.01453    |            |       |  0.00

Nlocal:        2592.00 ave        2592 max        2592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:        3592.00 ave        3592 max        3592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:         0.00000 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:      123404.0 ave      123404 max      123404 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 123404
Ave neighs/atom = 47.609568
Neighbor list builds = 175
Dangerous builds = 97
Total wall time: 0:08:05

I wonder if this allows us to find out why the scaling is abnormal, as the percentage of time spent in Comm is similar in both cases.

This happens after it prints "Processing..." but before "Loaded Dataset"? If so, then yes it is failing to write out the preprocessed data file, which is in an efficient binary format but can be quite large due to the inclusion of full neighborlist information, which can be quite large for a big cutoff in a dense system.

This happened before the line "Loaded Dataset", as the first few lines in the log are:

Torch device: cuda
Processing dataset...
terminate called after throwing an instance of 'c10::Error'

Linux-cpp-lisp commented 1 year ago

20.3% CPU use with 1 MPI tasks x 1 OpenMP threads

^ from the second log

@nukenadal Are you sure it is using two GPUs? Have you checked nvtop while it's running, for example?

Linux-cpp-lisp commented 1 year ago

Regarding the OOM, that looks like preprocessing, yes. You can preprocess your dataset on a normal CPU node (where hopefully you can allocate more RAM) using nequip-benchmark on the same YAML config; see https://github.com/mir-group/nequip/issues/293, for example, for more discussion.

nukenadal commented 1 year ago

are you sure it is using two GPUs? have you checked nvtop while it's running, for example?

I found that I may not have properly set up the multi-GPU run. What command should I use to run on, say, 2 GPUs on the same node? Is it just the same, i.e. mpirun -n x lmp -in lmp.in?

anjohan commented 1 year ago

Yes (or possibly a launcher like srun, depending on your cluster)

nukenadal commented 1 year ago

Thank you so much for your help! The multi-GPU and OOM issues are both resolved.