Hi @nukenadal,
Thanks for your interest in our code and for trying this new feature!
The `stress` branch is built so that its defaults are compatible with the current `develop` branch of `nequip`, soon to be released as 0.6.0. As a result, if your model is from an earlier version of `nequip`, you'll need to specify `pair_style allegro3232` on the `stress` branch instead of `pair_style allegro` to get the compatible version.
Hopefully this resolves the issue, and please let me know if you observe any issues or suspicious results when using the `stress` branch!
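For concreteness, the relevant lines of the LAMMPS input would look something like the sketch below; the deployed model filename and the element mapping are placeholders, not taken from this thread:

```
# Hedged sketch: pair settings on the stress branch for a model from an older nequip.
# Replace the model file and element list with your own deployed model and species order.
pair_style      allegro3232
pair_coeff      * * deployed_model.pth A B C
```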
Thank you so much for your rapid response!
The above problem is resolved with `pair_style allegro3232`. I tested an NpT MD run, but I found that the stresses are not behaving as I expected. I used the normal `npt` ensemble from LAMMPS as if I were using a classical potential, with the command: `fix 1 all npt temp 200 200 $(100*dt) tri 1.0 1.0 $(50*dt)`. The aim here is to simulate the structure under ambient pressure (1 bar, since I use `metal` units). Below is an example of my training dataset header:
```
Lattice="12.002029185229819 0.004064393052959 -0.051196621492764 -0.000000000004469 11.711580380388035 -0.008099807301418 0.000000000001293 -0.000000000000898 12.039917140248585"
Properties=species:S:1:pos:R:3:forces:R:3 energy=-1141.403858931822
stress="0.004046574721405 -0.001809377097797 0.000241330359958 -0.001809377097797 0.000898277454365 -0.001171032253305 0.000241330359958 -0.001171032253305 0.000988749363509"
free_energy=-1141.403858931822 pbc="T T T"
```
The stresses are saved in units of eV per cubic angstrom (eV/Å³). Then, in the MD production run, I had a printout like this:
```
Step CPU Time Temp Press Volume TotEng
0 0.00 0.000 200.000 50652.238 47241.633 -29903.285
50 58.76 0.025 266.807 1806.511 62650.361 -30039.377
100 97.48 0.050 421.660 5194.959 106511.009 -30309.355
150 120.55 0.075 385.755 1111.530 156182.512 -30445.874
200 136.21 0.100 316.264 851.025 203929.625 -30503.778
250 148.06 0.125 259.946 552.863 267116.349 -30550.031
300 157.38 0.150 217.007 369.768 342194.282 -30582.112
350 164.71 0.175 183.242 217.801 420573.110 -30603.389
400 170.68 0.200 167.577 125.765 497445.470 -30614.017
450 175.88 0.225 165.960 75.794 573206.403 -30616.751
500 180.61 0.250 172.086 64.869 652145.704 -30615.140
550 185.00 0.275 178.895 58.633 741854.160 -30612.851
600 189.05 0.300 184.556 49.415 842463.753 -30610.630
650 192.79 0.325 190.739 40.879 958897.820 -30608.741
700 196.31 0.350 191.660 37.834 1086485.734 -30607.029
750 199.63 0.375 196.898 30.968 1229353.201 -30605.731
800 202.77 0.400 199.433 25.713 1388276.031 -30604.834
850 205.78 0.425 200.112 20.940 1559556.374 -30604.274
900 208.66 0.450 200.412 19.134 1743025.867 -30603.748
950 211.45 0.475 200.821 16.344 1946113.740 -30603.529
1000 214.18 0.500 202.782 11.614 2161252.664 -30603.517
1050 216.84 0.525 200.039 10.945 2394540.916 -30603.602
1100 219.44 0.550 198.836 9.568 2640828.010 -30603.645
1150 221.99 0.575 199.692 8.230 2906593.920 -30603.578
```
The pressure was initially quite high and then relaxed towards 1 bar. I was wondering whether any additional setting is required to reconcile the stresses from the FF with how LAMMPS handles the pressure of the cell? I used the main branches of nequip 0.5.6 and Allegro 0.2.0, if that is relevant.
Thank you!
Hi @nukenadal, you can try the `para_stress` branch of nequip and the `stress` branch of pair_allegro, and train with a config.yaml that includes `ParaStressForceOutput` instead of `StressForceOutput` for your network. Details of the changes can be seen in PR_pair_allegro and PR_nequip, with more discussion in this issue. It works well on my side and I hope it helps.
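Roughly, the only change in the training config is the builder name. A minimal sketch, assuming the usual `model_builders` layout from the nequip/allegro example configs (the surrounding builders are illustrative, not taken from this thread):

```yaml
# Hedged excerpt of config.yaml; the only point here is the builder swap.
model_builders:
  - allegro.model.Allegro
  - PerSpeciesRescale
  - ParaStressForceOutput   # instead of StressForceOutput
  - RescaleEnergyEtc
```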
@nukenadal: have you verified with a test set / validation set, such as with `nequip-evaluate`, that your model's stress predictions are indeed correct?
It's hard for us to say without more details on the system, training data, and LAMMPS input file...
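For example, roughly along these lines (paths are placeholders; adapt to your own run directory and test set):

```bash
# Evaluate the trained model on held-out data and report the configured metrics
# (forces, energies, and stress if it was included in the metrics/loss).
nequip-evaluate --train-dir path/to/your/training/run \
                --output test_predictions.xyz
# Optionally point it at a separate test set:
#   --dataset-config my_test_dataset.yaml
```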
Sorry, I found that the above MD calculation had a mismatched FF and input structure. That is probably why the structure disintegrated so rapidly. I retrained the FF with more data and epochs; it gives the following test results from `nequip-evaluate`. The MAE of the stresses corresponds to approx. 384 bar. Given that typical stresses in this system are about 50 kbar, I wonder if this level of accuracy can be considered sufficiently good?
```
f_mae = 0.010270
f_rmse = 0.014187
stress_mae = 0.000240
stress_rmse = 0.000424
e_mae = 0.032459
e/N_mae = 0.000338
```
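For reference, the conversion I used for the 384 bar figure, taking the reported stress metrics to be in the same eV/Å³ units as my training data:

```
1 eV/Å³ ≈ 1.602e6 bar
stress_mae ≈ 0.000240 eV/Å³ × 1.602e6 bar/(eV/Å³) ≈ 384 bar
```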
Moreover, I tried re-compiling LAMMPS because I accidentally wiped out the `build` folder. With the same modules and package versions (also with allegro3232 enabled), the MD production gives the following errors:
```
[cx3-11-8:2563622:0:2563622] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:2563622) ====
 0 0x0000000000012c20 .annobin_sigaction.c() sigaction.c:0
 1 0x0000000004c3b0c4 torch::jit::InterpreterStateImpl::callstack() ???:0
 2 0x0000000004c3cbf1 torch::jit::InterpreterStateImpl::handleError() ???:0
 3 0x0000000004c4ab85 torch::jit::InterpreterStateImpl::runImpl() ???:0
 4 0x0000000004c3604f torch::jit::InterpreterState::run() ???:0
 5 0x0000000004c264da torch::jit::GraphExecutorImplBase::run() ???:0
 6 0x00000000048adf8e torch::jit::Method::operator()() ???:0
 7 0x00000000007179ac torch::jit::Module::forward() /home/libtorch-new/include/torch/csrc/jit/api/module.h:114
 8 0x0000000000720ee1 LAMMPS_NS::PairAllegro<(Precision)0>::compute() /home/lammps-stable_29Sep2021/src/pair_allegro.cpp:442
 9 0x000000000055b366 LAMMPS_NS::Verlet::run() /home/lammps-stable_29Sep2021/src/verlet.cpp:312
10 0x00000000005163fb LAMMPS_NS::Run::command() /home/lammps-stable_29Sep2021/src/run.cpp:180
11 0x0000000000468e4a LAMMPS_NS::Input::execute_command() /home/lammps-stable_29Sep2021/src/input.cpp:794
12 0x0000000000469553 LAMMPS_NS::Input::file() /home/lammps-stable_29Sep2021/src/input.cpp:273
13 0x0000000000444b28 main() /home/lammps-stable_29Sep2021/src/main.cpp:98
14 0x0000000000023493 __libc_start_main() ???:0
15 0x000000000044502e _start() ???:0
=================================
/var/spool/PBS/mom_priv/jobs/7031777.pbs.SC: line 11: 2563622 Segmentation fault /home/lammps-stable_29Sep2021/build/lmp -in ./npt_org_perov.in > logfile_$PBS_JOBID
```
I tested multiple combinations of packages but this issue remains. I was wondering if you have any suggestions on this type of error? Thanks so much!
> Sorry, I found that the above MD calculation had a mismatched FF and input structure.
👍
> The MAE of the stresses corresponds to approx. 384 bar. Given that typical stresses in this system are about 50 kbar, I wonder if this level of accuracy can be considered sufficiently good?
This is an error of less than 1%, which sounds excellent; in an absolute sense it also looks good to me... @simonbatzner ?
Regarding the segfault, in general we've seen this either in certain cases if you run out of memory, or if there is an empty simulation domain in a parallel LAMMPS simulation... do either of these sound applicable?
With a few tests, I found that this is actually an out-of-memory error. I reduced the FF settings (from 4 layers to 2 layers), and the production run now completes without hitting the memory limit and gives meaningful trajectories. Thank you so much for the valuable advice!
An extra point, though not quite relevant:
```
Traceback (most recent call last):
  File "/home/.local/lib/python3.9/site-packages/torch/serialization.py", line 380, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/home/.local/lib/python3.9/site-packages/torch/serialization.py", line 604, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device
```
I wonder if this is telling me that I don't have enough storage space for some temporary data, or whether it is a memory error like the one above?
Hi @nukenadal ,
> With a few tests, I found that this is actually an out-of-memory error. I reduced the FF settings (from 4 layers to 2 layers), and the production run now completes without hitting the memory limit and gives meaningful trajectories.
Great! This should also be possible to resolve, then, by running on more GPUs with the same model.
The MD production with 1 GPU is actually faster than with 2 GPUs. I am not sure whether this is related to my network settings, making it not very scalable, or whether it is somehow a feature of this experimental version?
You still have ~2500 atoms like your original post? And you mean two GPUs, and not two GPU nodes? You are measuring this through the LAMMPS performance summary printed at the end, and if so, how does the time percentage spent in "Comm" change for you from one to two GPUs?
> I also tried this workflow on my local workstation. I got the following error when trying to load the dataset for training: ... OSError
This happens after it prints "Processing..." but before "Loaded Dataset"? If so, then yes, it is failing to write out the preprocessed data file, which is in an efficient binary format but can be quite large due to the inclusion of full neighborlist information, especially for a big cutoff in a dense system.
Hi,
> You still have ~2500 atoms like your original post? And you mean two GPUs, and not two GPU nodes? You are measuring this through the LAMMPS performance summary printed at the end, and if so, how does the time percentage spent in "Comm" change for you from one to two GPUs?
I used about 2600 atoms again. I was comparing 1 GPU on 1 node against 2 GPUs on 1 node, measuring the speed via the time taken for the same 2000 MD steps. I got the following results:
For 1 GPU:
```
Loop time of 460.862 on 1 procs for 2000 steps with 2592 atoms
Performance: 0.187 ns/day, 128.017 hours/ns, 4.340 timesteps/s
99.5% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 458.13     | 458.13     | 458.13     |   0.0 | 99.41
Neigh   | 2.0769     | 2.0769     | 2.0769     |   0.0 |  0.45
Comm    | 0.047307   | 0.047307   | 0.047307   |   0.0 |  0.01
Output  | 0.39902    | 0.39902    | 0.39902    |   0.0 |  0.09
Modify  | 0.19997    | 0.19997    | 0.19997    |   0.0 |  0.04
Other   |            | 0.01258    |            |       |  0.00
Nlocal: 2592.00 ave 2592 max 2592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 3591.00 ave 3591 max 3591 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0.00000 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 123132.0 ave 123132 max 123132 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 123132
Ave neighs/atom = 47.504630
Neighbor list builds = 174
Dangerous builds = 96
Total wall time: 0:07:58
```
And for 2 GPUs:
```
Loop time of 459.87 on 1 procs for 2000 steps with 2592 atoms
Performance: 0.188 ns/day, 127.742 hours/ns, 4.349 timesteps/s
20.3% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 457.31     | 457.31     | 457.31     |   0.0 | 99.44
Neigh   | 2.0862     | 2.0862     | 2.0862     |   0.0 |  0.45
Comm    | 0.051038   | 0.051038   | 0.051038   |   0.0 |  0.01
Output  | 0.19311    | 0.19311    | 0.19311    |   0.0 |  0.04
Modify  | 0.21184    | 0.21184    | 0.21184    |   0.0 |  0.05
Other   |            | 0.01453    |            |       |  0.00
Nlocal: 2592.00 ave 2592 max 2592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 3592.00 ave 3592 max 3592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0.00000 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 123404.0 ave 123404 max 123404 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 123404
Ave neighs/atom = 47.609568
Neighbor list builds = 175
Dangerous builds = 97
Total wall time: 0:08:05
```
I wonder if this helps us find out why the scaling is abnormal, as the percentage of time spent in Comm was similar in both cases.
> This happens after it prints "Processing..." but before "Loaded Dataset"? If so, then yes, it is failing to write out the preprocessed data file, which is in an efficient binary format but can be quite large due to the inclusion of full neighborlist information, especially for a big cutoff in a dense system.
This happened before the line "Loaded Dataset", as the first few lines in the log are:
```
Torch device: cuda
Processing dataset...
terminate called after throwing an instance of 'c10::Error'
```
> 20.3% CPU use with 1 MPI tasks x 1 OpenMP threads

^ from the second log
@nukenadal are you sure it is using two GPUs? Have you checked `nvtop` while it's running, for example?
Regarding OOM, that looks like preprocessing, yes. You can preprocess your dataset on a normal CPU node (where hopefully you can allocate more RAM) using `nequip-benchmark` on the same YAML config; see https://github.com/mir-group/nequip/issues/293, for example, for more discussion.
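For example, something like the following on the CPU node (the config path is a placeholder; the point is just that this command runs the same dataset preprocessing and caches the processed data for the later GPU run):

```bash
# Run the dataset preprocessing (and a speed benchmark) on a node with plenty of RAM;
# the processed dataset is cached on disk and reused by nequip-train afterwards.
nequip-benchmark path/to/your-config.yaml
```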
> are you sure it is using two GPUs? Have you checked `nvtop` while it's running, for example?
I found that I may not have properly set up the multi-GPU run. I wonder what command I should use to call, say, 2 GPUs on the same node? Is it just the same as `mpirun -n x lmp -in lmp.in`?
Yes (or possibly a launcher like `srun`, depending on your cluster).
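For example, something like the sketch below, assuming both GPUs are visible to the job and one MPI rank drives one GPU (the binary and input file paths are the ones from your log; adjust as needed):

```bash
# Two MPI ranks on one node, one GPU per rank:
mpirun -np 2 /home/lammps-stable_29Sep2021/build/lmp -in ./npt_org_perov.in > logfile

# On a SLURM cluster, the equivalent might be:
#   srun --ntasks=2 --gpus-per-task=1 lmp -in ./npt_org_perov.in
```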
Thank you so much for your help! Both the multi-GPU and the OOM issues are now resolved.
Hi Allegro developers,
Thank you for making the update on the stress branch of pair_allegro. I am trying to install this new branch to allow stress prediction in my MD production. Previously I used the main branch and made several successful NVT MD runs, but with this stress branch some errors occurred when producing NpT MD trajectories.
I used the following versions of the packages:
- torch 1.11.0
- libtorch 1.13.0+cu116 (since the documentation says to avoid 1.12)
- CUDA 11.4.2
- cuDNN 8.5.0.96
LAMMPS was successfully compiled according to the documentation, and the FF was trained with stresses included. In the following MD run, I got this LAMMPS output:
It stopped with this error:
I wonder if this is because I used a combination of packages that the new stress branch is not compatible with?
Thank you so much.