robertstella111 opened 2 months ago
Really sounds like a bug in the LAMMPS-QUIP interface. Have you compared to using quippy.potential.Potential to evaluate the system, to see which value, if any, is correct?
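For reference, a minimal quippy sketch of that cross-check (assuming a quippy build with GAP support, and the GAP.xml / test.xyz file names used elsewhere in this thread):

```python
from ase.io import read
from quippy.potential import Potential

# file names taken from the commands quoted in this thread
atoms = read("test.xyz")
atoms.calc = Potential(param_filename="GAP.xml")

print("quippy total energy [eV]:", atoms.get_potential_energy())
try:
    # per-atom energies are handy for localising a discrepancy
    print("quippy local energies [eV]:", atoms.get_potential_energies())
except Exception:
    pass  # not every quippy build exposes per-atom energies
```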
Can you try the same experiment with the SW parameter file at QUIP/share/Parameters/ip.parms.SW_SiC_CASTEP_elastic.xml? You'd need something like
pair_coeff * * ip.parms.SW_SiC_CASTEP_elastic.xml "IP SW" 14 6
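And a sketch of the LAMMPS side of that test, driven through the LAMMPS Python module for convenience (this assumes a LAMMPS build with the QUIP package; the data file name sic.data and the type mapping 1=Si, 2=C are placeholders):

```python
from lammps import lammps

lmp = lammps()
lmp.commands_string("""
units metal
atom_style atomic
read_data sic.data          # placeholder data file, types 1=Si 2=C
pair_style quip
pair_coeff * * ip.parms.SW_SiC_CASTEP_elastic.xml "IP SW" 14 6
run 0
""")
print("LAMMPS potential energy [eV]:", lmp.get_thermo("pe"))
# repeat the same script under mpirun with several ranks and compare the number
```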
If I calculate the energy for the system with the following command
quip E=T atoms_filename=test.xyz param_filename=GAP.xml | grep AT | sed 's/AT//' > gap.xyz
I get an energy of -2768.55 eV. This is closer to the single-core result in lammps of -2768.9 eV. The parallel run in lammps yields -2771.475 eV. The output files are attached in the initial question; I used the initial configuration. If all atoms are of the same type, all three energies are the same.
When I use the same lammps input script with the SW parameter file ip.parms.SW_SiC_CASTEP_elastic.xml I see no dependence of the energy on the number of cores. This is also something I noticed for my trained potentials: some of them show a core-count dependence and some don't.
Do you think this has something to do with the QUIP version, and could I get rid of it by installing an older version? I don't think it is related to the lammps version.
Thank you in advance. log_serial.txt log_parallel.txt
Those are way larger errors than can be explained by anything other than a bug. If we can get a complete setup to reproduce the issue, that'd help. Can you reproduce this behavior with GAP xml (+sparsepoints) files we have access to (e.g. one of the published, publicly available ones), or are you willing to share the one you created?
Also, can you figure out how small a system you can reproduce this issue with?
I uploaded all GAP files in the following git repository
https://github.com/robertstella111/SiC_GAP_bug.git
Do you also need the training data or is this enough? I have trained the potential with the commands written in the previous messages.
For a system size of 5x5x10 Angstrom (48 atoms) the error is still there, but scaled down with the system size.
Well, I've definitely reproduced the error, FWIW. I'll see if I can figure out what's going on.
@albapa @jameskermode do you remember how the lammps interface is supposed to behave with atom_local_mask? In particular, are atoms with that mask set to False supposed to contribute to the total energy?
I don't think so. I think we should only calculate descriptors of atoms with atom_local_mask==.true., and derivatives with respect to all atoms, regardless of the mask. So only these atoms get a local energy, which is then accumulated by LAMMPS.
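To make the intended bookkeeping concrete, a toy sketch of that convention (purely illustrative, not the actual QUIP/LAMMPS wrapper code; the function and array names are made up):

```python
import numpy as np

def accumulate(local_energies, force_contribs, local_mask):
    """Toy version of the convention described above.
    local_energies: (N,) per-atom energies from the model,
    force_contribs: (N, 3) derivative contributions,
    local_mask:     (N,) bool, True for atoms the mask marks as local."""
    # only masked atoms carry a local energy ...
    e_total = np.sum(np.where(local_mask, local_energies, 0.0))
    # ... but derivatives are kept for all atoms, masked or not
    forces = force_contribs
    return e_total, forces

mask = np.array([True, True, False])            # third atom: mask == False
e, f = accumulate(np.array([-1.0, -1.1, -0.9]), np.zeros((3, 3)), mask)
print(e)                                        # -2.1, the ghost's -0.9 is dropped
```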
Something weird is going on. I have two input files, one for which LAMMPS serial and mpi agree, and a slightly different one (one atom type changed) where they do not. Neither agrees with pure QUIP. And in both cases atoms with local mask false have non-zero energy, which appears to contribute to the LAMMPS-reported PE. More debugging in a few hours, but I have to go out now.
And if I only add the ones where the mask is True, I get energies that are very different.
out.bad.quip TF -1933.0435
out.bad.mpi TF -1933.2943 T only -1723.0396
out.bad.serial TF -1933.2522 T only -1776.5271
out.good.quip TF -1929.8993
out.good.mpi TF -1930.1078 T only -1720.6466
out.good.serial TF -1930.1078 T only -1773.8421
"good" is the input configuration where LAMMPS serial and mpi agree, "bad" is the one where they don't. "TF" is adding all the local energies, "T only" is only the ones where mask is True. These are manual sums of the local energies from QUIP, and the LAMMPS ones agree with the potential energy that LAMMPS prints out in its log file.
Is there a chance that atom_mask is incorrectly implemented for some particular descriptor which this GAP is using?
A couple of weird things. If I cut descriptors out of the xml (taking n_coordinate from 9 to 7), it runs. When I remove all the distance_Nb descriptors, leaving only the 2 soaps, the xml fails to parse. Any guesses?
[added later] I can also read a 2b-only xml file, but not a 3b-only one.
I had the same idea - to doctor the XML to figure out which descriptor is causing the problem. Happy to look at it.
I'm trying to debug the xml parsing. BTW, if I remove the SOAP descriptors the energy value mismatch is still there (i.e. 2b+3b). If I do 2b only, there's no mismatch. I think that means there's definitely an issue with the 3b, but I can't prove there isn't one with soap yet.
As far as I remember, I recently trained a potential without the 3b descriptor and there was still a mismatch. I will check that in the next couple of hours when I am home again, and let you know whether I remembered correctly.
OK, figured out the xml parsing issue - the number of the coordinate is embedded in its label. Let me test further.
2b seems fine (QUIP and LAMMPS serial and mpi agree). 3b is bad. Just soap also appears to be OK.
Note that the 2b and 3b here are distance_Nb with order=2 and order=3, not the native 2b/3b descriptors.
It is, frankly, not entirely obvious to me how one would handle the local mask for distance_Nb.
What I think should be done is to distribute the energy term in N equal parts and assign them to the constituent N atoms. Only the local atoms carry this contribution to LAMMPS.
I can have a look to see whether this is what happens in practice.
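A toy sketch of that accumulation scheme (illustrative only, not the QUIP source; the N-equal-parts split and the mask handling are the point):

```python
import numpy as np

def accumulate_cluster_energies(clusters, cluster_energies, local_mask, n_atoms):
    """clusters: index tuples (pairs, triplets, ...), one energy each;
    local_mask: bool array, True for atoms local to this domain."""
    e_local = np.zeros(n_atoms)
    for idx, e in zip(clusters, cluster_energies):
        share = e / len(idx)        # split the cluster energy in N equal parts
        for i in idx:
            if local_mask[i]:       # only local atoms hand their share to LAMMPS
                e_local[i] += share
    return e_local

mask = np.array([True, True, False, True])   # atom 2 is non-local (ghost)
e_loc = accumulate_cluster_energies([(0, 1, 2), (1, 2, 3)], [0.3, -0.6], mask, 4)
print(e_loc, e_loc.sum())                    # the ghost atom carries nothing
```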
Did you test the forces? Are they consistent across the different tests?
I haven't looked at forces at all. What happens now with distance_Nb is that every atom in the cluster gets some local energy. LAMMPS adds them up correctly. But for some reason they don't sum to the same thing in serial and parallel, and neither sums to what quip gets. I'm going to try to label the local energies printed inside QUIP with the LAMMPS identities, so I can at least attribute the energy to the original source of each ghost atom. I assume/hope that most of them will give the same atomic energies as quip is getting.
[for some reason I don't get any evidence of lmptag existing in the QUIP call, and I can't get any debugging output from quip_lammps_wrapper] [never mind - I need to run from lammps to get lammps things to happen]
Is it possible there's an issue with compact_clusters?
Sorry about the false alarms - there was a bug in extracting a small cluster from the large file that really gives an error.
OK, some real issues. When I look at 3-body contributions from particular triplets of atoms, and compare what the quip call and lammps call do, I see some differences. For example, in this particular geometry file (modified from the one above), for a particular triplet of atoms (73 217 265) and a particular descriptor (C-C-Si), with quip I get
tttt.quip:triplet sorted 73 217 265 triplet orig 73 217 265 E 0.006070448733738178 xStar ['3.1594797051795842', '1.8939224662994143', '5.0534005965051456']
and with lammps I get
tttt.serial:triplet sorted 73 217 265 triplet orig 73 265 217 E -0.004993706593420856 xStar ['1.8939224729981241', '3.1594796990820622', '5.0534005971059095']
Basically, it looks like it encounters this particular triplet with its atoms in a different order (triplet orig; triplet sorted is just there so it's easier for me to grep), and ends up with a different E contribution. I don't really understand the actual prediction code, so I don't know how this happens. I do see that the xStar is permuted, but not in the same way as the indices of the atoms in the triplet.
And the bottom line is that atom 73 ends up with a non-trivially different energy contribution between quip and lammps serial. This contribution differs by about 10 meV, and overall there's a 19 meV difference in the total site energy of atom 73.
There are maybe 10-20 such atoms overall, and a total energy difference of a fraction of an eV.
If there's anything else useful I can provide (like these xyz/lammps-data configs and lammps input files), let me know.
Note that sometimes the same triplet is encountered multiple times, and I see something similar there: if the atoms appear in the same order within the triplet I get identical xStar and equal energy contributions, but if the order differs I get a permuted xStar and different energy contributions. E.g. for 73 217 259:
tttt.quip:triplet sorted 73 217 259 triplet orig 73 217 259 E 0.0035773992912006454 xStar ['3.1594797051795842', '1.8894985253530610', '3.0973863825194861']
tttt.quip:triplet sorted 73 217 259 triplet orig 217 259 73 E 0.0027544627496734315 xStar ['3.0973863825194861', '3.1594797051795842', '1.8894985253530610']
tttt.quip:triplet sorted 73 217 259 triplet orig 259 217 73 E 0.003580516270096956 xStar ['3.0973863825194861', '1.8894985253530610', '3.1594797051795842']
vs.
tttt.serial:triplet sorted 73 217 259 triplet orig 73 259 217 E -0.0037150381945537767 xStar ['1.8894985269571967', '3.1594796990820622', '3.0973863820152188']
tttt.serial:triplet sorted 73 217 259 triplet orig 217 259 73 E 0.0027544628270209925 xStar ['3.0973863820152188', '3.1594796990820622', '1.8894985269571967']
tttt.serial:triplet sorted 73 217 259 triplet orig 259 217 73 E 0.003580516296070547 xStar ['3.0973863820152188', '1.8894985269571967', '3.1594796990820622']
I don't really understand how this works, or why the same triplet is hit multiple times given that there's explicit handling of permutations inside the N-body descriptor, so I'm not sure where to go next.
I'll try to retrace my steps.
I wonder if the issue has to do with permutations, and in particular how they affect kernels that involve multiple elements. If it's a C-C-Si 3-body kernel, are you only allowed to permute two atoms if they are both C?
Yes, this is exactly what I am looking at.
And yes, for a C-C-Si there are two permutations. I want to see if some problem sneaked into the way we treat that.
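To illustrate what "two permutations" means for a mixed-species triplet, here is a toy numpy example; the squared-exponential kernel and the descriptor layout (d_CC, d_C1Si, d_C2Si) are stand-ins for the real distance_Nb machinery, not its implementation. Only the swap of the two C atoms is an allowed permutation, and averaging the kernel over the allowed permutations is what makes the prediction independent of the order in which the triplet is visited:

```python
import numpy as np

def se_kernel(x, y, theta=2.0):
    # toy squared-exponential kernel between two descriptor vectors
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * theta**2))

def symmetrised_kernel(x, y, perms):
    # average over the permutations allowed by the species pattern
    return np.mean([se_kernel(x, [y[i] for i in p]) for p in perms])

# descriptor layout: (d_CC, d_C1Si, d_C2Si) for a (C, C, Si) triplet
x_train   = (1.5, 2.0, 2.6)       # some training triplet
y         = (1.5, 2.1, 2.4)       # a query triplet, C atoms listed one way
y_swapped = (1.5, 2.4, 2.1)       # the same triplet with the two C atoms swapped

allowed = [(0, 1, 2), (0, 2, 1)]  # identity, and the C<->C swap (Si stays put)

print(se_kernel(x_train, y), se_kernel(x_train, y_swapped))    # order-dependent
print(symmetrised_kernel(x_train, y, allowed),
      symmetrised_kernel(x_train, y_swapped, allowed))         # identical
```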
This should get fixed by https://github.com/libAtoms/GAP/pull/91, right @albapa ?
Yes! It will break backwards compatibility, but I think potentials fitted with earlier versions (distance_Nb order>2) may be compromised as well, so will need to be refitted.
Hello,
I am attempting to create a GAP potential for silicon and carbon. I have installed the QUIP library with the architecture linux_x86_64_gfortran. For the training of the GAP I am using the following command
gap_fit energy_parameter_name=energy force_parameter_name=forces do_copy_at_file=F sparse_separate_file=T gp_file=GAP.xml at_file=train.xyz gap={distance_Nb order=2 cutoff=4.5 n_sparse=15 covariance_type=ard_se delta=5 theta_uniform=2.0 sparse_method=uniform compact_clusters=T : distance_Nb order=3 cutoff=3.5 n_sparse=200 covariance_type=ard_se delta=0.3 theta_uniform=2.0 sparse_method=uniform compact_clusters=T : soap atom_sigma=0.5 l_max=8 n_max=8 cutoff=4.5 cutoff_transition_width=1.0 delta=0.1 covariance_type=dot_product n_sparse=2000 zeta=4} default_sigma={0.005 0.08 0.0 0.0} config_type_sigma={dimer:0.0008:0.02:0.0:0.0:single:0.0001:0.005:0.0:0.0:perfect_crystall:0.00001:0.01:0.0:0.0:interstitials:0.0002:0.05:0.0:0.0} sparse_jitter=1e-8 gp_file=GAP.xml
I now want to run bigger simulations in lammps. For lammps I tried several versions (lammps_stable_sep_2021, stable_5Jun2019, stable_29Aug2024) and installed the QUIP pair_style both by using cmake directly and by linking against the libquip.a library.
When I now run lammps with mpi on multiple cores and on a single core, the resulting energies are quite different. For a cell of 385 atoms the energy difference is around 3 eV. The pair_coeff command for the quip pair_style looks as follows
pair_coeff * * GAP/29_08/without_stress/GAP.xml "Potential xml_label=GAP_2024_8_30_120_8_42_16_201" 14 6
If I now change all silicon atoms to carbon, or all carbon atoms to silicon, the mpi and single-core results are the same. I also encountered the problem for bigger and smaller cells, with and without defects. The dependence on the number of cores is also not present for all GAP potentials I have trained so far; the one above is just an example where the error occurs.
Does anyone have an idea where the error comes from, and whether it will influence the results I get from longer and more physically realistic runs?
log_serial.txt log_parallel.txt training.txt h_110_1.txt