QUIP/LAMMPS different energies for different number of cores

robertstella111 commented 2 months ago

Hello,

I am attempting to create a GAP potential for silicon and carbon. I have installed the QUIP library with the architecture linux_x86_64_gfortran. For the training of the GAP I am using the following command

gap_fit energy_parameter_name=energy force_parameter_name=forces do_copy_at_file=F sparse_separate_file=T gp_file=GAP.xml at_file=train.xyz gap={distance_Nb order=2 cutoff=4.5 n_sparse=15 covariance_type=ard_se delta=5 theta_uniform=2.0 sparse_method=uniform compact_clusters=T : distance_Nb order=3 cutoff=3.5 n_sparse=200 covariance_type=ard_se delta=0.3 theta_uniform=2.0 sparse_method=uniform compact_clusters=T : soap atom_sigma=0.5 l_max=8 n_max=8 cutoff=4.5 cutoff_transition_width=1.0 delta=0.1 covariance_type=dot_product n_sparse=2000 zeta=4} default_sigma={0.005 0.08 0.0 0.0} config_type_sigma={dimer:0.0008:0.02:0.0:0.0:single:0.0001:0.005:0.0:0.0:perfect_crystall:0.00001:0.01:0.0:0.0:interstitials:0.0002:0.05:0.0:0.0} sparse_jitter=1e-8 gp_file=GAP.xml

I now want to run bigger simulations in lammps. For lammps I tried several versions (lammps_stable_sep_2021, stable_5Jun2019,stable_29Aug2024) and installed the QUIP pair_style by using cmake directly and also by referring to the libquip.a library.

When I now run lammps with mpi on mulitple cores and on a single core the results for the energies are quite different. For a cell of 385 atoms, the energy difference is around 3eV. The command for the pair_style looks as follows

pair_coeff GAP/29_08/without_stress/GAP.xml "Potential xml_label=GAP_2024_8_30_120_8_42_16_201" 14 6

If I now change all silicon atoms to carbon or all carbon to silicon the results for the mpi and single core runs are the same. I also encountered the problem for bigger cells and smaller cells, with defects and without defects. The problem with different results for different numbers of cores is also not present for all GAP potentials I have trained so far. The one above is just an example of where the error occurs.

Does someone have an idea where the error comes from and if this will influence the results I get from longer and more realistic runs (physical wise)?

log_serial.txt log_parallel.txt training.txt h_110_1.txt

bernstei commented 2 months ago

Really sounds like a bug in the LAMMPS-QUIP interface. Have you compared to using quippy.potential.Potential to evaluate the system, to see which value, if any, is correct?

albapa commented 2 months ago

Can you try the same experiment with the SW parameter file at QUIP/share/Parameters/ip.parms.SW_SiC_CASTEP_elastic.xml? You'd need something like

pair_coeff * * ip.parms.SW_SiC_CASTEP_elastic.xm "IP SW" 14 6

robertstella111 commented 2 months ago

If I calculate the energy for the system with the following command

quip E=T atoms_filename=test.xyz param_filename=GAP.xml | grep AT | sed 's/AT//' > gap.xyz

I get an energy of -2768.55 eV. This result is closer to the single core result in lammps of -2768.9 eV. The parallel run in lammps yields -2771.475 eV. The output files are attached in the initial question. I used the initial configuration. If I use the same type of atoms all three energies are the same.

gap_output.txt

When I use the same lammps input script with the SW parameter file ip.parms.SW_SiC_CASTEP_elastic.xml I get no core energy dependence. This is also something I recognized for my trained potentials. Some of them have a core dependence and some don't.

Do you think this has something to do with QUIP version and I could get rid of it by installing an older version because I don't think it has something to do with the lammps version.

Thank you in advance. log_serial.txt log_parallel.txt

bernstei commented 2 months ago

Those are way larger errors than can be explained by anything other than a bug. If we can get a complete setup to reproduce the issue that'd help. Can you can reproduce this behavior with a GAP xml (+sparsepoints) files we have access to (e.g. one of the published publicly available ones), or are you willing to share the one you created?

Also, can you figure out how small a system you can reproduce this issue with?

robertstella111 commented 2 months ago

I uploaded all GAP files in the following git repository

https://github.com/robertstella111/SiC_GAP_bug.git

Do you also need the trainingsdata or is this enough? I have trained the potential with the commands written in the previous messages.

For system sizes of 5x5x10 (48 atoms) Angstrom the error is still there, but also scaled down as the system size.

bernstei commented 2 months ago

Well, I've definitely reproduced the error, FWIW. I'll see if I can figure out what's going on.

bernstei commented 2 months ago

@albapa @jameskermode do you remember how the lammps interface is supposed to behave with atom_local_mask? In particular, are atoms with that mask set to False supposed to contribute to the total energy?

albapa commented 2 months ago

I don't think so. I think we should only calculate descriptors of atoms with atom_local_mask==.true. and derivatives with respect to all, regardless of the mask. So only these atoms get a local energy, which is then accumulated by LAMMPS.

bernstei commented 2 months ago

Something weird is going on. I have two input files, one of which lammps serial and mpi agree, and a slightly different one (change one atom type) where they do not. Neither agrees with pure QUIP. And in both cases atoms with local mask false have non-zero energy, which appears to contribute to the LAMMPS reported PE. More debugging in a few hours, but I have to go out now.

bernstei commented 2 months ago

And if I only add the ones where the mask is True, I get energies that are very different.

        out.bad.quip    TF -1933.0435
         out.bad.mpi    TF -1933.2943  T only -1723.0396
      out.bad.serial    TF -1933.2522  T only -1776.5271

       out.good.quip    TF -1929.8993
        out.good.mpi    TF -1930.1078  T only -1720.6466
     out.good.serial    TF -1930.1078  T only -1773.8421

"good" is the input configuration where LAMMPS serial and mpi agree, "bad" is the one where they don't. "TF" is adding all the local energies, "T only" is only the ones where mask is True. These are manual sums of the local energies from QUIP, and the LAMMPS ones agree with the potential energy that LAMMPS prints out in its log file.

bernstei commented 2 months ago

Is there a chance that atom_mask is incorrectly implemented for some particular descriptor, which this GAP is using?

bernstei commented 2 months ago

A couple of weird things.

there's no e0 value for either 8 or 14
I tried to produce stripped down versions of the GAP.xml file with a subset of the descriptors. When I remove the two soaps (and reduce n_coordinate from 9 to 7), it runs. When I remove all the distance_NB, leaving only the 2 soaps, the xml fails to parse.

any guesses?

[added later] I can also read a 2b-only xml file, but not a 3b-only one.

albapa commented 2 months ago

I had the same idea - to doctor the XML to figure out which descriptor is causing the problem. Happy to look at it.

bernstei commented 2 months ago

I'm trying to debug the xml parsing. BTW, If I remove the SOAP descriptors the energy value mismatch is still there (i.e 2b+3b). If I do 2b only, there's no mismatch. I think that means there's definitely an issue with the 3b, but I can't prove there isn't one with soap yet.

robertstella111 commented 2 months ago

As far as I remember I recently trained a potential without the 3b descriptor and there was still a missmatch. I will check check that in the next couple of hours when I am home again. I will let you know afterwars and if I remembered that correctly.

bernstei commented 2 months ago

OK, figured out the xm parsing issue - the number of the coordinate is embedded in its label. Let me test further.

bernstei commented 2 months ago

2b seems fine (QUIP and LAMMPS serial and mpi agree). 3b is bad. Just soap also appears to be OK.

Note that 2b and 3b are distance_Nb order=, not native 2b/3b.

It is, frankly, not entirely obvious to me how one would handle the local mask for distance_Nb

albapa commented 2 months ago

It is, frankly, not entirely obvious to me how one would handle the local mask for distance_Nb

What I think should be done is to distribute the energy term in N equal parts and assign them to the constituent N atoms. Only the local atoms carry this contribution to LAMMPS.

I can have a look to see whether this is what happens in practice.

Did you test the forces? Are they consistent across the different tests?

bernstei commented 2 months ago

I haven't looked at forces at all. When happens now with distance_Nb is that every atom in the cluster gets some local energy. LAMMPS adds them up correctly. But for some reason they don't sum to the same thing in both serial and parallel, and neither sums to what quip gets. I'm going to try to label the local energies print inside QUIP with the LAMMPS identities, so I can at least attribute the energy to the original source of each ghost atom. I assume/hope that most of them will give the same atomic energies as quip is getting.

[for some reason I don't get any evidence of lmptag existing in the QUIP call, and I can't get any debugging outout from quip_lammps_wrapper] [never mind - I need to run from lammps to get lammps things to happen]

bernstei commented 2 months ago

Is it possible there's an issue with compact_clusters?

bernstei commented 2 months ago

sorry about the false alarms - there was a bug extracting a small cluster from the large file that really gives an error.

bernstei commented 2 months ago

OK, some real issues. When I look at 3-body contributions from particular triplets of atoms, and compare what the quip call and lammps call do, I see some differences. For example, in this particular geometry file (modified from the one above), for a particular triplet of atoms (73 217 265) and a particular descriptor (C-C-Si), with quip I get

tttt.quip:triplet sorted 73 217 265 triplet orig 73 217 265 E 0.006070448733738178 xStar ['3.1594797051795842', '1.8939224662994143', '5.0534005965051456']

and with lammps I get

tttt.serial:triplet sorted 73 217 265 triplet orig 73 265 217 E -0.004993706593420856 xStar ['1.8939224729981241', '3.1594796990820622', '5.0534005971059095']

Basically, it looks like it encounters this particular triplet with its atoms in a different order (triplet orig. triplet sorted is just so it's easier for me to grep), and ends up with different E contributions. I don't really understand the actual prediction code, so I don't know how this happens. I do see that the xStar is permuted, but not in the same way as the indices of the atoms in the triplet

bernstei commented 2 months ago

And the bottom line is that atom 73 ends up with non-trivially different energy contribution between quip and lammps serial. This contribution differs by about 10 meV, and overall there's 19 meV difference in the total site energy of atom 73.

There are maybe 10-20 such atoms overall, and a total energy difference of a fraction of an eV.

If there's anything else useful I can provide (like these xyz/lammps-data configs and lammps input files), let me know.

bernstei commented 2 months ago

Note that sometimes they encounter the same triplet multiple times, and I see something similar. If it hits the same set of atoms multiple times, if the atoms are in the same order in the triplet I see identical xStar and equal energy contributions, but if they are different, I see permuted xStar and different energy contributions. E.g. 73 217 259

tttt.quip:triplet sorted 73 217 259 triplet orig 73 217 259 E 0.0035773992912006454 xStar ['3.1594797051795842', '1.8894985253530610', '3.0973863825194861']
tttt.quip:triplet sorted 73 217 259 triplet orig 217 259 73 E 0.0027544627496734315 xStar ['3.0973863825194861', '3.1594797051795842', '1.8894985253530610']
tttt.quip:triplet sorted 73 217 259 triplet orig 259 217 73 E 0.003580516270096956 xStar ['3.0973863825194861', '1.8894985253530610', '3.1594797051795842']

vs.

tttt.serial:triplet sorted 73 217 259 triplet orig 73 259 217 E -0.0037150381945537767 xStar ['1.8894985269571967', '3.1594796990820622', '3.0973863820152188']
tttt.serial:triplet sorted 73 217 259 triplet orig 217 259 73 E 0.0027544628270209925 xStar ['3.0973863820152188', '3.1594796990820622', '1.8894985269571967']
tttt.serial:triplet sorted 73 217 259 triplet orig 259 217 73 E 0.003580516296070547 xStar ['3.0973863820152188', '1.8894985269571967', '3.1594796990820622']

bernstei commented 2 months ago

I don't really understand how this works, and they they might hit the same triplet multiple times, given that there's explicit handling of permutations inside the N-body descriptor, so I'm not sure where to go next.

albapa commented 2 months ago

I'll try to retrace my steps.

bernstei commented 2 months ago

I wonder if they issue has to do with permutations, and in particular how they affect kernels that involve multiple elements. If it's a C-C-Si 3-body kernel, are you only allowed to permute two atoms if they are both C?

albapa commented 2 months ago

Yes, this is exactly what I am looking at.

And yes, for a C-C-Si there are two permutations. I want to see if some problem sneaked in the way we treat that.

bernstei commented 2 months ago

This should get fixed by https://github.com/libAtoms/GAP/pull/91, right @albapa ?

albapa commented 2 months ago

This should get fixed by libAtoms/GAP#91, right @albapa ?

Yes! It will break backwards compatibility, but I think potentials fitted with earlier versions (distance_nb order>2) may be compromised as well, so will need to be refitted.

libAtoms / QUIP

QUIP/LAMMPS different energies for different number of cores #669