libAtoms / QUIP

libAtoms/QUIP molecular dynamics framework: https://libatoms.github.io

Random behavior of gap_fit leading to NaNs #515

Closed (closed by mcaroba 2 years ago)

mcaroba commented 2 years ago

Hi,

We added some new config types to one of the databases we're using to train a CH potential, and this triggered NaNs appearing in the alphas. As far as I can tell this is completely random and everything I did to track the possible origin doesn't check out. The behavior is so random that when I change the name of the XYZ file the NaNs may go away. Minimal working example to reproduce the issue:

  1. Unzip the attached files
  2. Run ./train.sh
  3. Check for NaNs in CH.xml
  4. Do mv h_press_mix.xyz db.xyz and change the atoms_filename keyword in train.sh to db.xyz
  5. Run ./train.sh
  6. Check for NaNs in CH.xml - they're gone! What???
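As an aside, the NaN check in steps 3 and 6 is easy to script. A minimal sketch, assuming the bad alphas appear literally as "NaN" in the XML (the exact case may vary by compiler):

```python
import re

def has_nan(text):
    # gap_fit writes invalid alphas literally as "NaN" (case may vary by compiler)
    return bool(re.search(r"\bnan\b", text, flags=re.IGNORECASE))

def xml_has_nan(path):
    # steps 3/6 above: scan the output XML for NaN values
    with open(path) as f:
        return has_nan(f.read())
```

Usage after running train.sh: xml_has_nan("CH.xml").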

files.zip

Can someone please explain to me what is going on? I set rnd_seed=10 to avoid the effects of randomness on the results. I am very confused about this, and somehow feel that it's either something really obvious that I have become blind to after running the example a million times, or something highly non-trivial.

mcaroba commented 2 years ago

@albapa @jameskermode This is what I was talking about.

jameskermode commented 2 years ago

Runs fine with no NaNs for me with released gap_fit from pip, versions 0.9.8 and 0.9.9 (latest). What versions of QUIP and GAP are you using? Latest Git? Can you give the SHA1 hashes?

mcaroba commented 2 years ago

@jameskermode Yes, I pulled it just today from git. Sorry, I don't know how to get the SHA1 hash; this is the version: http://github.com/libatoms/quip.git,608d2a1b4-dirty (git clone --recursive). But we got NaNs across a number of versions and platforms. I am pasting a bigger database here; maybe that triggers the NaNs for you? db.tar.gz

jameskermode commented 2 years ago

Bigger example still works fine for me with the pip version. Will try the git version: the SHA1 hash is just the 608d2a1b4 you pasted.

jameskermode commented 2 years ago

Still fine with that git version of QUIP and linked submodules. I'm using linux_x86_64_gfortran_openmp with gfortran 8.3 and OpenBLAS 0.3.7. Are you doing something significantly different? Here's the resulting XML, sparseX and logfile with your larger database. files.zip

If you want to debug locally I suggest recompiling with -ffpe-trap=invalid in the F95FLAGS and then gap_fit should stop at the first NaN; if you run inside gdb you can get a backtrace of where the problem first occurs.
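(For completeness: when poking at the same data from Python/quippy, numpy's errstate gives a rough analogue of -ffpe-trap=invalid, raising at the first invalid floating-point operation instead of silently propagating NaN. A sketch:)

```python
import numpy as np

def first_invalid_op_demo():
    # analogue of -ffpe-trap=invalid: raise on the first invalid FP operation
    # instead of silently producing NaN and carrying it through the fit
    with np.errstate(invalid="raise"):
        try:
            np.sqrt(np.array(-1.0))  # invalid operation
        except FloatingPointError:
            return "trapped"
    return "not trapped"
```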

mcaroba commented 2 years ago

We got problems with regular BLAS + gfortran and Intel MKL + ifort. I will do those checks and try with OpenBLAS. Hopefully we can isolate the issue.

mcaroba commented 2 years ago

Found it! It seems to be a bug in gp_predict.f95 in the definition of the COVARIANCE_PP covariance type, when the gradients are computed. The issue is on line 2259 of that file:

        if( ( present(grad_Covariance_i) .or. present(grad_Covariance_j) ) .and. (r_ij > 0.0_dp) ) then
           grad_covariancePP_ij = grad_covariancePP(r_ij,PP_Q, this%d) / r_ij
           !xI_xJ(:) = x_i(this%permutations(:,i_p)) - x_j(:)

           if(present(grad_Covariance_i)) &
              grad_Covariance_i(this%permutations(:,i_p)) = grad_Covariance_i(this%permutations(:,i_p)) + grad_covariancePP_ij * xI_xJ(:)

           if(present(grad_Covariance_j)) &
              grad_Covariance_j(:) = grad_Covariance_j(:) - grad_covariancePP_ij * xI_xJ(:)
        endif

The definition of xI_xJ is commented out in that line, but the variable is subsequently used to compute grad_Covariance_i and grad_Covariance_j. I do not know why it's commented out (should it always be zero? any thoughts @albapa?), but when optimization flags are used there is no guarantee that the undefined variable will be zero: it can contain whatever random bits happen to be in memory at the time.

Does this analysis make sense to you guys?

@albapa What should be done here? Can this be uncommented or replaced by something else?
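To make the suspected bug concrete: for any radial kernel k(r) with r = |x_i - x_j|, the gradient w.r.t. x_i is k'(r)/r * (x_i - x_j), so the code genuinely needs the difference vector that the commented-out line defines. A sketch with a squared-exponential stand-in for covariancePP (the stand-in is an assumption, not QUIP's actual PP form), checked against finite differences:

```python
import numpy as np

def k(r, theta=1.0):
    # stand-in radial kernel (squared exponential), NOT QUIP's covariancePP
    return np.exp(-0.5 * (r / theta) ** 2)

def dk_dr(r, theta=1.0):
    # analytic derivative of the stand-in kernel w.r.t. the distance r
    return -(r / theta**2) * np.exp(-0.5 * (r / theta) ** 2)

def grad_wrt_xi(x_i, x_j):
    # chain rule: d k(|x_i - x_j|) / d x_i = k'(r)/r * (x_i - x_j)
    # -- the (x_i - x_j) factor is exactly the xI_xJ line in question
    r = np.linalg.norm(x_i - x_j)
    return dk_dr(r) / r * (x_i - x_j)

# finite-difference check that the difference vector is really needed
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
eps = 1e-6
g_fd = np.array([(k(np.linalg.norm(x_i + eps * e - x_j))
                  - k(np.linalg.norm(x_i - eps * e - x_j))) / (2 * eps)
                 for e in np.eye(3)])
assert np.allclose(grad_wrt_xi(x_i, x_j), g_fd, atol=1e-8)
```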

albapa commented 2 years ago

Good job finding this! I'll take a look.

jameskermode commented 2 years ago

Yes, sounds very plausible. Did the -ffpe-trap=invalid help with tracking it down?

albapa commented 2 years ago

I have to say I don't see how it could have worked at all with this commented out... Can you try uncommenting it?

albapa commented 2 years ago

@gabor1 can you point me to where the previous repo was? I am trying to find the original commit to see if there was a particular reason to comment out this line.

jameskermode commented 2 years ago

https://github.com/libAtoms/GAP-private-archived

mcaroba commented 2 years ago

> I have to say I don't see how it could have worked at all with this commented out... Can you try uncommenting it?

@albapa When I uncomment it it runs without NaN, but I haven't checked that the numbers make sense. I have my own independent implementation of this covariance in TurboGAP and when I first implemented it I checked the numbers are the same as from QUIP. So unless the code has changed since, it must not affect the results.

> Yes, sounds very plausible. Did the -ffpe-trap=invalid help with tracking it down?

@jameskermode It helped somewhat, since it stopped the code execution precisely after attempting to build the angle_3b covariance matrix. I am naturally suspicious about the soap_turbo implementation so this definitely helped narrow down the problem. I added a bunch of debug flags to get the code line triggering the issue, which was not the one I highlighted, but a more general part of the code used to build covariance matrices (independent of the PP covariance type). From the info 3b + PP I knew where to look in the code for a possible bug.

albapa commented 2 years ago

> I have to say I don't see how it could have worked at all with this commented out... Can you try uncommenting it?

> @albapa When I uncomment it it runs without NaN, but I haven't checked that the numbers make sense. I have my own independent implementation of this covariance in TurboGAP and when I first implemented it I checked the numbers are the same as from QUIP. So unless the code has changed since, it must not affect the results.

The covariance and its derivative (w.r.t. the distance) should be okay, but this is definitely a bug in the derivative w.r.t. the individual descriptor components. With the commented-out (original) version I get a whole lot of nonsense...

mcaroba commented 2 years ago

> I have to say I don't see how it could have worked at all with this commented out... Can you try uncommenting it?

> @albapa When I uncomment it it runs without NaN, but I haven't checked that the numbers make sense. I have my own independent implementation of this covariance in TurboGAP and when I first implemented it I checked the numbers are the same as from QUIP. So unless the code has changed since, it must not affect the results.

> The covariance and its derivative (w.r.t. the distance) should be okay, but this is definitely a bug - and it is w.r.t. the individual descriptor components. With the commented out (original) version I get a whole lot of nonsense...

I have tested the effects of uncommenting this line versus setting it to zero, and it seems not to affect the results at prediction time (quip ... e f) with one of our existing potentials that incorporates this type of covariance. The results are also the same as with TurboGAP. I can't check training since TurboGAP can't do that. We have also trained many potentials with this bug and 1) they give very good results and 2) we only got NaNs this week and last, with this particular addition to our database. This all puzzles me: if the bug only affects training and not prediction, then our potentials should be affected. Could it be that the 2b and soap_turbo GAPs are able to compensate for the buggy 3b GAP when it gives numbers instead of NaNs? This sounds really unlikely to me.

albapa commented 2 years ago

It won't affect anything at prediction time - this code block is only used to build up the covariance matrix for the training.

Were your previous kernels also PP? Because there is another problem here: the gradient kernel isn't multiplied by delta**2...

mcaroba commented 2 years ago

> It won't affect anything at prediction time - this code block is only used to build up the covariance matrix.

> Were your previous kernels also PP? Because there is another problem here: the gradient kernel isn't multiplied by delta**2...

All our carbon potentials use this covariance for the 3b terms. We don't use 3b terms for other materials. Again, it puzzles me that we have been able to fit carbon GAP potentials which are significantly better than others if this bug significantly affects training. E.g., I choose delta = 0.01 for the 3b terms. If (implicitly) the training code uses delta = 1 (because the delta**2 factor is missing), then there should be a factor of 10000 disagreement between the 3b forces at training and at prediction time (I suppose 3b energies are unaffected since this is only in the gradient part of the code). Once you propose a fix for this I will retrain one of the carbon GAPs and check whether the force RMSEs are going down.
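The factor-of-10000 argument can be illustrated numerically. A sketch with a generic radial kernel derivative as a stand-in for grad_covariancePP (an assumption, not QUIP's actual form):

```python
import numpy as np

delta = 0.01  # the 3b delta mentioned above
r = 0.7       # an arbitrary descriptor-space distance

def grad_k(r):
    # unscaled radial kernel derivative -- a stand-in, NOT grad_covariancePP
    return -r * np.exp(-0.5 * r**2)

buggy   = grad_k(r)             # pre-fix: delta**2 factor missing
correct = delta**2 * grad_k(r)  # with the delta**2 scaling applied

# the training-time gradient kernel is too large by exactly 1/delta**2
assert np.isclose(buggy / correct, 1.0 / delta**2)  # = 10000 for delta = 0.01
```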

albapa commented 2 years ago

    diff --git a/gp_predict.f95 b/gp_predict.f95
    index b90566b..60ee7d4 100644
    --- a/gp_predict.f95
    +++ b/gp_predict.f95
    @@ -2255,8 +2255,8 @@ module gp_predict_module
                gpCoordinates_Covariance = gpCoordinates_Covariance + covariancePP_ij
    
                if( ( present(grad_Covariance_i) .or. present(grad_Covariance_j) ) .and. (r_ij > 0.0_dp) ) then
    -               grad_covariancePP_ij = grad_covariancePP(r_ij,PP_Q, this%d) / r_ij
    -               !xI_xJ(:) = x_i(this%permutations(:,i_p)) - x_j(:)
    +               grad_covariancePP_ij = this%delta**2 * grad_covariancePP(r_ij,PP_Q, this%d) / r_ij
    +               xI_xJ(:) = x_i(this%permutations(:,i_p)) - x_j(:)
    
                if(present(grad_Covariance_i)) &
                   grad_Covariance_i(this%permutations(:,i_p)) = grad_Covariance_i(this%permutations(:,i_p)) + grad_covariancePP_ij * xI_xJ(:)
albapa commented 2 years ago

Try this fix. I am not sure how the two unrelated bugs affected each other but I don't think any of the gradient part was particularly meaningful.

mcaroba commented 2 years ago

> Try this fix. I am not sure how the two unrelated bugs affected each other but I don't think any of the gradient part was particularly meaningful.

What about the following code for ..._ii and ..._jj? They are missing delta for both the covariance and the gradient. Should it be there?

albapa commented 2 years ago

That part only gets executed if the kernel is not permutationally invariant and therefore needs to be summed over the possible permutations, with a normalisation afterwards.

But even then it is correct: the multiplication by delta comes after the normalisation. If we multiplied those terms by delta, we'd negate the delta scaling of the normalised kernel. So I think those are correct.

mcaroba commented 2 years ago

> That part only gets executed if the kernel is not permutationally invariant, therefore needs to be summed up over the possible permutations, and needs normalisation after doing that.

> But even then, it is correct: the multiplication by delta comes after the normalisation - if we multiplied those terms with delta, we'd negate the delta scaling for the normalised kernel. So I think those are correct.

I don't get this last part, these kernels don't appear to be multiplied by delta after the normalization either.

albapa commented 2 years ago

You multiply before :)

mcaroba commented 2 years ago

> You multiply before :)

Sorry, I still don't get it. Here you set the covariance for those ..._ii terms:

           covariancePP_ii = covariancePP(r_ii,PP_Q, this%d)
           gpCoordinates_Covariance_ii = gpCoordinates_Covariance_ii + covariancePP_ii

which uses the covariancePP function which knows nothing about delta. After this there is a normalization step, but no multiplication by delta. So no delta before or after. Sorry if I'm being thick, I have to acknowledge I don't really know what's going on in this part of the code!

Anyway, I will patch the code with your fix and try to fit a production potential with a full database, see what happens.

albapa commented 2 years ago

(image attachment: expression for the normalised kernel)

albapa commented 2 years ago

if you multiply all Ks on the RHS by delta, it will cancel out! If you only multiply the numerator, you get the delta-scaling
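Numerically, the point is that only the explicit numerator factor survives the normalisation. A sketch with an arbitrary stand-in kernel (the kernel form and delta value here are illustrative, not QUIP's):

```python
import numpy as np

def K(x, y):
    # arbitrary unnormalised, delta-free kernel (illustrative only)
    return np.exp(-0.5 * np.sum((x - y) ** 2)) + 0.1

delta = 0.01
x, y = np.array([1.0, 2.0]), np.array([1.5, 1.0])
Kij, Kii, Kjj = K(x, y), K(x, x), K(y, y)

# delta applied once, to the numerator of the normalised kernel:
k_hat = delta**2 * Kij / np.sqrt(Kii * Kjj)

# multiplying *all* Ks by delta**2 cancels in the normalisation:
k_all = (delta**2 * Kij) / np.sqrt((delta**2 * Kii) * (delta**2 * Kjj))

assert np.isclose(k_all, Kij / np.sqrt(Kii * Kjj))  # delta dropped out
assert np.isclose(k_hat, delta**2 * k_all)          # numerator-only scaling survives
```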

mcaroba commented 2 years ago

> if you multiply all Ks on the RHS by delta, it will cancel out! If you only multiply the numerator, you get the delta-scaling

Thanks, now I get it. I thought the numerator was something else.

mcaroba commented 2 years ago

Ok, so I trained a new a-C potential with this fix, and for one of our standard RMSE tests the force errors go down by about 8% and the energy errors by less than 1%. I'll post more info as it becomes available. It appears that while this bug was introducing noise in the forces, the 2b and mb GAPs were flexible enough to compensate. Is there a pain-free way to check that the covariance matrix implementation matches the prediction implementation?

albapa commented 2 years ago

We'd need a unit test for this, but unfortunately we don't have anything like that yet.

What I did in the past was the following: using a simple database (e.g. a disturbed crystal) with an arbitrary set of target data (could be an analytic potential), I fitted a potential using only the kernel I am interested in and only the force information. I then predict energies and compare them to the target energies. If the gradient covariance matrix is correct, I get a correlation with a slope of 1 but an arbitrary intercept.
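This validation trick works for any model that is linear in its parameters. A toy sketch, with a polynomial basis standing in for the GAP kernel and sin(x) as a made-up target potential: fitting to derivatives only still recovers the values up to a constant, so predicted vs. target energies correlate with slope 1.

```python
import numpy as np

# target "potential" known to the fit only through its derivative (the "forces")
x = np.linspace(0.0, 3.0, 80)
f_true, df_true = np.sin(x), np.cos(x)

# polynomial basis WITHOUT a constant term: a constant shift is invisible
# to forces, which is why the predicted energies get an arbitrary intercept
powers = np.arange(1, 8)
Phi  = x[:, None] ** powers                 # basis function values
dPhi = powers * x[:, None] ** (powers - 1)  # basis function derivatives

# least-squares fit to derivative data only
c, *_ = np.linalg.lstsq(dPhi, df_true, rcond=None)
f_pred = Phi @ c

# predicted vs. target energies: slope ~ 1, intercept unconstrained
slope, intercept = np.polyfit(f_true, f_pred, 1)
assert abs(slope - 1.0) < 1e-3
```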

mcaroba commented 2 years ago

Is there a way to print the covariance matrix? If yes, I can compare it to that built with descriptors and gradients built with quippy. I only need a small database (1 structure) to test it this way.

albapa commented 2 years ago

@Sideboard wrote some nice print interfaces! Could you give us a pointer?

Sideboard commented 2 years ago

Should be linear_system_dump_file.

mcaroba commented 2 years ago

OK, it looks like this is fixed with @albapa 's patch. Feel free to close this issue after you push it (if you haven't done so already).

Below is a comparison of the covariance matrix elements as output by @Sideboard 's printing routine, comparing the polynomial and squared exponential kernels. I am using a 4:1 ratio for the hyperparameters, which is close to the optimal equivalence ratio for these two kernels. This looks good enough to me.
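For illustration only: the rough shape of that equivalence can be reproduced with a generic Wendland C2 compact polynomial standing in for QUIP's covariancePP (an assumption; PP_Q's exact form may differ), using the 4:1 length-scale ratio mentioned above:

```python
import numpy as np

theta = 1.0           # squared-exponential length scale
cutoff = 4.0 * theta  # the 4:1 hyperparameter ratio mentioned above

def k_se(r):
    return np.exp(-0.5 * (r / theta) ** 2)

def k_pp(r):
    # Wendland C2 compact polynomial -- a guess at the family, not PP_Q itself
    s = np.clip(np.asarray(r) / cutoff, 0.0, 1.0)
    return (1.0 - s) ** 4 * (4.0 * s + 1.0)

r = np.linspace(0.0, 5.0, 200)
corr = np.corrcoef(k_se(r), k_pp(r))[0, 1]
assert corr > 0.98  # the two kernels are nearly linearly related at this ratio
```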

(image: parity plot of covariance matrix elements, polynomial vs. squared exponential kernel)