Closed · bernstei closed this issue 2 years ago
We've studied this quite a bit with @Sideboard and @jameskermode and our conclusion was that the linear system is so ill-conditioned that even the smallest perturbation will change the weights quite considerably. The good thing is that the physics of the model does not seem to be sensitive. Our idea for a test was a Delta-type test, and those work very reliably.
And even for differences way larger than 1e-8, the model is close to equivalent.
Why is there any change? It should be bit-exact. My problem is the usual one - I then use the GAP to do minimization, from those make another GAP, etc, and eventually it's different. I agree that the physics isn't different, but it's harder to write a unit test. I can work around it, but why is it even an issue?
Can you change the test to something that uses `np.isclose()` or equivalent?
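A tolerance-based check along these lines could look like the sketch below; the energies and tolerances are made up for illustration, not taken from the actual test suite:

```python
import numpy as np

# Hypothetical predicted energies from two gap_fit runs with the same seed;
# the values and the 2e-9-level discrepancies are invented for illustration.
energies_run1 = np.array([-3.7401, -2.1150, -5.0923])
energies_run2 = np.array([-3.7401 + 2e-9, -2.1150 - 1e-9, -5.0923])

# np.allclose applies |a - b| <= atol + rtol * |b| elementwise,
# so it tolerates the last-bits noise that breaks bit-exact comparison.
assert np.allclose(energies_run1, energies_run2, rtol=1e-7, atol=1e-8)
```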
Is the seed being applied to only the sparse points, but not the jitter, or something?
> Why is there any change? It should be bit-exact. My problem is the usual one - I then use the GAP to do minimization, from those make another GAP, etc, and eventually it's different. I agree that the physics isn't different, but it's harder to write a unit test. I can work around it, but why is it even an issue?
I remember a discussion precisely on this on Slack - conclusion was floating point algebra isn't exact, that's why.
> I remember a discussion precisely on this on Slack - conclusion was floating point algebra isn't exact, that's why.
It's not exact, but it is deterministic (in the absence of OpenMP anyway - I guess I should check that it's disabled for this test, but I think it is).
Somewhat counterintuitively, the jitter is just a simple constant. If the sparse points are the same (and they should be), then it comes down to the way the covariance matrix is built, and then the solution of the linear system.
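For context, the "jitter" here is the small constant added to the diagonal of the covariance matrix to keep its factorization stable. A generic Gaussian-process sketch of the idea (not gap_fit's actual code; the kernel and jitter value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # 5 "sparse points" in 3 dimensions

# Squared-exponential kernel between points (generic GP choice).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# The jitter is a simple constant added to the diagonal; it regularizes
# the matrix so the Cholesky factorization does not fail numerically.
jitter = 1e-8
K_jittered = K + jitter * np.eye(len(X))

L = np.linalg.cholesky(K_jittered)  # succeeds on the jittered matrix
```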
Are the array operations in Fortran deterministic? Like addition, multiplication, sums etc? We make extensive use of those.
But @Sideboard built in a series of prints that can dump all kinds of intermediate matrices, so it is possible to check where things go awry.
I believe that without OpenMP they are deterministic, but I can investigate this. If you could tell me the interface to the intermediate print stuff (just verbosity?) it'd save me the time to add exactly that.
The alphas are similar, so I assume that the sparse points are the same.
Interesting - looks like maybe it is an OpenMP thing, because when I run twice manually it is reproducible exactly. If I convince myself of this, I'll close this issue.
Is this true also for turbosoap?
Don't know yet. But I currently think it's a false positive, and my fault for not fully disabling OpenMP.
For unit testing you could make the design matrix small enough, and the environments distinct enough, that the condition number is not huge; then you'd get better reproducibility.
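When constructing such a test system, one way to check the suggestion directly is to look at the condition number of a (hypothetical) small design matrix; a random well-separated matrix is used here just for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Small, generic stand-in for a design matrix with distinct rows.
A = rng.standard_normal((20, 5))

cond = np.linalg.cond(A)
print(f"condition number: {cond:.2e}")

# Rule of thumb: a perturbation of relative size eps in the data perturbs
# the least-squares solution by roughly cond * eps, so a modest condition
# number keeps 1e-16-level noise far below any reasonable test tolerance.
```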
It's already not bad, even with OpenMP - at the first 2 iterations I compare the predicted energies and they are within 1e-5. It's only that it builds on itself, and by the 3rd iteration the RSS minima are different, and then it falls apart. But as I said, I'm pretty sure I just failed to disable OpenMP everywhere (specifically in the run generating the reference data, which is eventually used by the final pytest, where OpenMP was disabled).
[added] the design matrix is small just so the fits are fast.
Looks like it's a subtle `np.dot()` issue, perhaps a bug. Apparently I get a slightly different dot product, therefore a different config energy_sigma, therefore a different GAP, and it all builds from there.
It's not a determinism issue, apparently; just different results from `np.dot(v1, v2)` and `np.sum(v1 * v2)` on different machines (a compute node generating the reference data and the head node running the final pytest).
Here's a script that shows the problem. On our older CPUs it prints 0.0, but on newer CPUs (with OpenBLAS, set to 1 thread) it prints a number of order 1e-16. The head node running pytest is new enough to give the 1e-16, while the node generating reference data was old enough to give 0.0.

```python
import numpy as np

p = np.asarray([16.90513420661038423986610723659396171569824218750000, 0.50000000000000000000000000000000000000000000000000, -1.71583683999999991875995419832179322838783264160156])
v = np.asarray([0.01628648030646704172874628113731887424364686012268, -0.17188353256547875269610869963798904791474342346191, -0.98498264035059979182307188239064998924732208251953])
val_dot = np.dot(v, p)
val_sum = np.sum(v * p)
np.show_config()
print("dot - sum", val_dot - val_sum)
```
I opened this https://github.com/xianyi/OpenBLAS/issues/3583. Once I'm 100% sure it's this, I'll close this issue, but we should all be careful with numpy, which uses OpenBLAS by default.
I remember having different results for identical Turbomole runs once (in the last digits). The reason was, as far as I remember, different math libraries on the nodes. I suppose you use a cluster with uniform nodes, so that shouldn't be a problem.
For testing gap_fit I also encountered the problem that even OpenMP runs with identical options (threads, chunk size) are non-deterministic, since floating-point addition is not associative and threads may finish faster or slower. This can be dampened by an explicit ordered reduction. I have a branch where I tested it in one spot. It works but makes the code more verbose, so the question is whether it's worth the effort. The restriction to identical OpenMP options would still apply.
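The non-associativity behind this is easy to demonstrate in a few lines; the values are arbitrary, chosen only so the rounding difference shows up:

```python
# Floating-point addition is commutative but not associative, so summing
# the same terms in a different order (as OpenMP threads effectively do)
# can change the last bits of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left, right)          # differ in the last bit
print(abs(left - right))    # tiny, ~1e-16-scale discrepancy
```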
Sorry, I should have closed this. Serial `gap_fit` is deterministic. I agree that it may be possible to make OpenMP and/or MPI deterministic as well (I know I did something like that a few years ago for CP2K with ScaLAPACK or PBLAS or something), but I don't think it's worth the effort for what I need. My real problem is that `np.dot` is architecture dependent when using OpenBLAS (at least the version provided by conda), because on AVX-512 machines it uses an operation that is slightly more accurate than the default behavior. That was affecting my GAP multistage fit script, and therefore giving slightly different potentials.
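The same effect can be seen without BLAS at all: NumPy's `np.sum` uses pairwise summation, a plain Python loop accumulates strictly left to right, and `math.fsum` returns the correctly rounded result, so the three can differ in the last bits even on one machine. A quick sketch (the random vector is just a stand-in):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(10**5)

loop_sum = 0.0
for x in v:                    # naive left-to-right accumulation
    loop_sum += x

pairwise = float(np.sum(v))    # NumPy's pairwise summation
exact = math.fsum(v)           # correctly rounded reference

# All three agree to far better than any physically meaningful tolerance,
# but the summation order determines the exact last bits.
print(abs(loop_sum - exact), abs(pairwise - exact))
```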
I am relying on `gap_fit` being reproducible, and it doesn't seem to be, even if I pass `seed`. Any idea why? It's at the 1e-8 level, but through a feedback process it breaks my unit tests. I should be able to provide a fitting database and a command line as an example, so I'll try to package that.