Closed · bernstei closed this issue 2 years ago
We've studied this quite a bit with @Sideboard and @jameskermode and our conclusion was that the linear system is so ill-conditioned that even the smallest perturbation will change the weights quite considerably. The good thing is that the physics of the model does not seem to be sensitive. Our idea for a test was a Delta-type test, and those work very reliably.
And even for differences way larger than 1e-8, the model is close to equivalent.
Why is there any change? It should be bit-exact. My problem is the usual one - I then use the GAP to do minimization, from those make another GAP, etc, and eventually it's different. I agree that the physics isn't different, but it's harder to write a unit test. I can work around it, but why is it even an issue?
Can you change the test to something that uses `np.isclose()` or equivalent?
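A tolerance-based check along these lines could look like the sketch below; the energies and tolerances are made up for illustration, not taken from the actual test suite:

```python
import numpy as np

# Hypothetical predicted energies from two gap_fit runs with the same seed;
# the values and the 2e-9-level discrepancies are invented for illustration.
energies_run1 = np.array([-3.7401, -2.1150, -5.0923])
energies_run2 = np.array([-3.7401 + 2e-9, -2.1150 - 1e-9, -5.0923])

# np.allclose applies |a - b| <= atol + rtol * |b| elementwise,
# so it tolerates the last-bits noise that breaks bit-exact comparison.
assert np.allclose(energies_run1, energies_run2, rtol=1e-7, atol=1e-8)
```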
Is the seed being applied to only the sparse points, but not the jitter, or something?
> Why is there any change? It should be bit-exact. My problem is the usual one - I then use the GAP to do minimization, from those make another GAP, etc, and eventually it's different. I agree that the physics isn't different, but it's harder to write a unit test. I can work around it, but why is it even an issue?
I remember a discussion precisely on this on Slack - conclusion was floating point algebra isn't exact, that's why.
> I remember a discussion precisely on this on Slack - conclusion was floating point algebra isn't exact, that's why.
It's not exact, but it is deterministic (in the absence of OpenMP anyway - I guess I should check that it's disabled for this test, but I think it is).
Somewhat counterintuitively, the jitter is just a simple constant. If the sparse points are the same (and they should be), then it comes down to the way the covariance matrix is built, and then the solution of the linear system.
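For context, the "jitter" here is the small constant added to the diagonal of the covariance matrix to keep its factorization stable. A generic Gaussian-process sketch of the idea (not gap_fit's actual code; the kernel and jitter value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # 5 "sparse points" in 3 dimensions

# Squared-exponential kernel between points (generic GP choice).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# The jitter is a simple constant added to the diagonal; it regularizes
# the matrix so the Cholesky factorization does not fail numerically.
jitter = 1e-8
K_jittered = K + jitter * np.eye(len(X))

L = np.linalg.cholesky(K_jittered)  # succeeds on the jittered matrix
```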
Are the array operations in Fortran deterministic? Like addition, multiplication, sums etc? We make extensive use of those.
But @Sideboard built in a series of prints that can dump all kinds of intermediate matrices, so it is possible to check where things go awry.
I believe that without OpenMP they are deterministic, but I can investigate this. If you could tell me the interface to the intermediate print stuff (just verbosity?) it'd save me the time to add exactly that.
The alphas are similar, so I assume that the sparse points are the same.
Interesting - looks like maybe it is an OpenMP thing, because when I run twice manually it is reproducible exactly. If I convince myself of this, I'll close this issue.
Is this true also for turbosoap?
Don't know yet. But I currently think it's a false positive, and my fault for not fully disabling OpenMP.
For unit testing you could make the design matrix small enough, and the environments distinct enough, that the condition number is not huge; then you'd get better reproducibility.
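When constructing such a test system, one way to check the suggestion directly is to look at the condition number of a (hypothetical) small design matrix; a random well-separated matrix is used here just for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Small, generic stand-in for a design matrix with distinct rows.
A = rng.standard_normal((20, 5))

cond = np.linalg.cond(A)
print(f"condition number: {cond:.2e}")

# Rule of thumb: a perturbation of relative size eps in the data perturbs
# the least-squares solution by roughly cond * eps, so a modest condition
# number keeps 1e-16-level noise far below any reasonable test tolerance.
```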
It's already not bad, even with OpenMP - at the first 2 iterations I compare the predicted energies and they are within 1e-5. It's only that it builds on itself, and by the 3rd iteration the RSS minima are different, and then it falls apart. But as I said, I'm pretty sure I just failed to disable OpenMP everywhere (specifically in the run generating the reference data, which is eventually used by the final pytest, where OpenMP was disabled).
[added] the design matrix is small just so the fits are fast.
Looks like it's a subtle `np.dot()` issue, perhaps a bug. Apparently I get a slightly different dot product, therefore a different config energy_sigma, therefore a different GAP, and it all builds from there.
It's not a determinism issue, apparently; just different results from `np.dot(v1, v2)` and `np.sum(v1 * v2)` on different machines (a compute node generating the reference data and the head node running the final pytest).
Here's a script that shows the problem. On our older CPUs it prints 0.0, but on newer CPUs (with OpenBLAS, set to 1 thread) it prints a number of order 1e-16. The head node running pytest is new enough to give the 1e-16, while the node generating reference data was old enough to give 0.0.

```python
import numpy as np

p = np.asarray([16.90513420661038423986610723659396171569824218750000, 0.50000000000000000000000000000000000000000000000000, -1.71583683999999991875995419832179322838783264160156])
v = np.asarray([0.01628648030646704172874628113731887424364686012268, -0.17188353256547875269610869963798904791474342346191, -0.98498264035059979182307188239064998924732208251953])
val_dot = np.dot(v, p)
val_sum = np.sum(v * p)
np.show_config()
print("dot - sum", val_dot - val_sum)
```
I opened this https://github.com/xianyi/OpenBLAS/issues/3583. Once I'm 100% sure it's this, I'll close this issue, but we should all be careful with numpy, which uses OpenBLAS by default.
I remember having different results for identical Turbomole runs once (in the last digits). The reason was, as far as I remember, different math libraries on the nodes. I suppose you use a cluster with uniform nodes, so that shouldn't be a problem.
For testing gap_fit I also encountered the problem that even OpenMP runs with identical options (threads, chunk size) are non-deterministic, since floating-point addition is not associative and threads may finish faster or slower. This can be dampened by an explicit ordered reduction. I have a branch where I tested it in one spot. It works but makes the code more verbose, so the question is whether it's worth the effort. The restriction to identical OpenMP options would still apply.
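The non-associativity behind this is easy to demonstrate in a few lines; the values are arbitrary, chosen only so the rounding difference shows up:

```python
# Floating-point addition is commutative but not associative, so summing
# the same terms in a different order (as OpenMP threads effectively do)
# can change the last bits of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left, right)          # differ in the last bit
print(abs(left - right))    # tiny, ~1e-16-scale discrepancy
```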
Sorry, I should have closed this. Serial `gap_fit` is deterministic. I agree that it may be possible to make OpenMP and/or MPI deterministic as well (I know I did something like that a few years ago for CP2K with ScaLAPACK or PBLAS or something), but I don't think it's worth the effort for what I need. My real problem is that `np.dot` is architecture dependent when using OpenBLAS (at least the version provided by conda), because on AVX-512 machines it uses an operation that is slightly more accurate than the default behavior. That was affecting my GAP multistage fit script, and therefore giving slightly different potentials.
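The same effect can be seen without BLAS at all: NumPy's `np.sum` uses pairwise summation, a plain Python loop accumulates strictly left to right, and `math.fsum` returns the correctly rounded result, so the three can differ in the last bits even on one machine. A quick sketch (the random vector is just a stand-in):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(10**5)

loop_sum = 0.0
for x in v:                    # naive left-to-right accumulation
    loop_sum += x

pairwise = float(np.sum(v))    # NumPy's pairwise summation
exact = math.fsum(v)           # correctly rounded reference

# All three agree to far better than any physically meaningful tolerance,
# but the summation order determines the exact last bits.
print(abs(loop_sum - exact), abs(pairwise - exact))
```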
I am relying on `gap_fit` being reproducible, and it doesn't seem to be, even if I pass `seed`. Any idea why? It's at the 1e-8 level, but through a feedback process it breaks my unit tests. I should be able to provide a fitting database and a command line as an example, so I'll try to package that.