Closed bernstei closed 1 year ago
Nicely spotted @bernstei, thanks! I guess the bug could also be triggered in the printing of the singular values in the same routine. I will patch it ASAP (I'm in the middle of a job interview atm!). I think I will not make the debug variable an argument of the subroutine but just an internal variable, and apply it also to the SVD printing, just in case.
I think if it's just an internal variable with a value fixed at compile time the compiler knows it's false and doesn't fix the issue, but I'll check.
I checked, and I cannot get it to work with any variable that's purely internal with a value that's known at compile time. I think the compiler is clever enough to tell that the printing section will never run. If you have a specific patch you'd like me to try, let me know.
Yes, that makes sense. OK, I'll follow what you proposed initially, but could you check what happens if you put this:
exp_coeff_temp1 = 0.d0
exp_coeff_temp2 = 0.d0
exp_coeff_der_temp = 0.d0
before the if
immediately preceding it?
I don't understand which if
. What do you mean by "preceding it" ? Which "it"? Right after they're allocated?
Sorry for being so imprecise, I meant this:
exp_coeff_temp1 = 0.d0
exp_coeff_temp2 = 0.d0
exp_coeff_der_temp = 0.d0
if( rjs_in(k) < rcut_hard_in .and. mask(k) )then
instead of this:
if( rjs_in(k) < rcut_hard_in .and. mask(k) )then
exp_coeff_temp1 = 0.d0
exp_coeff_temp2 = 0.d0
exp_coeff_der_temp = 0.d0
That does seem to be sufficient. I don't understand why, though, and I'm a little nervous given how random these optimizer bugs often seem to be - it's hard to tell what will or won't turn off the broken optimization. Did you have a specific reason to suggest this change of where those arrays are zeroed?
BTW, more evidence that this optimizer issue is subtle - changing from -O2
to -O3
also fixes the problem.
These variables were not initialized otherwise for some cases. Can I suggest a last thing to try (sorry, I am in a place where I can't test this myself)? What happens if you change the <
in the condition to <=
instead (leave the original code unchanged otherwise)? If that also fixes the issue, then I think it could be a code bug triggered when atoms are right at the cutoff and an uninitialized variable contains a combination of bits corresponding to a non-finite value (a NaN?). [if the value is finite, then the coefficients will go to zero at the cutoff, since 0 times a finite number is 0; if the value is undefined, then 0 times undefined is undefined - I think this could be happening]
No - changing the <
to <=
in the same if
statement still results in NaN
s.
@bernstei sorry I left this hanging for so long. Could you please check if the latest commit fixes the issue?
Ugh - I have no idea where I put my example inputs. Let me see if I can find that.
Found them, testing it now (not trivial because LAMMPS wants to download a fresh copy tied to a specific git commit).
Great, thanks. If it's not fixed, please send me the files so that I can do the testing for potential fixes myself without having to bother you.
No, the latest commit doesn't fix the problem. So far I've been unable to reproduce it except in LAMMPS, but there's one more thing I want to try.
@bernstei I have been doing some testing on this. It looks like the behavior of LAPACK routines used for the basis orthonormalization depends on the level of optimization. This is one of the W matrix coefficients (W = S^-1/2):
carom1@t410-du-21-01~$ gfortran -c module.f90 carom1@t410-du-21-01~$ gfortran -o calc.ex calc.f90 module.o -llapack -lblas carom1@t410-du-21-01~$ ./calc.ex | grep "alpha_max = 8" -B1 -485.62295892374391
alpha_max = 8 carom1@t410-du-21-01~$ gfortran -c module.f90 -O2 carom1@t410-du-21-01~$ gfortran -o calc.ex -O2 calc.f90 module.o -llapack -lblas carom1@t410-du-21-01~$ ./calc.ex | grep "alpha_max = 8" -B1 -485.62295892808970
alpha_max = 8
There seems to be a precision issue (internal truncation to single precision?) triggered by the optimization level. In my tests, O2 and O3 lead to the same values, which differ from the value w/o optimization flag. I wonder if this is at the root of the issue you're observing. The results of the GAP calculation can be quite sensitive to the value of these coefficients. I'll keep digging, maybe I can force the right behavior.
That's basically not possible, as far as I understand, because optimization shouldn't affect linking, just compilation. I think the change must be in what's getting passed to the lapack routines. I can take a look if you can provide me with module.f90
and calc.f90
.
Those codes are minimal interfaces to use the soap_turbo basis construction routine. I traced the issue back to the use of matmul()
introducing a minimal difference in some intermediate matrix value. When I replace matmul()
with dgemm()
I get the same results regardless of optimization (and they coincided with the optimized matmul()
results). I will make this replacement in the code, but I don't think this could be triggering the issue you experienced. I keep looking...
@mcaroba any progress with this, out of curiosity? I'm recompiling LAMMPS, and noticed that I still have my patches floating around the copy of the source inside the LAMMPS build dir.
@bernstei Actually, this week one of my colleagues who's porting this to GPU reported opt flag dependent NaNs (he said only -O2 triggers it) and we're taking a deep look together probably on Friday. I'll let you know!
Let me know if there's anything I can do to help.
@bernstei I think my colleague may have found the bug, but please check if it goes away when you add this pref_f
initialization on top of the block we were discussing about:
pref_f = 0.d0
exp_coeff_temp1 = 0.d0
exp_coeff_temp2 = 0.d0
exp_coeff_der_temp = 0.d0
Let me see if that fixes mine as well. Can you remind me exactly what file and line you mean?
In soap_turbo_radial.f90
(make sure you have the current version in the master branch), add pref_f = 0.d0
to lines 179 and 511. Please let me know if that fixes the issue for you, I'm very keen to get this sorted!
Is there any particular reason to think this is a systematic fix, or is it just luck with some optimizer bug? At least for line 179, I don't see any evidence setting pred_f
to 0 there should make a difference - from what I can tell its value is already set before it's used. Am I missing something with the flow of the code?
pref_f
is set in line 287, but there is a condition in line 281 that does not always get triggered. When it doesn't, pref_f
is not set and exp_coeff_temp2
is set to 0.d0
. Later, in line 343 you have the term pref_f * exp_coeff_temp2
which is always computed. When the condition in 281 is not met, there's no guarantee that pref_f
is a number. If it's NaN, even if exp_coeff_temp2 = 0.d0
, the product pref_f * exp_coeff_temp2
is undefined. I suppose that different optimization levels affect how the compiler handles this situation.
So did it fix the bug for you?
Aaah - that makes sense. I haven't had a chance to check yet - on vacation, so it might take another week.
The two pref_f = 0
lines definitely fixes the immediate problem.
@bernstei thanks a lot for your help debugging this extremely opaque issue and your patience while we figured this out. I just pushed the changes to the soap_turbo library and updated the submodule on the GAP repo. Thanks again!
There appears to be an optimizer bug in gfortran 9.4.0 where the values for
radial_exp_coeff
thatsoap_turbo.f90::get_soap
gets back fromget_radial_expansion_coefficients_poly3
are NaN. Attempts to printexp_coeff
insideget_radial_expansion_coefficients_poly3
lead to apparently correct (not NaN, anyway) values being returned, which is why I think it's an optimizer bug.Reducing the optimization level from -O2 to -O1 fixes it, but I don't want to do that. The only other way I've found to deal with this even vaguely cleanly is to make the compiler think that it might have to print
exp_coeff
. I can do that without having to change any other is is by making theif (.false.) then
section at the very end, which optionally writesexp_coeff
, dependent on a real variable,my_debug
, and make that variable depend on a new optional input argument,debug
, but default to.false.
. This makes the compiler think (at compile time) that it might be printed, which disables whatever optimization breaks things. Even if that section is never actually run becausedebug = .true.
is never passed in, it's enough to fix the optimization issue.If you think this is worth doing, I can paste a patch here or open a PR.