plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0
363 stars 289 forks source link

PBC distance does not work with all compilers #524

Closed carlocamilloni closed 3 years ago

carlocamilloni commented 5 years ago

I had problems with PGI and it was reported an issue with gcc 6 @GiovanniBussi how can we improve it?

GiovanniBussi commented 5 years ago

If the problem is the warning "WARNING: LatticeReduction ..." there are two possible solutions.

  1. In file tools/LatticeReduction.cpp, increase const double epsilon=1e-14 by a couple of orders of magnitude (1e-12)

  2. In the same file, search for "WARNING" (note that it appears twice, you should probably correct both) and add a break, something like: if(counter>=10000){ fprintf(stderr,"WARNING: LatticeReduction::reduce stuck after %u iterations\n",counter);

    • break;* }

The first solution perhaps is a bit better because it will still stop the simulation if the compiler is screwing something up in the optimization. But the second is also fine, if the users then see the warning. Or perhaps you can implement both, so that the check is less strict and when there is a problem the user can at least continue the simulation (being warned).

I think this was the problem that you (@carlocamilloni) told me a few weeks ago. I am not sure that the problem reported on the mailing list is the same one. If you have chance to try on the machine leading to the problem, let me know if this solves the issue. Thanks!

Giovanni

On Fri, Sep 27, 2019 at 9:44 AM Carlo Camilloni notifications@github.com wrote:

I had problems with PGI and it was reported an issue with gcc 6 @GiovanniBussi https://github.com/GiovanniBussi how can we improve it?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/plumed/plumed2/issues/524?email_source=notifications&email_token=ABBNCXVKBPCGTMNGGSXXXTDQLW2ULA5CNFSM4I3C65U2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HOCSENQ, or mute the thread https://github.com/notifications/unsubscribe-auth/ABBNCXR6GOFLNZSXQHTEAEDQLW2ULANCNFSM4I3C65UQ .

carlocamilloni commented 5 years ago

about this, with PGI I tried to increase the constant but that was not the problem, I think it is more tricky. I also tried to remove any code optimisation but was not working as well. I think somehow the two issues are connected because the results were related to some kind of miscalculation, compiler dependent, of PBCs.

GiovanniBussi commented 3 years ago

I managed to check with nvc++ (formerly pgi). This compiler is heavily broken I am afraid. The problematic part is here. There is a bug in optimizing this loop that makes it so two of the three vectors in v become identical and the call to reduce() at the next iteration fails because module2(b) here is zero.

I tried to rewrite the loop like this and some tests are passed:

+    double mtrial;
+    Vector trial;
+    auto best=v[2];
+    auto mbest=modulo2(best);
     for(int x1=x1min; x1<=x1max; x1++)
       for(int x2=x2min; x2<=x2max; x2++) {
         trial=v[2]+x2*v[1]+x1*v[0];
         mtrial=modulo2(trial);
-        if(first || mtrial<mbest) {
+        if(mtrial<mbest) {
           mbest=mtrial;
           best=trial;
-          first=false;
         }
       }

Basically, instead of relying on the variable first to make sure that best is assigned at least once, I pre-assign it to v[2]. This fixes some tests, but other tests still fail.

EDIT: specifically, with the code above the problem is that with some cells the best vector at the end of the loop is identical to v[0], which is not possible for the specific vectors I feed. Adding a std::cerr<<trial in the middle of the loop to check which vectors are tried makes the problem disappear, clearly pointing at some randomness in the result.

I don't have a solution, I am afraid this compiler is just not reliable enough.

A small improvement is to crash when the iterated vectors become NaN, instead of continuing forever. I will add this back to version 2.6.

Notice that, in addition to this (and probably unrelated), I see another bunch of failures: systematically in CVs using RMSD, plus a few others. I will check in case this is a signature of some undefined behavior in the RMSD code, but in any case given the point above this compiler will not be usable.

GiovanniBussi commented 3 years ago

I tried to report this to nvidia, let's see if they can fix it.

GiovanniBussi commented 3 years ago

@carlocamilloni I close this, see #711