nwchemgit / nwchem

NWChem: Open Source High-Performance Computational Chemistry
http://nwchemgit.github.io
Other
514 stars 162 forks source link

ch5n_nbo QA test fails from v7.2.0 #864

Open drew-parsons opened 1 year ago

drew-parsons commented 1 year ago

Describe the bug I'm running QA tests on a debian package build of nwchem 7.2.0. All the CI tests that debian runs are passing, except for ch5n_nbo which fails trivially (energy difference 1e-5 Hartree). What's the best way to handle it?

Describe settings used Building with gcc 13.2.0 with both openmpi 4.1.5 and mpich 4.1.2. Linux (Debian unstable)

To Reproduce

  1. cd QA
  2. MPIRUN_PATH=/usr/bin/mpirun.openmpi NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.openmpi NWCHEM_TOP=$PWD/.. ./runtests.mpi.unix procs 1 ch5n_nbo 2b. MPIRUN_PATH=/usr/bin/mpirun.mpich NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.mpich NWCHEM_TOP=$PWD/.. ./runtests.mpi.unix procs 1 ch5n_nbo

Expected behavior Precision in energy calculations should be reproducible such that tests pass.

Screenshots

$ MPIRUN_PATH=/usr/bin/mpirun.mpich NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.mpich NWCHEM_TOP=`pwd`/.. ./runtests.mpi.unix procs 1 ch5n_nbo

 Running tests/ch5n_nbo/ch5n_nbo 

     cleaning scratch
     copying input and verified output files
     running nwchem (/usr/bin/nwchem.mpich)  with 1 processors 

     verifying output ... failed
@@@     Comparison of Output Files
@@ -1,3 +1,3 @@
 Effective nuclear repulsion energy (a.u.) 42.05
-Total SCF energy = -94.67944
-Total SCF energy = -94.67944
+Total SCF energy = -94.67945
+Total SCF energy = -94.67945

Failed
$ echo $?
1

Additional context Debian CI tests only run a subset of available tests. A random launch of various other tests shows that ch4-scf-dft-prop.nw also fails, evidently due to Issue #776 (sets cosmo dielectric constant to 78.40 instead of the 3.90 given in the input file). Likewise ch4-dft-scf-prop.

edoapra commented 1 year ago

@drew-parsons could you try this patch for the ch5n_nbo failure?

diff --git a/QA/tests/ch5n_nbo/ch5n_nbo.nw b/QA/tests/ch5n_nbo/ch5n_nbo.nw
index 976585de0f..a291436ff4 100644
--- a/QA/tests/ch5n_nbo/ch5n_nbo.nw
+++ b/QA/tests/ch5n_nbo/ch5n_nbo.nw
@@ -23,6 +23,7 @@ geometry
 end

 scf
+ direct
  sym off
  adapt off
 end
edoapra commented 1 year ago

Additional context Debian CI tests only run a subset of available tests. A random launch of various other tests shows that ch4-scf-dft-prop.nw also fails, evidently due to Issue #776 (sets cosmo dielectric constant to 78.40 instead of the 3.90 given in the input file). Likewise ch4-dft-scf-prop.

ch4-scf-dft-prop and ch4-dft-scf-prop were not part of the default QA script doqmtests.mpi for release 7.2.0 (and unfortunately are not part of it even in the current master ...), therefore they were not thoroughly tested as the test appearing in doqmtests.mpi

drew-parsons commented 1 year ago

scf

  • direct sym off adapt off end

Good suggestion, adding direct enables the test to pass. The full calculated energy becomes

Total SCF energy =    -94.679444210504

(was -94.679444920261 without direct) compared to

Total SCF energy =    -94.679444210479

in the reference file, which is within the tolerance required for the QA tests.

edoapra commented 1 year ago

Thanks for the feedback. Unfortunately, this is not a very good fix. The root cause of the failure is in the rounding method used in the NWChem perl scripts. I might be able to get a better fix sometime soon.

drew-parsons commented 1 year ago

Ok. I'll apply this patch to the debian tests in the meantime.

edoapra commented 1 year ago

Current master and hotfix/release-7-2-0 have now a more reliable fix to address this issue.

edoapra commented 1 year ago

v7.2.1 is now available for download

drew-parsons commented 1 year ago

Thanks. Debian builds are underway.

drew-parsons commented 1 year ago

v7.2.1 is now built and available for Debian, https://buildd.debian.org/status/package.php?p=nwchem The ch5n_nbo test is now passing.

edoapra commented 1 year ago

v7.2.1 is now built and available for Debian, https://buildd.debian.org/status/package.php?p=nwchem The ch5n_nbo test is now passing.

Great. The patch below should fix the loong64 failure

diff --git a/src/config/makefile.h b/src/config/makefile.h
index f6cd4e44e9..3702fbfb1e 100644
--- a/src/config/makefile.h
+++ b/src/config/makefile.h
@@ -2131,6 +2131,9 @@ ifneq ($(TARGET),LINUX)
             CFLAGS_FORGA   = -mabi=64
         endif

+        ifeq ($(_CPU),loong64)
+            DONTHAVEM64OPT=Y
+        endif
         ifeq ($(_CPU),riscv64)
             DONTHAVEM64OPT=Y
             COPTIONS   =  -march=rv64gc -mabi=lp64d
drew-parsons commented 1 year ago

_CPU is uname -m. I think that's the first part of the multiarch triplet. The loong64 log at https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=loong64&ver=7.2.1-2&stamp=1697224746&raw=0 shows the triplet as loongarch64-linux-gnu (in /usr/lib/loongarch64-linux-gnu/libblas.so etc).

Should the _CPU be loongarch64 rather than loong64 in that case?

I don't have a loong64 machine to log into and check, but can try it one way or the other in the next build upload.

edoapra commented 1 year ago

You are right. Thanks for spotting this

I have no name!@durian:/# uname -a
Linux durian 6.2.0-34-generic #34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep  7 13:12:03 UTC 2 loongarch64 GNU/Linux
I have no name!@durian:/# uname -m
loongarch64
drew-parsons commented 1 year ago

loong64 has now built successfully (via loongarch64), https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=loong64&ver=7.2.1-3&stamp=1697658700&raw=0

I've one more general question related to tests. In this issue we got the ch5n_nbo test passing, and it's passing on all arches (apart from some MPI trouble on s390x and ia64[gcc]).

But on some of the less common architectures some of the other tests fail in trivial ways (just a small difference in energy). Pragmatically it might be simplest to just drop selected failing tests on the less common architectures, so CI can pass on the other tests.

Tests with only trivial differences in energy compared to the amd64 reference include

Tests with more serious failure include

A number of tests ran out of heap memory on s390x, I don't think that's a bug.

Test logs are a combination of run-time (CI) testing at https://ci.debian.net/packages/n/nwchem/ and build-time testing at https://buildd.debian.org/status/package.php?p=nwchem (debian build time test failures are currently ignored, the green bands indicate the build succeeded, not necessarily all the tests)

I can start skipping these failing tests in the debian builds for these minor architectures so that we can monitor reliable passing of the other tests (i.e. not ignore test failures). Let me know if you want me to apply any specific patches.

edoapra commented 1 year ago

Thanks for the detailed update. I will try to see what I can replicate/reproduce with a generic build on some of the architectures. I have never tried to use s390x ... I am even surprised that the code can build and run.

edoapra commented 1 year ago

I am surprised the code works on x32 ... do you get just a single failure on n2_ccsd?

drew-parsons commented 1 year ago

For sure it means the code is in good condition, that it builds and passes most tests on most minor architectures.

That's right, all the other tests passed on x32, n2_ccsd is the only one that failed. The x32 build log is https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=x32&ver=7.2.1-3&stamp=1697647806&raw=0

In patch 02_makefile_flags.patch we added LINUX64 alongside LINUX in one of the blocks in config/makefile.h to set some build flags. But we also deactivated LINUXCPU in the same block to avoid chip-specific optimisations (we don't want i786 instructions in the i386 build, for instance).

edoapra commented 1 year ago

I see cosmo_h3co mentioned twice for sparc64 ... does it result in small failure or a SIGBUS?

drew-parsons commented 1 year ago

The sparc64 log is https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=sparc64&ver=7.2.1-3&stamp=1697877511&raw=0

cosmo_h3co fails with the SIGBUS with mpich,

It was cosmo_h2o not cosmo_h3co that fails with the small energy difference,

-Total SCF energy = -76.04274
+Total SCF energy = -76.05245
drew-parsons commented 1 year ago

alpha was slow to build but has finally declared it wants to join the DONTHAVEM64OPT group. https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=alpha&ver=7.2.1-3&stamp=1697807798&raw=0

edoapra commented 1 year ago

In patch 02_makefile_flags.patch we added LINUX64 alongside LINUX in one of the blocks in config/makefile.h to set some build flags. But we also deactivated LINUXCPU in the same block to avoid chip-specific optimisations (we don't want i786 instructions in the i386 build, for instance).

The following bit of 02_makefile_flags.patch does not seem a good idea to me, since you end up using the 32-bit options for 64-bit linux In the proper LINUX64 makefile.h part, you can now set USE_HWOPT=n to disable all the native CPU optimizations.

-ifeq ($(TARGET),$(findstring $(TARGET),LINUX CYGNUS CYGWIN))
+ifeq ($(TARGET),$(findstring $(TARGET),LINUX LINUX64 CYGNUS CYGWIN))
drew-parsons commented 1 year ago

USE_HWOPT will help the generic debian package builds. I'll try it along with removing the LINUX64 patch in the next upload.

edoapra commented 1 year ago

powerpc architecture: I am not able to reproduce your failures (other than n2_ccsd). There two difference in my build from the debian ci: I am using the vanilla 7.2.1 makefile.h and tools/GA using the builting ga 5.8.2 (tested both ARMCI_NETWORK=MPI-TS and ARMCI_NETWORK=ARMCI)

drew-parsons commented 1 year ago

Strange about powerpc. We have an external ga but it's still 5.8.2.

With alpha (build log), it's getting the wrong flags for gcc building peigs,

make[5]: Entering directory '/<<PKGBUILDDIR>>/build-mpich/src/peigs/src/c'
Compiling clustrfix.c...
cc: error: unrecognized command-line option ‘-fast’; did you mean ‘-Ofast’?
cc: error: unrecognized command-line option ‘-tune’; did you mean ‘-mtune=’?
cc: error: unrecognized command-line option ‘-arch’
make[5]: *** [GNUmakefile:281: clustrfix.o] Error 1

That's coming from https://github.com/nwchemgit/nwchem/blob/060f94520791714054bdf6910a6dfc28045ec2be/src/peigs/DEFS#L496 What's the best way to handle it?

x32 is apparently not happy having to handle LINUXCPU, https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=x32&ver=7.2.1-4&stamp=1698161945&raw=0

Making libraries in rtdb
Compiling rtdb_seq.c...
In file included from rtdb_seq.c:5:
/usr/include/stdlib.h:26:10: fatal error: bits/libc-header-start.h: No such file or directory
   26 | #include <bits/libc-header-start.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~

I'm a bit confused by the problem. nwchem is just including stdlib.h, stdlib.h should know what bits it needs. There's no special handling for LINUXCPU=x32 anyway, the behaviour shouldn't have changed.

edoapra commented 1 year ago

Let me have a look at the alpha build.
x32 should not be considered as a viable NWChem architecture. I would rather deprecated.
I am making some progress on the s390x issues. They might be due to the debian shipped OpenBLAS libraries.

drew-parsons commented 1 year ago

One point about BLAS. Debian policy is to build against the generic BLAS (libblas-dev). libblas.so is ABI compatible so any (optimised) BLAS package can be installed on end-user systems to provide the desired optimised blas implementation (OpenBLAS or one of the others) at runtime. But in the debian CI tests we haven't configured installation of an optimised BLAS package. So the s390x failure is using generic BLAS.

edoapra commented 1 year ago

One point about BLAS. Debian policy is to build against the generic BLAS (libblas-dev). libblas.so is ABI compatible so any (optimised) BLAS package can be installed on end-user systems to provide the desired optimised blas implementation (OpenBLAS or one of the others) at runtime. But in the debian CI tests we haven't configured installation of an optimised BLAS package. So the s390x failure is using generic BLAS.

Could you elaborate a bit on how to switch from generic libblas-dev to libopenblas-dev at runtime?

drew-parsons commented 1 year ago

It just needs libopenblas-dev to be installed, nothing else. Then the install scripts use the debian alternatives mechanism to set a symlink for the preferred libblas.so. The preference can be set manually using update-alternatives, but the default alternative is determined by priority values defined for each implementation in their installation scripts (openblas is higher priority than generic blas).

There is an additional choice about which openblas build. libopenblas-dev will install the pthread build (libopenblas-pthread-dev) if not otherwise specified, but openmp threads (libopenblas-openmp-dev) and no threading (libopenblas-serial-dev) is also available.

64-bit BLAS is also an option (libblas64-dev, etc). We've had a request to provide a longint nwchem build, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=935993 But that's a separate library, libblas64.so not libblas.so.

edoapra commented 1 year ago

64-bit BLAS is not enough, since we have Scalapack, too. Scalapack has been ported to use 64-bit integers for a while (this I what I use by default and do strongly recommend). I wish these changes would make it to debian some time soon. From https://github.com/Reference-ScaLAPACK/scalapack/releases/tag/v2.2.0

Release 2.2.0

New features:

Allow compilation in ILP64 mode, PR https://github.com/Reference-ScaLAPACK/scalapack/pull/19

edoapra commented 1 year ago

I am testing the alpha plaform fixes.

edoapra commented 1 year ago

My earlier analysis on the s390x just proved to be wrong. It was an issue with 64bit to 32bit conversion. I am guessing that s390x and sparc64 being big endian architectures are less forgiving for this kind of issues.

drew-parsons commented 1 year ago

Endianess can be tricky.

As far as scalapack goes, we have a bug request to provide a 64-bit build at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961186

edoapra commented 1 year ago

I might craft a 7.2.2 release with all these latest changes

drew-parsons commented 1 year ago

https://github.com/nwchemgit/nwchem/pull/894/commits/776068e3d2981ab6f83a3b24287ae85c363d0576 the change to ALPHA_CPU, should that apply also to CRAY-T3 ? https://github.com/nwchemgit/nwchem/blob/db970c56fff4ca515374d0d4514998a03f592837/src/peigs/DEFS#L86 https://github.com/nwchemgit/nwchem/blob/db970c56fff4ca515374d0d4514998a03f592837/src/peigs/DEFS#L102

edoapra commented 1 year ago

I believe there are no Cray-T3D or Cray-T3E powered on at this point in time. They are only present in computer museum.s We should delete all those lines of makefile at some point. https://cray-cyber.org/old/systems/t3e.php

edoapra commented 1 year ago

It sounds like the time has come to consider obsolete ia64 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1e0c505e13162a2abe7c984309cfe2ae976b428d

drew-parsons commented 1 year ago

Will be interesting to see how long debian keeps it on the build list. Likely debian will have to drop it too if it's no longer in the linux kernel.

edoapra commented 1 year ago

NWChem 7.2.2 is out with all the latest fixes to make the build work on big-endian architectures (the fix for Python 3.12 https://github.com/nwchemgit/nwchem/issues/892 is there, too).
https://github.com/nwchemgit/nwchem/releases/tag/v7.2.2-release
https://github.com/nwchemgit/nwchem/blob/master/release.notes.7.2.2.md

GA 5.8.2 needs to be patched to work on big-endian architectures with the following patches available in the ga develop branch
https://github.com/GlobalArrays/ga/commit/adccda8f2ac107c70eafa2c9c5b6e83e453107a7.patch
https://github.com/GlobalArrays/ga/commit/c833499616ca8fe2bd13c59dd666a404a80f9405.patch