Open drew-parsons opened 1 year ago
@drew-parsons could you try this patch for the ch5n_nbo failure?
diff --git a/QA/tests/ch5n_nbo/ch5n_nbo.nw b/QA/tests/ch5n_nbo/ch5n_nbo.nw
index 976585de0f..a291436ff4 100644
--- a/QA/tests/ch5n_nbo/ch5n_nbo.nw
+++ b/QA/tests/ch5n_nbo/ch5n_nbo.nw
@@ -23,6 +23,7 @@ geometry
end
scf
+ direct
sym off
adapt off
end
Additional context Debian CI tests only run a subset of available tests. A random launch of various other tests shows that ch4-scf-dft-prop.nw also fails, evidently due to Issue #776 (sets cosmo dielectric constant to 78.40 instead of the 3.90 given in the input file). Likewise ch4-dft-scf-prop.
ch4-scf-dft-prop and ch4-dft-scf-prop were not part of the default QA script doqmtests.mpi
for release 7.2.0 (and unfortunately are not part of it even in the current master ...), therefore they were not thoroughly tested as the test appearing in doqmtests.mpi
scf
- direct sym off adapt off end
Good suggestion, adding direct
enables the test to pass. The full calculated energy becomes
Total SCF energy = -94.679444210504
(was -94.679444920261 without direct
)
compared to
Total SCF energy = -94.679444210479
in the reference file, which is within the tolerance required for the QA tests.
Thanks for the feedback. Unfortunately, this is not a very good fix. The root cause of the failure is in the rounding method used in the NWChem perl scripts. I might be able to get a better fix sometime soon.
Ok. I'll apply this patch to the debian tests in the meantime.
Current master and hotfix/release-7-2-0 have now a more reliable fix to address this issue.
v7.2.1 is now available for download
Thanks. Debian builds are underway.
v7.2.1 is now built and available for Debian, https://buildd.debian.org/status/package.php?p=nwchem The ch5n_nbo test is now passing.
v7.2.1 is now built and available for Debian, https://buildd.debian.org/status/package.php?p=nwchem The ch5n_nbo test is now passing.
Great. The patch below should fix the loong64 failure
diff --git a/src/config/makefile.h b/src/config/makefile.h
index f6cd4e44e9..3702fbfb1e 100644
--- a/src/config/makefile.h
+++ b/src/config/makefile.h
@@ -2131,6 +2131,9 @@ ifneq ($(TARGET),LINUX)
CFLAGS_FORGA = -mabi=64
endif
+ ifeq ($(_CPU),loong64)
+ DONTHAVEM64OPT=Y
+ endif
ifeq ($(_CPU),riscv64)
DONTHAVEM64OPT=Y
COPTIONS = -march=rv64gc -mabi=lp64d
_CPU
is uname -m
. I think that's the first part of the multiarch triplet. The loong64 log at
https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=loong64&ver=7.2.1-2&stamp=1697224746&raw=0
shows the triplet as loongarch64-linux-gnu
(in /usr/lib/loongarch64-linux-gnu/libblas.so
etc).
Should the _CPU
be loongarch64
rather than loong64
in that case?
I don't have a loong64 machine to log into and check, but can try it one way or the other in the next build upload.
You are right. Thanks for spotting this
I have no name!@durian:/# uname -a
Linux durian 6.2.0-34-generic #34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 13:12:03 UTC 2 loongarch64 GNU/Linux
I have no name!@durian:/# uname -m
loongarch64
loong64 has now built successfully (via loongarch64), https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=loong64&ver=7.2.1-3&stamp=1697658700&raw=0
I've one more general question related to tests. In this issue we got the ch5n_nbo test passing, and it's passing on all arches (apart from some MPI trouble on s390x and ia64[gcc]).
But on some of the less common architectures some of the other tests fail in trivial ways (just a small difference in energy). Pragmatically it might be simplest to just drop selected failing tests on the less common architectures, so CI can pass on the other tests.
Tests with only trivial differences in energy compared to the amd64 reference include
Tests with more serious failure include
A number of tests ran out of heap memory on s390x, I don't think that's a bug.
Test logs are a combination of run-time (CI) testing at https://ci.debian.net/packages/n/nwchem/ and build-time testing at https://buildd.debian.org/status/package.php?p=nwchem (debian build time test failures are currently ignored, the green bands indicate the build succeeded, not necessarily all the tests)
I can start skipping these failing tests in the debian builds for these minor architectures so that we can monitor reliable passing of the other tests (i.e. not ignore test failures). Let me know if you want me to apply any specific patches.
Thanks for the detailed update. I will try to see what I can replicate/reproduce with a generic build on some of the architectures. I have never tried to use s390x ... I am even surprised that the code can build and run.
I am surprised the code works on x32 ... do you get just a single failure on n2_ccsd?
For sure it means the code is in good condition, that it builds and passes most tests on most minor architectures.
That's right, all the other tests passed on x32, n2_ccsd is the only one that failed. The x32 build log is https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=x32&ver=7.2.1-3&stamp=1697647806&raw=0
In patch 02_makefile_flags.patch we added LINUX64
alongside LINUX
in one of the blocks in config/makefile.h to set some build flags. But we also deactivated LINUXCPU in the same block to avoid chip-specific optimisations (we don't want i786 instructions in the i386 build, for instance).
I see cosmo_h3co mentioned twice for sparc64 ... does it result in small failure or a SIGBUS?
The sparc64 log is https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=sparc64&ver=7.2.1-3&stamp=1697877511&raw=0
cosmo_h3co fails with the SIGBUS with mpich,
It was cosmo_h2o not cosmo_h3co that fails with the small energy difference,
-Total SCF energy = -76.04274
+Total SCF energy = -76.05245
alpha was slow to build but has finally declared it wants to join the DONTHAVEM64OPT group. https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=alpha&ver=7.2.1-3&stamp=1697807798&raw=0
In patch 02_makefile_flags.patch we added
LINUX64
alongsideLINUX
in one of the blocks in config/makefile.h to set some build flags. But we also deactivated LINUXCPU in the same block to avoid chip-specific optimisations (we don't want i786 instructions in the i386 build, for instance).
The following bit of 02_makefile_flags.patch
does not seem a good idea to me, since you end up using the 32-bit options for 64-bit linux
In the proper LINUX64 makefile.h part, you can now set USE_HWOPT=n
to disable all the native CPU optimizations.
-ifeq ($(TARGET),$(findstring $(TARGET),LINUX CYGNUS CYGWIN))
+ifeq ($(TARGET),$(findstring $(TARGET),LINUX LINUX64 CYGNUS CYGWIN))
USE_HWOPT will help the generic debian package builds. I'll try it along with removing the LINUX64 patch in the next upload.
powerpc architecture: I am not able to reproduce your failures (other than n2_ccsd). There two difference in my build from the debian ci: I am using the vanilla 7.2.1 makefile.h and tools/GA using the builting ga 5.8.2 (tested both ARMCI_NETWORK=MPI-TS and ARMCI_NETWORK=ARMCI)
Strange about powerpc. We have an external ga but it's still 5.8.2.
With alpha (build log), it's getting the wrong flags for gcc building peigs,
make[5]: Entering directory '/<<PKGBUILDDIR>>/build-mpich/src/peigs/src/c'
Compiling clustrfix.c...
cc: error: unrecognized command-line option ‘-fast’; did you mean ‘-Ofast’?
cc: error: unrecognized command-line option ‘-tune’; did you mean ‘-mtune=’?
cc: error: unrecognized command-line option ‘-arch’
make[5]: *** [GNUmakefile:281: clustrfix.o] Error 1
That's coming from https://github.com/nwchemgit/nwchem/blob/060f94520791714054bdf6910a6dfc28045ec2be/src/peigs/DEFS#L496 What's the best way to handle it?
x32 is apparently not happy having to handle LINUXCPU, https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=x32&ver=7.2.1-4&stamp=1698161945&raw=0
Making libraries in rtdb
Compiling rtdb_seq.c...
In file included from rtdb_seq.c:5:
/usr/include/stdlib.h:26:10: fatal error: bits/libc-header-start.h: No such file or directory
26 | #include <bits/libc-header-start.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
I'm a bit confused by the problem. nwchem is just including stdlib.h, stdlib.h should know what bits it needs. There's no special handling for LINUXCPU=x32 anyway, the behaviour shouldn't have changed.
Let me have a look at the alpha
build.
x32
should not be considered as a viable NWChem architecture. I would rather deprecated.
I am making some progress on the s390x
issues. They might be due to the debian shipped OpenBLAS libraries.
One point about BLAS. Debian policy is to build against the generic BLAS (libblas-dev). libblas.so is ABI compatible so any (optimised) BLAS package can be installed on end-user systems to provide the desired optimised blas implementation (OpenBLAS or one of the others) at runtime. But in the debian CI tests we haven't configured installation of an optimised BLAS package. So the s390x failure is using generic BLAS.
One point about BLAS. Debian policy is to build against the generic BLAS (libblas-dev). libblas.so is ABI compatible so any (optimised) BLAS package can be installed on end-user systems to provide the desired optimised blas implementation (OpenBLAS or one of the others) at runtime. But in the debian CI tests we haven't configured installation of an optimised BLAS package. So the s390x failure is using generic BLAS.
Could you elaborate a bit on how to switch from generic libblas-dev to libopenblas-dev at runtime?
It just needs libopenblas-dev to be installed, nothing else. Then the install scripts use the debian alternatives mechanism to set a symlink for the preferred libblas.so. The preference can be set manually using update-alternatives
, but the default alternative is determined by priority values defined for each implementation in their installation scripts (openblas is higher priority than generic blas).
There is an additional choice about which openblas build. libopenblas-dev will install the pthread build (libopenblas-pthread-dev
) if not otherwise specified, but openmp threads (libopenblas-openmp-dev
) and no threading (libopenblas-serial-dev
) is also available.
64-bit BLAS is also an option (libblas64-dev, etc). We've had a request to provide a longint nwchem build, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=935993 But that's a separate library, libblas64.so
not libblas.so
.
64-bit BLAS is not enough, since we have Scalapack, too. Scalapack has been ported to use 64-bit integers for a while (this I what I use by default and do strongly recommend). I wish these changes would make it to debian some time soon. From https://github.com/Reference-ScaLAPACK/scalapack/releases/tag/v2.2.0
Release 2.2.0
New features:
Allow compilation in ILP64 mode, PR https://github.com/Reference-ScaLAPACK/scalapack/pull/19
I am testing the alpha
plaform fixes.
My earlier analysis on the s390x just proved to be wrong. It was an issue with 64bit to 32bit conversion. I am guessing that s390x and sparc64 being big endian architectures are less forgiving for this kind of issues.
Endianess can be tricky.
As far as scalapack goes, we have a bug request to provide a 64-bit build at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961186
I might craft a 7.2.2 release with all these latest changes
https://github.com/nwchemgit/nwchem/pull/894/commits/776068e3d2981ab6f83a3b24287ae85c363d0576 the change to ALPHA_CPU, should that apply also to CRAY-T3 ? https://github.com/nwchemgit/nwchem/blob/db970c56fff4ca515374d0d4514998a03f592837/src/peigs/DEFS#L86 https://github.com/nwchemgit/nwchem/blob/db970c56fff4ca515374d0d4514998a03f592837/src/peigs/DEFS#L102
I believe there are no Cray-T3D or Cray-T3E powered on at this point in time. They are only present in computer museum.s We should delete all those lines of makefile at some point. https://cray-cyber.org/old/systems/t3e.php
It sounds like the time has come to consider obsolete ia64 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1e0c505e13162a2abe7c984309cfe2ae976b428d
Will be interesting to see how long debian keeps it on the build list. Likely debian will have to drop it too if it's no longer in the linux kernel.
NWChem 7.2.2 is out with all the latest fixes to make the build work on big-endian architectures (the fix for Python 3.12 https://github.com/nwchemgit/nwchem/issues/892 is there, too).
https://github.com/nwchemgit/nwchem/releases/tag/v7.2.2-release
https://github.com/nwchemgit/nwchem/blob/master/release.notes.7.2.2.md
GA 5.8.2 needs to be patched to work on big-endian architectures with the following patches available in the ga develop branch
https://github.com/GlobalArrays/ga/commit/adccda8f2ac107c70eafa2c9c5b6e83e453107a7.patch
https://github.com/GlobalArrays/ga/commit/c833499616ca8fe2bd13c59dd666a404a80f9405.patch
Describe the bug I'm running QA tests on a debian package build of nwchem 7.2.0. All the CI tests that debian runs are passing, except for ch5n_nbo which fails trivially (energy difference 1e-5 Hartree). What's the best way to handle it?
Describe settings used Building with gcc 13.2.0 with both openmpi 4.1.5 and mpich 4.1.2. Linux (Debian unstable)
To Reproduce
cd QA
MPIRUN_PATH=/usr/bin/mpirun.openmpi NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.openmpi NWCHEM_TOP=$PWD/.. ./runtests.mpi.unix procs 1 ch5n_nbo
2b.MPIRUN_PATH=/usr/bin/mpirun.mpich NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.mpich NWCHEM_TOP=$PWD/.. ./runtests.mpi.unix procs 1 ch5n_nbo
Expected behavior Precision in energy calculations should be reproducible such that tests pass.
Screenshots
Additional context Debian CI tests only run a subset of available tests. A random launch of various other tests shows that ch4-scf-dft-prop.nw also fails, evidently due to Issue #776 (sets cosmo dielectric constant to 78.40 instead of the 3.90 given in the input file). Likewise ch4-dft-scf-prop.