nwchemgit / nwchem

NWChem: Open Source High-Performance Computational Chemistry
http://nwchemgit.github.io
Other
501 stars 160 forks source link

7.2.1: a few tests fail with mpich during a fedora build #895

Closed marcindulak closed 11 months ago

marcindulak commented 11 months ago

Describe the bug

This is a follow up on #892. It collects the failures encountered when building nwchem 7.2.1 https://github.com/nwchemgit/nwchem/commit/487f8b945fbe9cedf02757dacddc66d40dc74ed9 on fedora 39, with mpich.

It does not look like the found errors are reproducible, maybe some are.

Describe settings used

The build is available at https://koji.fedoraproject.org/koji/taskinfo?taskID=107865675

export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export USE_ARUR=TRUE
export USE_NOFSCHECK=TRUE
export USE_NOIO=TRUE
export NWCHEM_FSCHECK=N
export LARGE_FILES=TRUE
export MRCC_THEORY=Y
export EACCSD=Y
export IPCCSD=Y
export CCSDTQ=Y
export CCSDTLR=Y
export NWCHEM_LONG_PATHS=Y
unset USE_LIBXC
export LIBXC_LIB='/usr/lib64'
export LIBXC_INCLUDE='/usr/include'
export LIBXC_MODDIR='/usr/lib64/gfortran/modules'
export NO_NWPWXC_VDW3A=1
export HAS_BLAS=yes
export BLASOPT='-L/usr/lib64 -lflexiblas'
export BLAS_SIZE='4'
export LAPACK_LIB='-L/usr/lib64 -lflexiblas'
export MAKE=/usr/bin/make
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/usr/lib64/mpich/bin/mpiexec
export LD_LIBRARY_PATH=/usr/lib64/mpich/lib
export EXTERNAL_GA_PATH=/usr/lib64/mpich
make nwchem_config NWCHEM_MODULES="all python" 2>&1
make 64_to_32 2>&1
export MAKEOPTS="USE_64TO32=y"
make ${MAKEOPTS} 2>&1

Attach log files

List the failures on x86_64, this does not include crashes

wget https://kojipkgs.fedoraproject.org//work/tasks/5675/107865675/build.log

grep -E "Running|verifying output ... failed" build.log | grep failed -B 1
 Running tests/h2o_dk/h2o_dk 
     verifying output ... failed
--
 Running tests/prop_ch3f/prop_ch3f 
     verifying output ... failed
--
 Running tests/h2o-campbe0/h2o-campbe0 
     verifying output ... failed

See the build.log file for the reported diffs, pasted below manually.

1.

 Running tests/h2o_dk/h2o_dk

     cleaning scratch
     copying input and verified output files
     running nwchem (/builddir/build/BUILD/nwchem-487f8b945fbe9cedf02757dacddc66d40dc74ed9/bin/LINUX64/nwchem_mpich)  with 2 processors

     verifying output ... failed
@@@     Comparison of Output Files
@@ -1,3 +1,2 @@
-Effective nuclear repulsion energy (a.u.) 9.01
 Total SCF energy = -76.05913
 Total SCF energy = -76.05913

Failed

2.

 Running tests/prop_ch3f/prop_ch3f

     cleaning scratch
     copying input and verified output files
     running nwchem (/builddir/build/BUILD/nwchem-487f8b945fbe9cedf02757dacddc66d40dc74ed9/bin/LINUX64/nwchem_mpich)  with 2 processors

     verifying output ... failed
@@@     Comparison of Output Files
@@ -13,5 +13,4 @@
 DMZ 2.519
 DMZEFC 0.000
-XX -8.702 0.000 -8.702
 YY -8.702 0.000 -8.702
 ZZ -9.405 0.000 -9.405
@@ -42,5 +41,4 @@
 XXY 0.531 0.000 0.531
 XXZ -1.521 0.000 -1.521
-YYX 0.531 0.000 0.531
 YYZ -1.521 0.000 -1.521
 ZZX 0.000 0.000 0.000
@@ -90,6 +88,4 @@
 DMX 0.000
 DMXEFC 0.000
-DMY 0.000
-DMYEFC 0.000
 DMZ 0.853
 DMZEFC 0.000

Failed

3.

 Running tests/h2o-campbe0/h2o-campbe0

     cleaning scratch
     copying input and verified output files
     running nwchem (/builddir/build/BUILD/nwchem-487f8b945fbe9cedf02757dacddc66d40dc74ed9/bin/LINUX64/nwchem_mpich)  with 2 processors

     verifying output ... failed
@@@     Comparison of Output Files
@@ -13,6 +13,4 @@
 DMX 0.000
 DMXEFC 0.000
-DMY 0.000
-DMYEFC 0.000
 DMZ -1.927
 DMZEFC 0.000

Failed

To Reproduce

It seems like the errors obtained when running on fedora builders are nor reproduced using the steps below.

Use the following Dockerfile, providing TEST and an argument, for example docker build --build-arg TEST=tests/h2o_dk/h2o_dk --progress=plain -t test:latest . 2>&1 | cut -d' ' -f3- > out.txt

FROM fedora:39@sha256:95c88cea36312cfd73613f1c56d8a8db8e63be5cd884cfc97ba9475d0a45eac5

RUN set -x \
    && dnf install -y https://kojipkgs.fedoraproject.org//work/tasks/5675/107865675/nwchem-7.2.1-1.fc39.x86_64.rpm \
                      https://kojipkgs.fedoraproject.org//work/tasks/5675/107865675/nwchem-common-7.2.1-1.fc39.noarch.rpm \
                      https://kojipkgs.fedoraproject.org//work/tasks/5675/107865675/nwchem-mpich-7.2.1-1.fc39.x86_64.rpm \
    && dnf clean all

RUN set -x \
    && uname -a

RUN set -x \
    && cat /etc/*release | grep PRETTY

ARG TEST
RUN set -x \
    && curl -sL https://github.com/nwchemgit/nwchem/raw/master/QA/$TEST.nw -o /tmp/test.nw

RUN cd /tmp && . /etc/profile.d/modules.sh&& module use /usr/share/modulefiles&& module load mpi/mpich && FI_PROVIDER=tcp mpiexec -np 1 nwchem_mpich test.nw

CMD ["/bin/bash"]

FI_PROVIDER=tcp is to avoid the error below, I found this suggestion here https://github.com/open-mpi/ompi/issues/11295#issuecomment-1382190063

buildkitsandbox:rank0.nwchem_binary_mpich: Failed to get eth0 (unit 0) cpu set
buildkitsandbox:rank0: PSM3 can't open nic unit: 0 (err=23)
Abort(874633103): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66)........: MPI_Init(argc=0x7ffd27dc401c, argv=0x7ffd27dc4010) failed
MPII_Init_thread(234)....:
MPID_Init(513)...........:
MPIDI_OFI_init_local(604):
create_vni_context(982)..: OFI endpoint open failed (ofi_init.c:982:create_vni_context:Cannot allocate memory)

I'm unable the reproduce failure 1. (tests/h2o_dk/h2o_dk), which suggests that Effective nuclear repulsion energy (a.u.) 9.01 is missing in the generated output?

I'm unable to reproduce failure 2., which suggests that a line like -XX -8.702 0.000 -8.702 is missing in the generated output.

I may reproduce failure 3. (or maybe still not?) - it looks like sometimes minus precedes -0.0000000000 as in DMY -0.0000000000 or in other components, while the reference file contains 0.0000000000.

   Dipole moment        0.7583263865 A.U.
             DMX       -0.0000000000 DMXEFC        0.0000000000
             DMY        0.0000000000 DMYEFC        0.0000000000
             DMZ       -0.7583263865 DMZEFC        0.0000000000
   -EFC- dipole         0.0000000000 A.U.
   Total dipole         0.7583263865 A.U.

   Dipole moment        1.9274879834 Debye(s)
             DMX       -0.0000000000 DMXEFC        0.0000000000
             DMY        0.0000000000 DMYEFC        0.0000000000
             DMZ       -1.9274879834 DMZEFC        0.0000000000
   -EFC- dipole         0.0000000000 DEBYE(S)
   Total dipole         1.9274879834 DEBYE(S)

Additional context

Some tools would be helpful to facilitate error reporting and analysis, maybe some already exist:

  1. Print test results in a machine friendly format, so the failures can be more easily summarized
  2. Have a script that given the obtained and reference output files compares them and prints the summary of differences.
  3. While running the test suite, if a test fails, optionally cat the whole output. I'm not sure this can be done reliably
edoapra commented 11 months ago

Have a script that given the obtained and reference output files compares them and prints the summary of differences.

I am not sure how different is what you are suggesting from what the current (albeit unsophisticated) $NWCHEM_TOP/QA/nwparse.pl script does.

Could you please elaborate what is missing?

marcindulak commented 11 months ago

Yes, nwparse.pl solves

  1. Have a script that given the obtained and reference output files compares them and prints the summary of differences.

I see also that 1. (summarizing failures) may be extracted with

diff <(grep "Running tests/" build.log) <(grep -E "Running tests/|verifying output ... OK" build.log | grep "verifying output" -B 1 | grep Running)
6d5
<  Running tests/pyqa3/pyqa3 
41d39
<  Running tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk 
45d42
<  Running tests/h2o_dk/h2o_dk 
47d43
<  Running tests/pyqa3/pyqa3 
53d48
<  Running tests/prop_ch3f/prop_ch3f 
58d52
<  Running tests/h2o-campbe0/h2o-campbe0 
82d75
<  Running tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk

This shows an additional missing success on tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk - when started locally it's running for 30 minutes and it's not done yet. This means it gets killed due to the tests timeout I imposed timeout --preserve-status --kill-after 10 1800 time ./dolibxctests.mpi 2. Maybe there is a smaller example that tests a similar functionality?

edoapra commented 11 months ago

This shows an additional missing success on tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk - when started locally it's running for 30 minutes and it's not done yet. This means it gets killed due to the tests timeout I imposed timeout --preserve-status --kill-after 10 1800 time ./dolibxctests.mpi 2. Maybe there is a smaller example that tests a similar functionality?

Sure, that's the purpose of all the other tests in dolibxctests.mpi. You could apply a patch that skips libxc_waterdimer_bmk

edoapra commented 11 months ago

Any luck in inspecting the mpich output files?

marcindulak commented 11 months ago

I'm trying to print the outputs on stdout during the build.

marcindulak commented 11 months ago

Here are the builds that print outputs of the failed tests.

7.2.2 https://github.com/nwchemgit/nwchem/commit/74936fb92aec6990ce48bf334747215813684576 https://koji.fedoraproject.org/koji/taskinfo?taskID=108590680

master https://github.com/nwchemgit/nwchem/commit/45f057eca7e3bbe85815efe2bbcbbcacc6b95471 https://koji.fedoraproject.org/koji/taskinfo?taskID=108590838

If you want to keep the build.log files or src.rpm, please download them since they'll be removed at some point (I don't know when).

The outputs of the failed tests, and their nwparse.pl outputs are included in the build.log. First the summary of all tests is printed, then the out and out.nwparse files of the individual tests. I indicated where the outputs start with a long line of hashes '#'.

Could this mean that export HYDRA_DEBUG=0 does not help anymore https://github.com/edoapra/fedpkg/commit/fc842066088321000a481e837bd313899df77753 in getting mpich to print all output? I would expect mpich to print all output, independently of the debug options.

Selected observations

1

libxc_tddftgrad_h2 fails with openmpi. While libxc_tddftgrad_h2.out looks complete and taking 5.9 seconds, libxc_tddftgrad_h2.out.nwparse is not generated and libxc_tddftgrad_h2.err is empty. I think this may be due to the timeout timeout --preserve-status --kill-after 10 1800 time ./dolibxctests.mpi 2

2

h2o_opt fails with mpich

@@@     Comparison of Output Files
@@ -267,3 +267,2 @@
           H      0.000      0.000      0.000
           H      0.078      0.105      0.000
-Effective nuclear repulsion energy (a.u.) 8.30

The h2o_opt.out extracted from the build.log is attached to my comment h2o_opt.log

The very last Effective nuclear repulsion energy (a.u.) is missing in the output, and the place where it should appear looks like this

      Atomic Mass

      -----------
      O                 15.994910
      H                  1.007825
Sending upstream hdr.cmd = CMD_STDOUT
            Nuclear Dipole moment (a.u.)
            ----------------------------

3

prop_ch3f fails with mpich

@@@     Comparison of Output Files
@@ -85,6 +85,4 @@
 isotropic = 29.154
 anisotropy = 7.162
-isotropic = 29.154
-anisotropy = 7.162
 Total DFT energy = -139.75155
 DMX 0.000
...

I'm not analysing all missing lines, but see this one which causes isotropic = 29.154 to be missing

        Total Shielding Tensor
     29.5844     -1.0200     -2.8262
     -1.0200     26.0511      0.7573
     -2.8262      0.7573     31.8258
Sending upstream hdr.cmd = CMD_STDOUT
          anisotropy =       7.1624

4

libxc_he2p_wb97 fails with mpich

@@@     Comparison of Output Files
@@ -1,3 +1,2 @@
-Effective nuclear repulsion energy (a.u.) 0.00
 Effective nuclear repulsion energy (a.u.) 1.76
 Effective nuclear repulsion energy (a.u.) 1.76

I'm not analysing this, but expect again a Sending upstream hdr.cmd = CMD_STDOUT.

In case we need to do it again, this is the build macro that was used

%global docheck() \
cp -rp QA.orig QA&& \
cd QA&& \
export LD_LIBRARY_PATH=${MPI_LIB}&& \
export PATH=%{PKG_TOP}/bin/$NWCHEM_TARGET:${MPI_BIN}:${PATH}&& \
export MPIRUN_PATH=${MPI_BIN}/mpiexec&& \
export MPIRUN_NPOPT="-verbose -np" && \
export USE_LIBXC=True && \
export NWCHEM_EXECUTABLE=%{PKG_TOP}/bin/$NWCHEM_TARGET/nwchem$MPI_SUFFIX&& \
timeout ${TIMEOUT_OPTS} time ./doafewqmtests.mpi ${NPROC} 2>&1 < /dev/null | tee ../doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
mv testoutputs ../testoutputs.doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
timeout ${TIMEOUT_OPTS} time ./dolibxctests.mpi ${NPROC} 2>&1 < /dev/null | tee ../dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
mv testoutputs ../testoutputs.dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
BUILD_LOG=../doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
TESTOUTPUTS=../testoutputs.doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
ls -al ${TESTOUTPUTS}&& \
for f in $(diff <(grep "Running tests/" ${BUILD_LOG}) <(grep -E "Running tests/|verifying output ... OK" ${BUILD_LOG} | grep "verifying output" -B 1 | grep Running) | grep Running | cut -d' ' -f 4); do printf '#%.0s' {1..80} && echo && NAME=$(basename ${f}) && echo ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && cat ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && if test -r ${TESTOUTPUTS}/${NAME}.out.nwparse; then cat ${TESTOUTPUTS}/${NAME}.out.nwparse; else cat ${TESTOUTPUTS}/${NAME}.err; fi; done&& \
BUILD_LOG=../dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
TESTOUTPUTS=../testoutputs.dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
ls -al ${TESTOUTPUTS}&& \
for f in $(diff <(grep "Running tests/" ${BUILD_LOG}) <(grep -E "Running tests/|verifying output ... OK" ${BUILD_LOG} | grep "verifying output" -B 1 | grep Running) | grep Running | cut -d' ' -f 4); do printf '#%.0s' {1..80} && echo && NAME=$(basename ${f}) && echo ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && cat ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && if test -r ${TESTOUTPUTS}/${NAME}.out.nwparse; then cat ${TESTOUTPUTS}/${NAME}.out.nwparse; else cat ${TESTOUTPUTS}/${NAME}.err; fi; done&& \
cd ..&& \
rm -rf QA
edoapra commented 11 months ago

I had a look at the prop_ch3f.out from https://kojipkgs.fedoraproject.org//work/tasks/768/108590768/build.log The missing lines of the output are overwritten from what could be the output of the verbose option of mpiexec

Sending upstream hdr.cmd = CMD_STDOUT
edoapra commented 11 months ago

Could you try the mpich tests again using the following settings (that revert my 2018 commit https://github.com/edoapra/fedpkg/commit/81f0c46970408f1a6e91c9d70bff7ee9d2707705)

export MPIRUN_NPOPT=" -np" && \
export NWCHEM_EXECUTABLE=$NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem$MPI_SUFFIX&& \
timeout --preserve-status --kill-after 10 1800 time ./doafewqmtests.mpi ${NPROC} 2>&1  | tee ../doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& 

This is of course equivalent to apply with the command patch -R the file https://github.com/edoapra/fedpkg/commit/81f0c46970408f1a6e91c9d70bff7ee9d2707705.patch

edoapra commented 11 months ago

If the step describe above at https://github.com/nwchemgit/nwchem/issues/895#issuecomment-1800396409 does not make prop_ch3f to pass, I would suggest simply changing the MPIRUN_NPOPT value and not the actual long doafewqmtests.mpi command

marcindulak commented 11 months ago

Setting export MPIRUN_NPOPT="-np" && \ seems to help https://koji.fedoraproject.org/koji/taskinfo?taskID=108756405.

I think we can close the issue. Maybe mpich needs to be notified about a truncated output when -verbose option is on?

edoapra commented 11 months ago

Setting export MPIRUN_NPOPT="-np" && \ seems to help https://koji.fedoraproject.org/koji/taskinfo?taskID=108756405.

I think we can close the issue.

Great