Closed marcindulak closed 11 months ago
Have a script that given the obtained and reference output files compares them and prints the summary of differences.
I am not sure how different is what you are suggesting from what the current (albeit unsophisticated) $NWCHEM_TOP/QA/nwparse.pl script does.
Could you please elaborate what is missing?
Yes, nwparse.pl solves
- Have a script that given the obtained and reference output files compares them and prints the summary of differences.
I see also that 1. (summarizing failures) may be extracted with
diff <(grep "Running tests/" build.log) <(grep -E "Running tests/|verifying output ... OK" build.log | grep "verifying output" -B 1 | grep Running)
6d5
< Running tests/pyqa3/pyqa3
41d39
< Running tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk
45d42
< Running tests/h2o_dk/h2o_dk
47d43
< Running tests/pyqa3/pyqa3
53d48
< Running tests/prop_ch3f/prop_ch3f
58d52
< Running tests/h2o-campbe0/h2o-campbe0
82d75
< Running tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk
This shows an additional missing success on tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk
- when started locally it's running for 30 minutes and it's not done yet.
This means it gets killed due to the tests timeout I imposed timeout --preserve-status --kill-after 10 1800 time ./dolibxctests.mpi 2
. Maybe there is a smaller example that tests a similar functionality?
This shows an additional missing success on
tests/libxc_waterdimer_bmk/libxc_waterdimer_bmk
- when started locally it's running for 30 minutes and it's not done yet. This means it gets killed due to the tests timeout I imposedtimeout --preserve-status --kill-after 10 1800 time ./dolibxctests.mpi 2
. Maybe there is a smaller example that tests a similar functionality?
Sure, that's the purpose of all the other tests in dolibxctests.mpi
. You could apply a patch that skips libxc_waterdimer_bmk
Any luck in inspecting the mpich output files?
I'm trying to print the outputs on stdout during the build.
Here are the builds that print outputs of the failed tests.
7.2.2 https://github.com/nwchemgit/nwchem/commit/74936fb92aec6990ce48bf334747215813684576 https://koji.fedoraproject.org/koji/taskinfo?taskID=108590680
master https://github.com/nwchemgit/nwchem/commit/45f057eca7e3bbe85815efe2bbcbbcacc6b95471 https://koji.fedoraproject.org/koji/taskinfo?taskID=108590838
If you want to keep the build.log files or src.rpm, please download them since they'll be removed at some point (I don't know when).
The outputs of the failed tests, and their nwparse.pl outputs are included in the build.log. First the summary of all tests is printed, then the out and out.nwparse files of the individual tests. I indicated where the outputs start with a long line of hashes '#'.
Could this mean that export HYDRA_DEBUG=0
does not help anymore https://github.com/edoapra/fedpkg/commit/fc842066088321000a481e837bd313899df77753 in getting mpich to print all output? I would expect mpich to print all output, independently of the debug options.
libxc_tddftgrad_h2 fails with openmpi. While libxc_tddftgrad_h2.out looks complete and taking 5.9 seconds, libxc_tddftgrad_h2.out.nwparse is not generated and libxc_tddftgrad_h2.err is empty. I think this may be due to the timeout timeout --preserve-status --kill-after 10 1800 time ./dolibxctests.mpi 2
h2o_opt fails with mpich
@@@ Comparison of Output Files
@@ -267,3 +267,2 @@
H 0.000 0.000 0.000
H 0.078 0.105 0.000
-Effective nuclear repulsion energy (a.u.) 8.30
The h2o_opt.out extracted from the build.log is attached to my comment h2o_opt.log
The very last Effective nuclear repulsion energy (a.u.)
is missing in the output, and the place where it should appear looks like this
Atomic Mass
-----------
O 15.994910
H 1.007825
Sending upstream hdr.cmd = CMD_STDOUT
Nuclear Dipole moment (a.u.)
----------------------------
prop_ch3f fails with mpich
@@@ Comparison of Output Files
@@ -85,6 +85,4 @@
isotropic = 29.154
anisotropy = 7.162
-isotropic = 29.154
-anisotropy = 7.162
Total DFT energy = -139.75155
DMX 0.000
...
I'm not analysing all missing lines, but see this one which causes isotropic = 29.154
to be missing
Total Shielding Tensor
29.5844 -1.0200 -2.8262
-1.0200 26.0511 0.7573
-2.8262 0.7573 31.8258
Sending upstream hdr.cmd = CMD_STDOUT
anisotropy = 7.1624
libxc_he2p_wb97 fails with mpich
@@@ Comparison of Output Files
@@ -1,3 +1,2 @@
-Effective nuclear repulsion energy (a.u.) 0.00
Effective nuclear repulsion energy (a.u.) 1.76
Effective nuclear repulsion energy (a.u.) 1.76
I'm not analysing this, but expect again a Sending upstream hdr.cmd = CMD_STDOUT
.
In case we need to do it again, this is the build macro that was used
%global docheck() \
cp -rp QA.orig QA&& \
cd QA&& \
export LD_LIBRARY_PATH=${MPI_LIB}&& \
export PATH=%{PKG_TOP}/bin/$NWCHEM_TARGET:${MPI_BIN}:${PATH}&& \
export MPIRUN_PATH=${MPI_BIN}/mpiexec&& \
export MPIRUN_NPOPT="-verbose -np" && \
export USE_LIBXC=True && \
export NWCHEM_EXECUTABLE=%{PKG_TOP}/bin/$NWCHEM_TARGET/nwchem$MPI_SUFFIX&& \
timeout ${TIMEOUT_OPTS} time ./doafewqmtests.mpi ${NPROC} 2>&1 < /dev/null | tee ../doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
mv testoutputs ../testoutputs.doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
timeout ${TIMEOUT_OPTS} time ./dolibxctests.mpi ${NPROC} 2>&1 < /dev/null | tee ../dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
mv testoutputs ../testoutputs.dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
BUILD_LOG=../doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
TESTOUTPUTS=../testoutputs.doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&& \
ls -al ${TESTOUTPUTS}&& \
for f in $(diff <(grep "Running tests/" ${BUILD_LOG}) <(grep -E "Running tests/|verifying output ... OK" ${BUILD_LOG} | grep "verifying output" -B 1 | grep Running) | grep Running | cut -d' ' -f 4); do printf '#%.0s' {1..80} && echo && NAME=$(basename ${f}) && echo ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && cat ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && if test -r ${TESTOUTPUTS}/${NAME}.out.nwparse; then cat ${TESTOUTPUTS}/${NAME}.out.nwparse; else cat ${TESTOUTPUTS}/${NAME}.err; fi; done&& \
BUILD_LOG=../dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
TESTOUTPUTS=../testoutputs.dolibxctests.mpi.${NPROC}$MPI_SUFFIX.log&& \
ls -al ${TESTOUTPUTS}&& \
for f in $(diff <(grep "Running tests/" ${BUILD_LOG}) <(grep -E "Running tests/|verifying output ... OK" ${BUILD_LOG} | grep "verifying output" -B 1 | grep Running) | grep Running | cut -d' ' -f 4); do printf '#%.0s' {1..80} && echo && NAME=$(basename ${f}) && echo ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && cat ${TESTOUTPUTS}/${NAME}.out && printf '#%.0s' {1..80} && echo && if test -r ${TESTOUTPUTS}/${NAME}.out.nwparse; then cat ${TESTOUTPUTS}/${NAME}.out.nwparse; else cat ${TESTOUTPUTS}/${NAME}.err; fi; done&& \
cd ..&& \
rm -rf QA
I had a look at the prop_ch3f.out from https://kojipkgs.fedoraproject.org//work/tasks/768/108590768/build.log The missing lines of the output are overwritten from what could be the output of the verbose option of mpiexec
Sending upstream hdr.cmd = CMD_STDOUT
Could you try the mpich tests again using the following settings (that revert my 2018 commit https://github.com/edoapra/fedpkg/commit/81f0c46970408f1a6e91c9d70bff7ee9d2707705)
export MPIRUN_NPOPT=" -np" && \
export NWCHEM_EXECUTABLE=$NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem$MPI_SUFFIX&& \
timeout --preserve-status --kill-after 10 1800 time ./doafewqmtests.mpi ${NPROC} 2>&1 | tee ../doafewqmtests.mpi.${NPROC}$MPI_SUFFIX.log&&
This is of course equivalent to apply with the command patch -R
the file
https://github.com/edoapra/fedpkg/commit/81f0c46970408f1a6e91c9d70bff7ee9d2707705.patch
If the step describe above at https://github.com/nwchemgit/nwchem/issues/895#issuecomment-1800396409 does not make prop_ch3f to pass, I would suggest simply changing the MPIRUN_NPOPT
value and not the actual long doafewqmtests.mpi
command
Setting export MPIRUN_NPOPT="-np" && \
seems to help https://koji.fedoraproject.org/koji/taskinfo?taskID=108756405.
I think we can close the issue. Maybe mpich needs to be notified about a truncated output when -verbose
option is on?
Setting
export MPIRUN_NPOPT="-np" && \
seems to help https://koji.fedoraproject.org/koji/taskinfo?taskID=108756405.I think we can close the issue.
Great
Describe the bug
This is a follow up on #892. It collects the failures encountered when building nwchem 7.2.1 https://github.com/nwchemgit/nwchem/commit/487f8b945fbe9cedf02757dacddc66d40dc74ed9 on fedora 39, with mpich.
It does not look like the found errors are reproducible, maybe some are.
Describe settings used
The build is available at https://koji.fedoraproject.org/koji/taskinfo?taskID=107865675
Attach log files
List the failures on x86_64, this does not include crashes
See the build.log file for the reported diffs, pasted below manually.
1.
2.
3.
To Reproduce
It seems like the errors obtained when running on fedora builders are nor reproduced using the steps below.
Use the following Dockerfile, providing
TEST
and an argument, for exampledocker build --build-arg TEST=tests/h2o_dk/h2o_dk --progress=plain -t test:latest . 2>&1 | cut -d' ' -f3- > out.txt
FI_PROVIDER=tcp
is to avoid the error below, I found this suggestion here https://github.com/open-mpi/ompi/issues/11295#issuecomment-1382190063I'm unable the reproduce failure 1. (tests/h2o_dk/h2o_dk), which suggests that
Effective nuclear repulsion energy (a.u.) 9.01
is missing in the generated output?I'm unable to reproduce failure 2., which suggests that a line like
-XX -8.702 0.000 -8.702
is missing in the generated output.I may reproduce failure 3. (or maybe still not?) - it looks like sometimes minus precedes -0.0000000000 as in
DMY -0.0000000000
or in other components, while the reference file contains 0.0000000000.Additional context
Some tools would be helpful to facilitate error reporting and analysis, maybe some already exist: