trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Framework: Some PR build errors showing up as strange 'cmake --build . --config Release -- -j29 -k 0' errors #10823

Closed bartlettroscoe closed 1 year ago

bartlettroscoe commented 2 years ago

Bug Report

@trilinos/framework

Next Action Status

This is due to a defect in CTest introduced in CMake 3.18. The fix for this is in CMake 3.24.3 (released 2022-11-01). (See SNL Kitware #209.) Next: Install CMake 3.24.3 everywhere and use with Trilinos PR builds ...

Internal issues

Description

As shown in this query:

image

the new Trilinos Framework GenConfig build rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables is failing with a build error reported in the Zoltan2Sphynx package showing:

"/projects/sems/install/rhel7-x86_64/sems/v2/utility/cmake/3.21.1/gcc/7.3.0/mxfpluq/bin/cmake" "--build" "." "--config" "Release" "--" "-j29" "-k" "0"

returning error code 1.

As you can see, this is currently failing in the "Master Merge" builds for promotion PRs #10820 and #10797, so this error has nothing to do with a given PR branch. It is impacting 'develop' and will impact everyone's PRs. I saw it because it took out my last PR iteration https://github.com/trilinos/Trilinos/pull/10813#issuecomment-1202330429 for PR #10813.

Steps to Reproduce

Run a PR build.

bartlettroscoe commented 2 years ago

CC: @tcclevenger

And this error just took out a PR testing iteration for PR #10751 shown in the build:

bartlettroscoe commented 2 years ago

CC: @fryeguy52

FYI: I searched the Trilinos 'develop' branch as of commit 1cd5bae44bc:

* 1cd5bae44bc "Merge pull request #10809 from cgcgcg/cxxStandard"
| Author: Christian Glusa <cgcgcg@users.noreply.github.com>
| Date:   Wed Aug 3 08:34:02 2022 -0600 (63 minutes ago)
| 
| M     cmake/ctest/drivers/enigma/TrilinosCTestDriverCore.enigma.gcc.cmake
| M     cmake/ctest/drivers/geminga/TrilinosCTestDriverCore.geminga.gcc-cuda.cmake
| M     cmake/ctest/drivers/geminga/TrilinosCTestDriverCore.geminga.gcc.cmake
| M     cmake/ctest/drivers/lightsaber/TrilinosCTestDriverCore.lightsaber.gcc.cmake
| M     cmake/ctest/drivers/rocketman/TrilinosCTestDriverCore.rocketman.gcc.cmake
| M     cmake/ctest/drivers/trappist/TrilinosCTestDriverCore.trappist.clang.cmake
| M     cmake/ctest/drivers/trappist/TrilinosCTestDriverCore.trappist.gcc.cmake

and I did a search to try to find the code that generates this command:

cmake --build . --config Release -- -j29 -k 0

by running:

$ cd Trilinos/

$ find . -type f -exec grep -nH "[-][-]build [.]" {} \; | grep -v /TriBITS/ | grep -v cmake/tribits/
./kokkos/appveyor.yml:9:    cmake --build . --target install &&
./seacas/.appveyor.yml:83:  - cmd: cmake --build . --config %configuration% -- /maxcpucount:4
./packages/kokkos/appveyor.yml:9:    cmake --build . --target install &&
./packages/sacado/test/GTestSuite/googletest/appveyor.yml:111:    & cmake --build . --config $env:configuration -- $cmake_parallel
./packages/sacado/test/GTestSuite/googletest/googletest/README.md:106:execute_process(COMMAND ${CMAKE_COMMAND} --build .

The closest match above is:

./packages/sacado/test/GTestSuite/googletest/appveyor.yml:111:    & cmake --build . --config $env:configuration -- $cmake_parallel

The name appveyor gets mentioned in:

$ cd packages/sacado/test/GTestSuite/googletest/

$ find . -type f -exec grep -nH appveyor {} \;
./appveyor.yml:63:        appveyor DownloadFile https://github.com/bazelbuild/bazel/releases/download/0.28.1/bazel-0.28.1-windows-x86_64.exe -FileName bazel.exe
./README.md:6:[![Build status](https://ci.appveyor.com/api/projects/status/4o38plt0xbo1ubc8/branch/master?svg=true)](https://ci.appveyor.com/project/GoogleTestAppVeyor/googletest/branch/master)

But looking at the line from appveyor.yml, it shows:

    $cmake_parallel = if ($env:generator -eq "MinGW Makefiles") {"-j2"} else  {"/m"}
    & cmake --build . --config $env:configuration -- $cmake_parallel

Well, that does not match the signature:

cmake --build . --config Release -- -j29 -k 0

It seems that the file appveyor.yml is a configuration file for the tool appveyor, which is part of a CI/CD system called AppVeyor. The main website https://www.appveyor.com/ shows that Google is one of their customers, so I can't see how that would get run in Trilinos PR testing.

So I am stumped as to how this command is getting run as part of Trilinos PR testing.

I will see if I can reproduce these errors myself (word is that we should be able to which I will try out now).
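
The search above can also be written as a single recursive grep, which avoids starting one grep process per file. This is only a sketch: the function name `search_build_invocations` is hypothetical, and the two exclusion patterns are simply the ones used in the `find` pipeline above.

```shell
# Hypothetical helper: list every line in a source tree that contains
# the literal text "--build ." while skipping the TriBITS copies.
search_build_invocations() {
  # $1: root of the tree to search (defaults to the current directory)
  # -r recurse, -n show line numbers, -H always show the file name;
  # -e is needed because the pattern itself starts with "--".
  grep -rnH -e '--build [.]' "${1:-.}" \
    | grep -v -e '/TriBITS/' -e 'cmake/tribits/'
}
```

Called as `search_build_invocations Trilinos/`, it should list the same matches as the `find ... -exec grep` form, though possibly in a different order.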

bartlettroscoe commented 2 years ago

CC: @csiefer2, @e10harvey

NOTE: The builds that show these errors are similar to those that show "6" build errors reported in #10836 in that they show zero tests "Not Run", "Fail" and "Pass". These are a little harder to search for on CDash but this query seems to select them.

Looking over this set of builds, we see different types of errors reported for the command:

"<base-dir>/cmake" "--build" "." "--config" "Debug" "--" "-j20" "-k" "0"

These look like real build errors in Trilinos but they are not being reported correctly with each package. Instead, they are just reported for the outer cmake --build . command.

Here are some examples of different build errors reported:

1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h :

/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp:54:10: fatal error: Trilinos_Util_CrsMatrixGallery.h: No such file or directory
 #include "Trilinos_Util_CrsMatrixGallery.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

NOTE: All of the examples above are from the builds rhel7_sems-gnu-7.2.0 or rhel7_sems-gnu-8.3.0!

2. MueLu_Test_ETI.hpp ISO C++ forbids declaration of ‘type name’ with no type:

In file included from /scratch/trilinos/jenkins/ascic166/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/muelu/test/structured/Driver_Structured.cpp:437:0:
/scratch/trilinos/jenkins/ascic166/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/muelu/test/structured/../unit_tests/MueLu_Test_ETI.hpp: In function ‘bool Automatic_Test_ETI(int, char**)’:
/scratch/trilinos/jenkins/ascic166/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/muelu/test/structured/../unit_tests/MueLu_Test_ETI.hpp:91:31: error: ISO C++ forbids declaration of ‘type name’ with no type [-fpermissive]
   Teuchos::RCP<const Teuchos::MpiComm<int> > comm = Teuchos::rcp_dynamic_cast<const Teuchos::MpiComm<int> >(Teuchos::DefaultComm<int>::getComm());
                               ^~~~~~~

NOTE: All of the examples above are from the builds rhel7_sems-gnu-7.2.0!

3. ninja: error: loading 'build.ninja': No such file or directory:

ninja: error: loading 'build.ninja': No such file or directory

4. No error output:

bartlettroscoe commented 2 years ago

Note, we see the error:

1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h

above being cleanly reported on the 'vortex' builds with the Thyra package shown here impacting PRs #10834, #10802, #10801, and #10751.

What I think is happening is that the same build error for the 'ascic' builds with the 'gnu-7.2.0' and 'gnu-8.3.0' builds is getting reported through the command cmake --build . --config Release -- -j29 -k 0 in a way that I don't understand.

We need to see if we can reproduce this build error on one of the 'ascic' builds locally.

bartlettroscoe commented 2 years ago

FYI: I am trying to reproduce the 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h error for the build:

on the machine 'hpws055'.

bartlettroscoe commented 2 years ago

FYI: I tried to reproduce the build error 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h for the build:

from the machine 'hpws055' and I was not successful in doing so. All of Thyra built just fine, including the executable Thyra_EpetraOperatorWrapper_UnitTests. However, the tests all crash showing:

... lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers

It appears you can't reproduce Trilinos PR builds on HPWS machines at SNL :-(

I will try reproducing on a real 'ascicgpu' machine.

Attempt to reproduce 'EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h' build error with 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1' build on 'hpws055'

Trying to reproduce the build error "EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h" on the machine 'hpws055'.

The repo version is:

```
$ ssh hpws055
$ cd /fgs/rabartl/Trilinos.base/Trilinos/
$ gitdist-status --dist-repos=.
----------------------------------------------------------------
| ID | Repo Dir        | Branch  | Tracking Branch | C | M | ? |
|----|-----------------|---------|-----------------|---|---|---|
|  0 | Trilinos (Base) | develop | github/develop  |   |   |   |
----------------------------------------------------------------
$ gitdist-repo-versions --dist-repos=.
*** Base Git Repo: Trilinos
7256b6e3b61225859d96d22aed7757b446144861 [Fri Aug 5 15:08:43 2022 -0600]
Merge pull request #10814 from iyamazaki/amesos2-pardiso
```

Doing the configure, build, and test with:

```
$ ssh hpws055
$ cd /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/
$ cat load-env-and-cmake-frag-file.sh
if [[ -e GenConfigSettings.cmake ]] ; then
  echo "Remvoing existing file GenConfigSettings.cmake ..."
  rm GenConfigSettings.cmake
fi
source /fgs/rabartl/Trilinos.base/Trilinos/packages/framework/GenConfig/gen-config.sh \
  --cmake-fragment GenConfigSettings.cmake \
  rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables \
  --force \
  "$@"
$ cat do-configure
if [[ -e CMakeCache.txt ]] ; then
  echo "Removing CMakeCache.txt ..."
  rm CMakeCache.txt
fi
if [[ -d CMakeFiles ]] ; then
  echo "Removing CMakeFiles ..."
  rm -r CMakeFiles
fi
cmake \
  -G Ninja \
  -C GenConfigSettings.cmake \
  -D Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=OFF \
  -D Trilinos_ENABLE_TESTS=ON \
  -D Trilinos_TRACE_ADD_TEST=ON \
  "$@" \
  /fgs/rabartl/Trilinos.base/Trilinos
$ . load-env-and-cmake-frag-file.sh
Setting system to 'rhel7' based on specification in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-gnu-7.2.0-openmpi-1.10.1-serial' in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched complete configuration 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables' for build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
* CMake fragment file written to: /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/GenConfigSettings.cmake
$ time ./do-configure -DTrilinos_ENABLE_Thyra=ON &> configure.out && time ninja -j14 &> make.out && time ctest -j14 &> ctest.out

real 0m13.211s
user 0m6.981s
sys 0m5.088s

real 0m0.215s
user 0m0.049s
sys 0m0.043s

real 0m4.059s
user 0m14.974s
sys 0m9.000s
$ grep "failed out of" ctest.out
2% tests passed, 80 tests failed out of 82
```

Well, everything built but I got a bunch of test failures. It seems the problem is:

```
$ ctest -VV -R "^ThyraCore_Simple2DModelEvaluatorUnitTests_MPI_1$"
...
test 1
Start 1: ThyraCore_Simple2DModelEvaluatorUnitTests_MPI_1
1: Test command: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/openmpi/1.10.1/bin/mpirun "--bind-to" "none" "-np" "1" "/fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/packages/thyra/core/test/nonlinear/models/UnitTests/ThyraCore_Simple2DModelEvaluatorUnitTests.exe"
1: Test timeout computed to be: 1500
1: /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/packages/thyra/core/test/nonlinear/models/UnitTests/ThyraCore_Simple2DModelEvaluatorUnitTests.exe: symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers
1: -------------------------------------------------------
1: Primary job terminated normally, but 1 process returned
1: a non-zero exit code.. Per user-direction, the job has been aborted.
1: -------------------------------------------------------
1: --------------------------------------------------------------------------
1: mpirun detected that one or more processes exited with non-zero status, thus causing
1: the job to be terminated. The first process to do so was:
1:
1: Process name: [[42265,1],0]
1: Exit code: 127
1: --------------------------------------------------------------------------
1/1 Test #1: ThyraCore_Simple2DModelEvaluatorUnitTests_MPI_1 ...***Failed Required regular expression not found. Regex=[End Result: TEST PASSED ] 0.26 sec
...
```

Hum, it seems you can't reproduce Trilinos PR build test results from an HPWS machine :-(

bartlettroscoe commented 2 years ago

FYI: I tried to reproduce the build error 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h for the build:

from the machine 'ascicgpu17' and I was not successful in doing so. All of Thyra built just fine, including the executable Thyra_EpetraOperatorWrapper_UnitTests, and all of the tests ran successfully. That build was submitted to CDash here and showed all 82 Thyra tests passing.

Attempt to reproduce 'EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h' build error with 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1' build on 'ascicgpu17'

Trying to reproduce the build error "EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h" on the machine 'ascicgpu17'.

The repo version is:

```
$ ssh ascicgpu17
$ cd /fgs/rabartl/Trilinos.base/Trilinos/
$ gitdist-status --dist-repos=.
----------------------------------------------------------------
| ID | Repo Dir        | Branch  | Tracking Branch | C | M | ? |
|----|-----------------|---------|-----------------|---|---|---|
|  0 | Trilinos (Base) | develop | github/develop  |   |   |   |
----------------------------------------------------------------
$ gitdist-repo-versions --dist-repos=.
*** Base Git Repo: Trilinos
7256b6e3b61225859d96d22aed7757b446144861 [Fri Aug 5 15:08:43 2022 -0600]
Merge pull request #10814 from iyamazaki/amesos2-pardiso
```

Doing the configure, build, and test with:

```
$ ssh ascicgpu
$ cd /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/
$ cat load-env-and-cmake-frag-file.sh
if [[ -e GenConfigSettings.cmake ]] ; then
  echo "Remvoing existing file GenConfigSettings.cmake ..."
  rm GenConfigSettings.cmake
fi
source /fgs/rabartl/Trilinos.base/Trilinos/packages/framework/GenConfig/gen-config.sh \
  --cmake-fragment GenConfigSettings.cmake \
  rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables \
  --force \
  "$@"
$ cat do-configure
if [[ -e CMakeCache.txt ]] ; then
  echo "Removing CMakeCache.txt ..."
  rm CMakeCache.txt
fi
if [[ -d CMakeFiles ]] ; then
  echo "Removing CMakeFiles ..."
  rm -r CMakeFiles
fi
cmake \
  -G Ninja \
  -C GenConfigSettings.cmake \
  -D Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=OFF \
  -D Trilinos_ENABLE_TESTS=ON \
  -D Trilinos_TRACE_ADD_TEST=ON \
  "$@" \
  /fgs/rabartl/Trilinos.base/Trilinos
$ script load-env-and-cmake-frag-file.out
Script started, file is load-env-and-cmake-frag-file.out
[rabartl@ascicgpu17 rhel7_sems-gnu-7.2.0-openmpi-1.10.1]$ . load-env-and-cmake-frag-file.sh
Remvoing existing file GenConfigSettings.cmake ...
Using system 'rhel7' based on matching hostname 'ascicgpu17'.
Overriding system to 'rhel7' based on specification in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-gnu-7.2.0-openmpi-1.10.1-serial' in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched complete configuration 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables' for build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
* CMake fragment file written to: /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/GenConfigSettings.cmake
./do-configure -DTrilinos_ENABLE_Thyra=ON &> configure.out && time make dashboard.out &> make.dashboard.out

real 0m17.319s
user 0m6.755s
sys 0m7.218s

real 3m53.695s
user 29m12.527s
sys 5m55.026s
$ grep "failed out of" make.dashboard.out
100% tests passed, 0 tests failed out of 82
```

bartlettroscoe commented 2 years ago

FYI: There is independent confirmation in new issue #10842 of the error 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h. I will move my analysis of this error over to that issue.

NOTE: My current hypothesis is that an older version of Trilinos from a couple of weeks ago showed this error but has since been fixed on 'develop'. I will test that hypothesis out and document findings in #10842.

bartlettroscoe commented 2 years ago

FYI: There is another clue in https://github.com/trilinos/Trilinos/issues/10842#issuecomment-1208600267. It seems that you might see the error 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h when running out of disk space.

bartlettroscoe commented 2 years ago

FYI: I made my last very careful effort to reproduce the EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h error in https://github.com/trilinos/Trilinos/issues/10842#issuecomment-1208682513 for the 'vortex' build for PR #10808 and I was not able to do so (i.e. it passed the build).

jhux2 commented 2 years ago

Note that this issue is also tracking what was reported in #10906.

"In some PR testing, compile failures are erroneously showing up under the subproject Zoltan2Sphyx."

bartlettroscoe commented 2 years ago

FYI: Still no XML files being archived in the Jenkins jobs to allow us to debug what is causing this behavior. See TRILINOSHD-188.

bartlettroscoe commented 2 years ago

FYI: We are still seeing a bunch of these cases where errors are reported to Zoltan2Sphynx as seen here over the last 2 days with 7 PR iterations showing failures:

image

bartlettroscoe commented 2 years ago

FYI: See https://github.com/trilinos/Trilinos/issues/10836#issuecomment-1230841804 and https://github.com/trilinos/Trilinos/issues/10836#issuecomment-1230849644.

bartlettroscoe commented 2 years ago

CC: @e10harvey, @zackgalbreath

FYI: The problem of reporting the global cmake --build . [other arguments] command does not seem to be solved. In the build PR-10962-test-rhel7_sems-gnu-7.2.0-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_no-mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1025 that just ran an hour ago, it shows the build errors:

image

which shows a build error in the example object file:

packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o

Why is that build error not being reported along with the Compadre package?

The Build.xml file archived in:

shown here is given below.

What is strange about these two build errors is that they are for the same Compadre build error:

/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp: In function ‘int main(int, char**)’:
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:9: error: iteration 2147483647 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
         for (int j=0; j<dimension-1; ++j) {
         ^~~
[CTest: warning matched] /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:24: note: within this loop
         for (int j=0; j<dimension-1; ++j) {
                       ~^~~~~~~~~~~~
[CTest: warning matched] cc1plus: all warnings being treated as errors

and the Build.xml file shows two entries for the same build error. It is almost as if the ctest -S process is running the build twice: once with launchers turned on and then a follow-up build with launchers turned off.

The second failure for the global cmake --build command entry in the XML file shows:

        <Failure type="Error">
            <!-- Meta-information about the build action -->
            <Action/>
            <!-- Details of command -->
            <Command>
                <WorkingDirectory>/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/pull_request_test</WorkingDirectory>
                <Argument>/projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.19.1/bin/cmake</Argument>
                <Argument>--build</Argument>
                <Argument>.</Argument>
                <Argument>--config</Argument>
                <Argument>Debug</Argument>
                <Argument>--</Argument>
                <Argument>-j20</Argument>
                <Argument>-k</Argument>
                <Argument>0</Argument>
            </Command>
            <!-- Result of command -->
            <Result>
                <StdOut>[1/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_NumericTraits.cpp.o
[2/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
[3/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_MemorySpace.cpp.o
[4/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_MemoryPool.cpp.o
[5/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Spinwait.cpp.o
...
FAILED: packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o 
"/projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.19.1/bin/ctest" --launch --target-name Compadre_GMLS_Manifold_Test --build-dir /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/pull_request_test/packages/compadre/examples --output packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o --source /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp --language CXX --filter-prefix "" -- /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/bin/g++  -I. -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples -Ipackages/compadre/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src/basis -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src/constraints -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src/tpl -Ipackages/kokkos/core/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/core/src -Ipackages/kokkos -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/core/src/../../tpls/desul/include -Ipackages/kokkos/containers/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/containers/src -Ipackages/kokkos/algorithms/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/algorithms/src -Ipackages/kokkos-kernels/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src 
-I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/impl -Ipackages/kokkos-kernels/src/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/impl/tpls -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/blas -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/blas/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/sparse -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/sparse/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/graph -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/graph/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/dense -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/dense/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/sparse -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/sparse/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/common -isystem /projects/sems/install/rhel7-x86_64/sems/tpl/superlu/4.3/gcc/7.2.0/base/include -pedantic -Wall -Wno-long-long -Wwrite-strings -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs 
-Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-unused-but-set-variable -Wno-unused-variable -Wno-unused-label -Werror -DTRILINOS_HIDE_DEPRECATED_HEADER_WARNINGS   -O3 -DNDEBUG -std=c++14 -MD -MT packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o -MF packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o.d -o packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o -c /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp: In function ‘int main(int, char**)’:
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:9: error: iteration 2147483647 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
         for (int j=0; j&lt;dimension-1; ++j) {
         ^~~
[CTest: warning matched] /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:24: note: within this loop
         for (int j=0; j&lt;dimension-1; ++j) {
                       ~^~~~~~~~~~~~
[CTest: warning matched] cc1plus: all warnings being treated as errors
[8514/13366] Building CXX object packages/stk/stk_util/stk_util/util/CMakeFiles/stk_util_util.dir/tokenize.cpp.o
[8515/13366] Building CXX object packages/stk/stk_util/stk_util/util/CMakeFiles/stk_util_util.dir/human_bytes.cpp.o
[8516/13366] Building CXX object packages/stk/stk_util/stk_util/environment/CMakeFiles/stk_util_env.dir/CPUTime.cpp.o
...
[13363/13366] Linking CXX executable packages/piro/test/Piro_ThyraSolver.exe
[13364/13366] Building CXX object packages/trilinoscouplings/examples/scaling/CMakeFiles/TrilinosCouplings_Example_Poisson_NoFE_Tpetra.dir/example_Poisson_NoFE_Tpetra.cpp.o
[13365/13366] Linking CXX executable packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson_NoFE_Tpetra.exe
ninja: build stopped: cannot make progress due to previous errors.</StdOut>
                <StdErr/>
                <ExitCondition>1</ExitCondition>
            </Result>
        </Failure>

This is so strange.
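
One way to see which `<Failure>` entries in a Build.xml file are recorded against the outer cmake --build command, rather than against an individual compile, is to print the command stored in each entry. The following is a minimal sketch, not part of the Trilinos tooling: the function name `list_failure_commands` is hypothetical, and it assumes each `<Argument>` element sits on its own line, as in the excerpt above.

```shell
# Hypothetical helper: print one line per <Failure> entry in a Build.xml
# file, made up of that entry's <Argument> values joined by spaces.
list_failure_commands() {
  # $1: path to a Build.xml file
  awk '
    /<Failure[ >]/ { in_failure = 1; cmd = "" }   # start of a new entry
    in_failure && /<Argument>/ {
      arg = $0
      sub(/^.*<Argument>/, "", arg)               # strip the opening tag
      sub(/<\/Argument>.*$/, "", arg)             # strip the closing tag
      cmd = (cmd == "" ? arg : cmd " " arg)
    }
    /<\/Failure>/ { if (in_failure) print cmd; in_failure = 0 }
  ' "$1"
}
```

Piping the result through `grep -e '--build'` would then leave only the entries recorded against the global build command.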

bartlettroscoe commented 2 years ago

FYI: The behavior described above turns out to be a CTest defect. For details and to follow the fix, see:

Unfortunately, I think that means we will need to upgrade CMake/CTest on all client machines to fix this which will require waiting for CMake 3.25.0 in Jan 2023 (or perhaps a patch release of CMake 3.24).

Update: The fix is going to come out in CMake 3.24.3!

bartlettroscoe commented 1 year ago

FYI: The fix for this is in CMake 3.24.3 (released 2022-11-01). (See SNL Kitware #209.) Next: Install CMake 3.24.3 everywhere and use with Trilinos PR builds ...

bartlettroscoe commented 1 year ago

With the upgrade to CMake 3.24.3 for all of the Trilinos PR builds yesterday, this should be resolved (see TRILINOSHD-228). For example, we are only seeing build errors for actual targets in the PR builds over the last day shown here and we see just the build error for the target:

Error building packages/seacas/libraries/ioss/src/exodus/CMakeFiles/Ioex.dir/Ioex_ParallelDatabaseIO.C.o

in the build PR-11309-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1484.

Using a version of CMake between versions 3.19 and 3.24.2 (inclusive), we would have seen that same error showing up along with the entire ninja build output for all targets (including all warnings, which was the cause of #10836).
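
A quick way to tell whether a given machine's CMake falls in that range is a version comparison. This is a rough sketch under two assumptions: `sort -V` is available (it is in GNU coreutils), and the bounds are the 3.19 and 3.24.2 quoted in the sentence above, although the Next Action Status at the top of this issue puts the defect's introduction at 3.18. The function names `version_le` and `cmake_in_affected_range` are hypothetical.

```shell
# True when version $1 <= version $2, using sort's version ordering (-V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# True when the CMake version string in $1 (e.g. 3.21.1) lies within
# the range quoted above: 3.19 through 3.24.2, inclusive.
cmake_in_affected_range() {
  version_le 3.19 "$1" && version_le "$1" 3.24.2
}
```

For example, `cmake_in_affected_range "$(cmake --version | head -n 1 | awk '{print $3}')"` would check the cmake on the current PATH, since the first line of `cmake --version` has the form `cmake version X.Y.Z`.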

Closing this as complete.

Boy, that was a hard one to diagnose. But the fact that Kitware was willing to patch CMake 3.24.3, SEMS was willing to install CMake 3.24.3, and the Trilinos Framework team was willing and able to upgrade all of the PR builds is what allowed this to be fixed relatively quickly.