Closed. @bartlettroscoe closed this issue 6 years ago.
@bartlettroscoe Can we enable OpenMP but force `OMP_NUM_THREADS=1`? Some of those Xpetra and MueLu "experimental" build options may not have any effect unless OpenMP is enabled.
> Can we enable OpenMP but force `OMP_NUM_THREADS=1`? Some of those Xpetra and MueLu "experimental" build options may not have any effect unless OpenMP is enabled.
I guess I can try that. But I wonder, even with `OMP_NUM_THREADS=1`, whether all of the threads will be bound to the same core or not.
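As an aside, a minimal way to check that, assuming a standard OpenMP runtime and OpenMPI's mpirun, is to have the runtime report the binding it actually uses. This is just an illustrative sketch, not something done as part of this issue:

```shell
# Hedged sketch: OMP_DISPLAY_ENV and OMP_PROC_BIND are standard OpenMP
# environment variables; the test executable name is a placeholder.
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=false       # keep each rank's thread from being pinned to the same core
export OMP_DISPLAY_ENV=verbose   # make the runtime print the affinity settings it actually uses
mpirun -np 4 ./SomeTrilinosTest.exe
```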
Also, note that there are ATDM builds of Trilinos that enable experimental MueLu code and that build and run tests with a serial Kokkos node, as shown at:
@csiefer2 would know for sure whether disabling OpenMP is adequate. My guess is no, because some of the sparse matrix-matrix multiply code takes different paths if OpenMP is enabled.
OpenMPNode and SerialNode trigger different code paths in chunks of Tpetra. AFAIK MueLu does not do node type specialization (except for Epetra).
What you choose to test for PR doesn't really matter, but they both need to stay working (more or less).
> OpenMPNode and SerialNode trigger different code paths in chunks of Tpetra. AFAIK MueLu does not do node type specialization (except for Epetra). What you choose to test for PR doesn't really matter, but they both need to stay working (more or less).
The GCC 4.8.4 PR build will test the OpenMP path and the Intel 17.x build will test the Serial node path. And the ATDM builds of Trilinos are already testing both paths and have been for many weeks now, as you can see at:
@bartlettroscoe Cool, then I'm OK with this :)
I submitted PR #2467 to enable Xpetra and MueLu experimental code in the standard CI build. If someone can quickly review that, then I can merge.
I tested the full CI build going from OpenMPI 1.6.5 to 1.8.7 in the branch `2462-openmpi-1.6.5-to-1.8.7` in my fork of Trilinos (git@github.com:bartlettroscoe/Trilinos.git), and it caused 30 tests to time out (see details below). I can't tell if these are hangs or just that MPI communication is taking longer. Someone would need to research that. In any case, we are a no-go for upgrading from OpenMPI 1.6.5 to 1.8.7.
I will try updating from OpenMPI 1.6.5 to 1.10.1 (which is the only other OpenMPI implementation that SEMS provides) and see how that goes.
@bartlettroscoe I have heard complaints about OpenMPI 1.8.x bugs. The OpenMPI web page considers it "retired" -- in fact, the oldest "not retired" version is 1.10.
@prwolfe Have you seen issues like this with OpenMPI 1.8.x?
I tested the full CI build going from OpenMPI 1.6.5 to 1.10.1 in the branch `2462-openmpi-1.6.5-to-1.10.1` in my fork of Trilinos (git@github.com:bartlettroscoe/Trilinos.git), and it caused 34 tests to time out (see details below). I can't tell if these are hangs or just that MPI communication is taking longer to complete (which is hard to believe).
I am wondering if there is some problem with the way these tests are using MPI, and whether someone should dig in and try to debug some of these timeouts to see why they are happening. Perhaps there are some real defects in the code that these updated versions of OpenMPI are bringing out?
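One generic way to start digging in (these are standard CTest options, not something prescribed in this issue) would be to rerun just the timed-out tests with a larger timeout and full output, to see whether they eventually finish or truly hang:

```shell
# Hedged sketch using standard ctest options; the build directory, test name,
# and 1800-second timeout are placeholders.
cd <build-dir>
ctest --rerun-failed --output-on-failure --timeout 1800
# or target a single suspect test by name and watch it directly:
ctest -R <TestName> -V --timeout 1800
```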
> I have heard complaints about OpenMPI 1.8.x bugs. The OpenMPI web page considers it "retired" -- in fact, the oldest "not retired" version is 1.10.
Okay, given that OpenMPI 1.10 is the oldest version of OpenMPI that is still supported, we should try to debug what is causing these timeouts. I will submit an experimental build to CDash and then we can go from there.
We had lots of issues with 1.8; that's why we abandoned it. Basically it was slow and would not properly place processes. In fact, we have had some issues with 1.10, but those responded well to placement directives.
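For reference, the placement directives mentioned here would be OpenMPI mpirun options along the lines of the sketch below; these are standard OpenMPI 1.8/1.10 flags, not necessarily the exact ones used in those builds:

```shell
# Hedged sketch: --map-by, --bind-to, and --report-bindings are standard
# OpenMPI mpirun options; the executable name is a placeholder.
mpirun -np 4 --map-by core --bind-to core ./SomeTrilinosTest.exe
# print the resulting bindings so the placement can be verified:
mpirun -np 4 --report-bindings --map-by socket ./SomeTrilinosTest.exe
```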
I remember the "let's try 1.8 .... oh that was bad let's not" episode :(
I merged #2467 which enables experimental code in Xpetra and MueLu in the GCC 4.8.4 CI build.
I ran the full Trilinos CI build and test suites with OpenMPI 1.6.5 (the current version used) and OpenMPI 1.10.1 on my machine crf450 and submitted to CDash using an all-at-once configure, build, and test:
The machine was loaded with other builds, so I don't totally trust the timing numbers it showed, but it seems that some tests and package test suites run much faster with OpenMPI 1.10.1 and others run much slower with OpenMPI 1.10.1 vs. OpenMPI 1.6.5. Overall the test suites took:
- OpenMPI 1.6.5: 53m5s (run with `ctest -j4`)
- OpenMPI 1.10.1: 1h8m32s (run with `ctest -j4`)
You can see some of the detailed numbers on the CDash pages above and in the below notes.
I rebooted my machine crf450 and I will run these again and see what happens. But if I see numbers similar to this again, I will post a new Trilinos GitHub issue to focus on problems with Trilinos and OpenMPI 1.10.1.
PR #2609 provides a single *.cmake file to define this build. The auto PR tester driver bash script just needs to source:
$ source <trilinos-base-dir>/cmake/load_sems_dev_env.sh
and then the `ctest -S <script>` driver script just needs the argument:
-C <trilinos-base-dir>/cmake/std/MpiReleaseDebugSharedPtSerial.cmake
and that is it.
The most important settings that we don't want to duplicate all over the place, and that are pulled in by this file, come from the files `SEMSDevEnv.cmake` and `BasicCiTestingSettings.cmake`.
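Putting those pieces together, the whole sequence for the PR driver would look roughly like the sketch below; the `<trilinos-base-dir>` and `<script>` placeholders stand for paths and a driver script that are not spelled out in this issue:

```shell
# Hedged sketch of the sequence described above.
source <trilinos-base-dir>/cmake/load_sems_dev_env.sh
# The ctest -S driver then only needs to be pointed at the single build-options
# file, which in turn pulls in SEMSDevEnv.cmake and BasicCiTestingSettings.cmake:
ctest -S <script> -C <trilinos-base-dir>/cmake/std/MpiReleaseDebugSharedPtSerial.cmake
```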
Now that the `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` build is 100% clean as described in https://github.com/trilinos/Trilinos/issues/2691#issuecomment-393184370, I will change this over to be the new CI build and have it be used as the default build for the checkin-test-sems.sh script.
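As a usage note (an assumed example, not taken from this issue), developers would then just run the wrapper as before and pick up the new default build:

```shell
# Hypothetical invocation; --do-all and --push are standard TriBITS
# checkin-test options that checkin-test-sems.sh wraps.
./checkin-test-sems.sh --do-all --push
```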
@trilinos/framework,

This build is now ready to be used to replace the existing GCC 4.8.4 auto PR build. The build `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` completely matches the agreed-to GCC 4.8.4 build in #2317.
The post-push CI build linked to from:
is now set to the updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build, and it finished the initial build this morning with all 53 packages built and all 2722 tests passing. It ran all of these tests in a wall-clock time of 24m 56s (on 8 cores).
@trilinos/framework, I think this build should be ready to substitute for the existing GCC 4.8.4 auto PR build. Should we open a new GitHub issue for that?
Otherwise, I am putting this in review.
Given that issue #2788 exists for using this configuration for the auto PR GCC 4.8.4 build, I am closing this issue #2462 since there is nothing left to do. This updated configuration is being used in the post-push CI build, so we will get an email if there are any failures going forward.
CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott
Next Action Status
The post-push CI build and the checkin-test-sems.sh script are now updated to use the updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build. Consideration of using this build in auto PR testing is being addressed in #2788.
Description
This Issue is to scope out and track efforts to upgrade the existing SEMS-based Trilinos CI build (see #482 and #1304) to match the selected GCC 4.8.4 auto PR build as described in https://github.com/trilinos/Trilinos/issues/2317#issuecomment-376551457. The existing GCC 4.8.4 CI build shown here has been running for 1.5+ years and has been maintained over that time. That build has many but not all of the settings of the selected GCC 4.8.4 auto PR build listed here. The primary changes that need to be made are:
- Enable `Xpetra_ENABLE_Experimental=ON` and `MueLu_ENABLE_Experimental=ON` (note objection in https://github.com/trilinos/Trilinos/issues/2317#issuecomment-376575762).
- Enable OpenMP (and run tests with `OMP_NUM_THREADS=2`).

The most difficult change will likely be to enable OpenMP because of the problem of the threads all binding to the same cores as described in #2422. Therefore, the initial auto PR build may not have OpenMP enabled due to these challenges.
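For concreteness, a hand-configured build with these options would look roughly like the sketch below; the option names come from this issue, while the source-tree path is a placeholder:

```shell
# Hedged sketch of the configure options discussed above.
export OMP_NUM_THREADS=2
cmake \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D Xpetra_ENABLE_Experimental=ON \
  -D MueLu_ENABLE_Experimental=ON \
  <trilinos-base-dir>
```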
Tasks:
1. Enable `Xpetra_ENABLE_Experimental=ON` and `MueLu_ENABLE_Experimental=ON` in CI build ... Merged in #2467 and was later removed in 7481c760699d8b0c30034782cb2ef0c742ce6657 [DONE]
2. Upgrade from OpenMPI 1.6.5 to OpenMPI 1.10.1 (see build `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` in #2688) [DONE]
3. Enable `Trilinos_ENABLE_OpenMP=ON` and `OMP_NUM_THREADS=2` (see build `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` in #2688) [DONE]

Related Issues: