trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Set up a CUDA build for an auto PR build #2464

Closed bartlettroscoe closed 5 years ago

bartlettroscoe commented 6 years ago

CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott, @nmhamster

Description

This Issue is to scope out and track efforts to set up a CUDA build of Trilinos to be used as an auto PR build as described in https://github.com/trilinos/Trilinos/issues/2317#issuecomment-376551457.

For this build it was agreed to use the ATDM build on 'white' that is currently running and submitting to CDash. Questions about how to extend this build to be used as an auto PR build include:

Tasks:

  1. Clean up the existing CUDA build on white until it is 100% clean [Done]
  2. Set up an all-at-once nightly build that enables all PT packages and submits to CDash "Specialized" [Done]
  3. Clean up the all-at-once nightly build for all PT packages (disable whatever should be disabled) ...
  4. ???

Related Issues:

bartlettroscoe commented 5 years ago

This last Thursday, I had a long conversation with @jwillenbring about this CUDA PR build on 'ride' and 'white'. We decided that it was time to turn this over to the @trilinos/framework team. The candidate CUDA build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt on 'ride', which has been running for the last 32 days, is shown at:

As you can see from that query, the build and the tests all completed and submitted results every day. Recently, the builds have only been taking about 2h 30m and the tests about 1h 11m. That is well within a reasonable runtime for a Trilinos PR build. And the robustness has been excellent. In fact, if you look at all 186 ATDM Trilinos builds run on 'ride' since 9/26/2018 here, there was not a single case where build and test results did not get submitted.

The configuration for this build is simply:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

NOTE: You can load the env, configure, and build completely on a compute node. To do this, just put your CTest -S driver in a bash script and run it with:

$ bsub -x -Is -q rhel7F -n 16 <ctest_s_driver_script>
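
For example, a minimal wrapper script might look like the following. This is only a sketch: the script name and the CTest -S driver path are illustrative placeholders, not the actual files used on 'ride'.

```shell
# Write a hypothetical wrapper script that loads the ATDM env and
# invokes a CTest -S driver; the driver path below is illustrative.
cat > ctest_s_driver_script.sh <<'EOF'
#!/bin/bash -e
source "${TRILINOS_DIR}/cmake/std/atdm/load-env.sh" cuda-9.2-debug
ctest -V -S "${TRILINOS_DIR}/cmake/ctest/drivers/my-ctest-driver.cmake"
EOF
chmod +x ctest_s_driver_script.sh
head -1 ctest_s_driver_script.sh
```

Submitting this wrapper with bsub as shown above then runs the entire load-env/configure/build/test cycle on a compute node.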

You can see exactly how the ATDM Trilinos CTest -S driver system (that uses the TriBITS CTest -S driver) drives this build on 'ride' by looking at this file:

The jenkins-srn.sandia.gov project driver can be seen at:

(NOTE: We use the Jenkins job-inheritance plugin so you will have to trace subclass jobs.)

Hopefully that should be all of the information that the Framework team needs to be able to copy the settings for this build out into their own files and then stand this build up. If not, please ask me any questions you might have about this.

@jwillenbring and I talked about the Framework team doing the following:

  1. Copy the settings out of the above-referenced files under Trilinos/cmake/std/atdm/ into new files under Trilinos/cmake/std/ that the Framework team will own. (This will avoid depending on files under Trilinos/cmake/std/atdm/ that don't get tested in PR builds.)

  2. Disable the current set of tests that don't build and the tests that don't run in the *.cmake file for this PR CUDA build (see these Issues).

  3. Set up a new CTest -S driver script for this CUDA build on 'ride' and set up a Jenkins job on 'ride' on jenkins-srn.sandia.gov to driver this build.

  4. Start running this CUDA build as part of the develop to master promotion PR builds and see how it goes for a while.

  5. Once the new CUDA build is shown to be robust running in develop to master promotion builds for a while, then add this CUDA build to the set of Trilinos PR builds.


NOTE: We have been seeing problems getting test results for this build on 'white' recently for the last three days since 10/23/2018 as shown here. The Jenkins jobs for these builds can be seen at:

(see build numbers #79, #80, and #81). From looking at the Jenkins output, it seems that a bunch of tests are timing out at 10 minutes for some reason, and the jobs were killed at the SLURM job time limit of 12 hours. But since this PR build would run on 'ride', not 'white', we can track down those problems with the Test Bed team. I will bring this up at the ATDM PI meeting on Monday.

bartlettroscoe commented 5 years ago

FYI: I removed the ATDM DevOps label and added the Framework label in order to officially hand this off to the @trilinos/framework team.

jhux2 commented 5 years ago

@trilinos/framework Please include the following option in the CUDA configure for this PR:

-D MueLu_ENABLE_Kokkos_Refactor_Use_By_Default:BOOL=YES

Thank you.
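
In the PR build's settings *.cmake file, that request would presumably be expressed something like the following sketch (the exact form the Framework team uses may differ):

```cmake
# Hedged sketch: force-enable the MueLu Kokkos-refactor code paths
# in the CUDA PR configuration, as requested above.
set(MueLu_ENABLE_Kokkos_Refactor_Use_By_Default ON CACHE BOOL
  "Requested for the CUDA PR build")
```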

srajama1 commented 5 years ago

@jwillenbring : @bartlettroscoe brought this up in the ATDM meeting. Can someone describe a plan for this? Also, it would be nice to assign it to a person, so there is a responsible person we can talk to.

@jwillenbring @william76

bartlettroscoe commented 5 years ago

@trilinos/framework,

For the reasons described in #3998, this needs to be a release-debug-pt build, not just a debug build. The cuda-9.2-release-debug build runs more tests and catches more defects than either a cuda-9.2-opt or cuda-9.2-debug build (and should run faster than a cuda-9.2-debug build).

I will update the instructions above accordingly.

jwillenbring commented 5 years ago

@jwillenbring : @bartlettroscoe brought this up in the ATDM meeting. Can someone describe a plan for this? Also, it would be nice to assign it to a person, so there is a responsible person we can talk to.

@srajama1

I just assigned this to @ZUUL42 who was going to meet with @prwolfe yesterday to start work on this. We spoke a little about the general process and he recently set up the new 7.3 dev->master build.

srajama1 commented 5 years ago

@jwillenbring Thank you !

bartlettroscoe commented 5 years ago

FYI: A CUDA PR build would have avoided #4050, which took out almost all of the Panzer tests (i.e. we lost a day of CUDA tests for Panzer that we were hoping to get to help clean up our CUDA builds).

bartlettroscoe commented 5 years ago

@trilinos/framework

FYI: The ascic-jenkins.sandia.gov slave for 'ride' looks like it has 16 executors from looking at:

?

That has a very strange curve for "available executors". It seems unlikely that your ascic-jenkins.sandia.gov jobs will experience the crashes that we have been experiencing on jenkins-srn.sandia.gov documented here:

Let me know if you have any questions about creating this CUDA PR build based on our cuda-9.2-release-debug build that we have been running for a long time.

bartlettroscoe commented 5 years ago

FYI: #4123 would have been avoided if this CUDA PR build had been running. I lost at least 2+ hours last night and today on that one. Also, it looks like EMPIRE pulled that version of Trilinos (because I don't think they have a CUDA build yet).

bartlettroscoe commented 5 years ago

CC: @rppawlo, @fryeguy52

@trilinos/framework,

As per https://github.com/trilinos/Trilinos/pull/4146#issuecomment-452097215, can we make sure that the CUDA PR build getting set up on 'ride' sets:

set(Kokkos_ENABLE_Profiling OFF CACHE BOOL "")

?

That will catch issues like #4145 before they can break ATDM Trilinos builds.

mhoemmen commented 5 years ago

@bartlettroscoe Did you mean "ON CACHE BOOL"?

mhoemmen commented 5 years ago

Wait, why is ATDM turning off Kokkos profiling?!?

bartlettroscoe commented 5 years ago

@mhoemmen said:

Wait, why is ATDM turning off Kokkos profiling?!?

Don't know. Need to find out from the EMPIRE developers. Note that the current native SPARC configuration for Trilinos does NOT set Kokkos_ENABLE_Profiling=OFF.

I will create a new ATDM Trilinos GitHub issue to see about removing Kokkos_ENABLE_Profiling=OFF from the ATDM Trilinos configuration.

bartlettroscoe commented 5 years ago

CC: @srajama1

@trilinos/framework, please make sure the new CUDA PR build on 'ride' enables the ShyLU_DD package. As of today, it should be 100% clean in that CUDA 9.2 build on 'ride' (see #3541). We need the CUDA PR build to protect the ShyLU_DD package's CUDA build.

bartlettroscoe commented 5 years ago

FYI: While reviewing PR #4332, I just noticed that there is now a Trilinos_pullrequest_cuda_9.2-83 build as shown in https://github.com/trilinos/Trilinos/pull/4332#issuecomment-460972709. This is a PR that changes Teuchos, so it should test everything downstream in Trilinos. Looking at the configure output on CDash here, we can see the following explicit disables:

Explicitly disabled packages on input (by user or by default):  Claps SEACAS Trios Komplex TriKota Moertel PyTrilinos NewPackage 8

This shows that SEACAS is being disabled. Since SEACAS plays a critical role in ATDM, we need to get SEACAS enabled ASAP to protect ATDM and other important customers. This build should 100% match the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt, which builds and tests SEACAS just fine as shown, for example, today here.

As I told @jwillenbring, I will look into what the problem with this CUDA PR build is and get it to match the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt and then create a PR to enable SEACAS again.

bartlettroscoe commented 5 years ago

FYI: I am working on fixing the problems in the Trilinos CUDA PR build. A few things I am seeing right away:

I will run the configures and builds and get these to match up. Once I have everything matched up and the correct set of test disables added, I will post a PR to merge in this configuration. I will also provide detailed instructions on how I did this so others can copy this process in the future for future PR builds.

bartlettroscoe commented 5 years ago

I just posted PR #4592, which fixes the CUDA PR build. It enables all of SEACAS and STK (and all their tests) and it enables all 160 Panzer BASIC tests. And they all pass (except for one recent known STK test failure with CUDA described in #4551). The process I used to fix the build was pretty simple and is described in detail below (so that others can follow a similar process in the future). With all the configure iterations, it took about 4 hours to complete this matching (mostly because the configure is slow on the NFS-mounted drive on 'ride'). Once I got the configure diffs to match up, the build and tests ran right out of the box (with the one expected failing STK test).


Details on the process to fix the CUDA PR build to match the ATDM Trilinos build

First, I set up the build dir:

/home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/cuda-9.2-gnu-7.2.0-release-debug-pt/

with the files load-env.sh:

source /home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh \
  cuda-9.2-gnu-7.2.0-release-debug-pt

and do-configure:

cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DKokkos_ENABLE_Profiling=ON \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_TEST_CATEGORIES=BASIC \
-DTrilinos_TRACE_ADD_TEST=ON \
-DDART_TESTING_TIMEOUT:STRING=600.0 \
-DTrilinos_ENABLE_CONFIGURE_TIMING=ON \
"$@" \
/home/rabartl/Trilinos.base/Trilinos

(NOTE: I set some of these options to better match the Trilinos CUDA PR build settings.)

I ran the base configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/cuda-9.2-gnu-7.2.0-release-debug-pt/

$ . load-env.sh
Hostname 'ride6' matches known ATDM host 'ride' and system 'ride'
Setting compiler and build options for buld name 'cuda-9.2-gnu-7.2.0-release-debug-pt'
Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37

$ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON \
  &> configure.out

real    6m34.854s
user    2m56.180s
sys     1m19.912s

(NOTE: The Trilinos CUDA PR build always sets Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON, which is actually not desirable if you just want to reproduce one package's build.)

I then set up a configure and build directory for the Trilinos CUDA PR configuration:

/home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

with files load-env.sh:

module purge
export WORKSPACE=/home/rabartl/Trilinos.base
source /home/rabartl/Trilinos.base/Trilinos/cmake/std/sems/PullRequestCuda9.2TestingEnv.sh

and do-configure:

cmake \
-GNinja \
-C /home/rabartl/Trilinos.base/Trilinos/cmake/std/PullRequestLinuxCuda9.2TestingSettings.cmake \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DTrilinos_TRACE_ADD_TEST=ON \
-DDART_TESTING_TIMEOUT:STRING=600.0 \
"$@" \
/home/rabartl/Trilinos.base/Trilinos

I ran the configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

$ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out

real    4m58.255s
user    2m35.155s
sys     0m26.741s

I created the script create_normalized_cmake_output_files.sh:

#!/bin/bash

# Get build-dir name from argument
BUILD_DIR_NAME=$1
echo "BUILD_DIR_NAME='${BUILD_DIR_NAME}'"

set -x

cat CMakeCache.txt | grep -v "^$" | grep -v "^//" | grep -v "^#" | sort \
  > CMakeCache.normalized.txt

~/Trilinos.base/Trilinos/commonTools/refactoring/token-replace.pl \
  ${BUILD_DIR_NAME} GENERIC_BUILD_DIR \
   CMakeCache.normalized.txt CMakeCache.normalized.txt

~/Trilinos.base/Trilinos/commonTools/refactoring/token-replace.pl \
  ${BUILD_DIR_NAME} GENERIC_BUILD_DIR \
   configure.out configure.normalized.out

I then ran it as:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ cd cuda-9.2-gnu-7.2.0-release-debug-pt/

$ ../../../create_normalized_cmake_output_files.sh cuda-9.2-gnu-7.2.0-release-debug-pt

$ cd ..

$ cd pull-request-cuda-9.2/

$ ../../../create_normalized_cmake_output_files.sh pull-request-cuda-9.2

$ cd ..
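
As an aside, if token-replace.pl were not available, plain sed can perform the same token normalization. Here is a sketch (the token value and the sample file contents below are illustrative, not real build output):

```shell
# Normalize a build-dir token to GENERIC_BUILD_DIR with sed, so that
# cache files from two different build trees diff cleanly.
BUILD_DIR_NAME="pull-request-cuda-9.2"   # example token (illustrative)

# Fabricate a tiny sample cache line containing the token.
printf 'DIR:PATH=/home/user/%s/x\n' "${BUILD_DIR_NAME}" > CMakeCache.sample.txt

# Replace every occurrence of the token with the generic placeholder.
sed -e "s|${BUILD_DIR_NAME}|GENERIC_BUILD_DIR|g" \
  CMakeCache.sample.txt > CMakeCache.normalized.sample.txt

cat CMakeCache.normalized.sample.txt   # -> DIR:PATH=/home/user/GENERIC_BUILD_DIR/x
```

The `|` delimiter avoids having to escape the `/` characters that build-dir paths typically contain.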

I then compare the two sets of files with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ diff \
    cuda-9.2-gnu-7.2.0-release-debug-pt/configure.normalized.out \
    pull-request-cuda-9.2/configure.normalized.out \
  | less

$ diff \
    cuda-9.2-gnu-7.2.0-release-debug-pt/CMakeCache.normalized.txt \
    pull-request-cuda-9.2/CMakeCache.normalized.txt \
  | less

One difference I noted was:

74c50
< Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota NewPackage 4
---
> Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota PyTrilinos NewPackage 5

It is not necessary to explicitly disable PyTrilinos in this context because it is not a Primary Tested package (so it would not get enabled). But that should be harmless as it will not impact the final set of enabled and non-enabled SE Packages and TPLs.

Using the script configure-pr-build-and-diff.sh:

#!/bin/bash

cd pull-request-cuda-9.2/

. load-env.sh

rm -r CMake*
time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out

../../../create_normalized_cmake_output_files.sh pull-request-cuda-9.2

cd ..

diff \
  cuda-9.2-gnu-7.2.0-release-debug-pt/configure.normalized.out \
  pull-request-cuda-9.2/configure.normalized.out \
  | less

diff \
  cuda-9.2-gnu-7.2.0-release-debug-pt/CMakeCache.normalized.txt \
  pull-request-cuda-9.2/CMakeCache.normalized.txt \
  | less

I did several iterations of modifying the file PullRequestLinuxCuda9.2TestingSettings.cmake and running:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ configure-pr-build-and-diff.sh

and carefully inspecting the diffs until I got them pretty close. The final diffs are shown in:

The diffs that remained should not impact what builds and what passes and fails.

I then did a full build and ran the test suite on 'ride' using the script run_all.sh:

#!/bin/bash -e
. load-env.sh
rm -r CMake* || echo "no CMake files to remove!"
time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out
time ninja -j64 &> make.out
time ctest -j8 &> ctest.out

and I ran this with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

$ bsub -x -Is -q rhel7F -n 16 ./run_all.sh

***Forced exclusive execution
Job <854522> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on ride12>>
rm: cannot remove ‘CMake*’: No such file or directory
no CMake files to remove!

real    4m19.775s
user    2m36.281s
sys     0m36.791s

real    172m17.763s
user    8615m25.206s
sys     865m48.777s

That gave the test results:

99% tests passed, 1 tests failed out of 2936

Subproject Time Summary:
Amesos                    =  27.54 sec*proc (13 tests)
Amesos2                   =  35.31 sec*proc (8 tests)
Anasazi                   = 332.37 sec*proc (74 tests)
AztecOO                   =  26.23 sec*proc (17 tests)
Belos                     = 415.65 sec*proc (100 tests)
Domi                      = 232.67 sec*proc (125 tests)
Epetra                    =  85.45 sec*proc (63 tests)
EpetraExt                 =  26.28 sec*proc (10 tests)
FEI                       =  43.59 sec*proc (43 tests)
Galeri                    =  12.38 sec*proc (9 tests)
GlobiPack                 =   2.83 sec*proc (6 tests)
Ifpack                    =  99.66 sec*proc (48 tests)
Ifpack2                   = 363.30 sec*proc (45 tests)
Intrepid                  = 383.97 sec*proc (143 tests)
Intrepid2                 = 544.01 sec*proc (267 tests)
Isorropia                 =  13.20 sec*proc (6 tests)
Kokkos                    = 170.39 sec*proc (27 tests)
KokkosKernels             = 167.29 sec*proc (8 tests)
ML                        =  75.86 sec*proc (34 tests)
MiniTensor                =   3.52 sec*proc (2 tests)
MueLu                     = 2782.78 sec*proc (105 tests)
NOX                       = 290.93 sec*proc (106 tests)
OptiPack                  =   7.93 sec*proc (5 tests)
Panzer                    = 8737.40 sec*proc (163 tests)
Phalanx                   =  19.35 sec*proc (27 tests)
Pike                      =   3.77 sec*proc (7 tests)
Piro                      =  47.12 sec*proc (13 tests)
ROL                       = 1306.07 sec*proc (164 tests)
RTOp                      =  19.57 sec*proc (24 tests)
Rythmos                   =  68.87 sec*proc (83 tests)
SEACAS                    =  22.87 sec*proc (23 tests)
STK                       =  95.71 sec*proc (15 tests)
Sacado                    = 169.57 sec*proc (300 tests)
Shards                    =   1.42 sec*proc (4 tests)
ShyLU_DD                  = 300.03 sec*proc (37 tests)
Stokhos                   = 156.84 sec*proc (84 tests)
Stratimikos               =  37.45 sec*proc (39 tests)
Teko                      = 592.89 sec*proc (18 tests)
Tempus                    = 402.99 sec*proc (80 tests)
Teuchos                   = 161.02 sec*proc (137 tests)
Thyra                     = 102.25 sec*proc (82 tests)
Tpetra                    = 1158.23 sec*proc (201 tests)
TrilinosCouplings         =  32.43 sec*proc (22 tests)
TrilinosFrameworkTests    =   5.50 sec*proc (4 tests)
Triutils                  =   3.83 sec*proc (2 tests)
Xpetra                    = 262.52 sec*proc (18 tests)
Zoltan                    = 345.53 sec*proc (14 tests)
Zoltan2                   = 553.05 sec*proc (111 tests)

Total Test time (real) = 2654.55 sec

The following tests FAILED:
    2036 - STKUnit_tests_stk_ngp_test_utest_MPI_4 (Failed)
Errors while running CTest

See, we now have 23 SEACAS tests, 15 STK tests and 163 Panzer tests! Before there were only 60 Panzer tests as shown, for example, in this recent CUDA PR build and no SEACAS or STK tests.

The only failing test was STKUnit_tests_stk_ngp_test_utest_MPI_4 which is already known to be failing as described in #4551. Therefore, I added a disable for that test as well. To check that disable I did a new configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/
$ . load-env.sh
$ cmake . &> configure.reconfig.out

which showed:

$ grep STKUnit_tests_stk_ngp_test_utest_MPI_4 configure.reconfig.out 
-- STKUnit_tests_stk_ngp_test_utest_MPI_4: Added test (BASIC, NUM_MPI_PROCS=4, PROCESSORS=4)!
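
For reference, such a per-test disable can typically be expressed in the PR *.cmake settings file with a TriBITS `<fullTestName>_DISABLE` cache variable. This is only a sketch; the exact mechanism used in PR #4592 may differ:

```cmake
# Hedged sketch: TriBITS supports disabling an individual test by
# setting <fullTestName>_DISABLE in the configure options file.
set(STKUnit_tests_stk_ngp_test_utest_MPI_4_DISABLE ON CACHE BOOL
  "Known STK CUDA failure, tracked in #4551")
```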

I then cleaned up the commits and created the PR #4592.

alanw0 commented 5 years ago

As I said on the other issue, we will try to get a stk update in, to fix the failing test, asap.

bartlettroscoe commented 5 years ago

@trilinos/framework,

With more code being enabled in the CUDA PR build due to the changes in PR #4592, the build times have gone up a lot, as shown in the PR build #4592 on CDash. That shows a build time of 4h 49s! But if you look at the history of the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt that this build is supposed to be duplicating, you can see that over the last 3 weeks the build times are just under 3 hours.

This must mean that the number of build processes is not correct. If you look at the Jenkins build for these at:

and look at the output for example at:

you can see:

03:03:07 -- CTEST_BUILD_FLAGS='-j64 -k 999999'
...
03:03:07 -- CTEST_PARALLEL_LEVEL='8'

So you want to set 64 build processes and 8 parallel ctest MPI processes (i.e. ctest -j8).
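
In a TriBITS-style CTest -S driver, those two knobs correspond to the variables shown in the log output above. A sketch of the intended values (variable names taken from the CTEST_BUILD_FLAGS / CTEST_PARALLEL_LEVEL lines in the log):

```cmake
# Sketch of the intended CTest -S driver settings for 'ride':
set(CTEST_BUILD_FLAGS "-j64 -k 999999")  # 64 parallel build jobs
set(CTEST_PARALLEL_LEVEL 8)              # only 8 concurrent tests (ctest -j8)
```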

bartlettroscoe commented 5 years ago

CC: @dridzal, @rppawlo

@trilinos/framework,

The updated CUDA PR build in PR #4592 failed with 6 failing tests: 1 timing-out ROL test, 3 timing-out Panzer tests, and 2 failing Panzer tests that show CUDA allocation "out of memory" failures. This is likely due to using too high a parallel level with ctest -j<N>. From looking at the Jenkins output for this build here, it showed:

Parallel level           = 29

If that means it is using ctest -j29, that is way too high. This needs to be lowered to ctest -j8 as described above. That will increase the test wall-clock time a little, but it will result in all passing tests.

Can someone on the @trilinos/framework team please update this CUDA PR build to use 64 parallel build processes and only 8 parallel ctest processes on 'ride'? That will result in a total wall-clock time of a little over 4 hours in the worst case, whereas now the CUDA PR build looks like it takes almost 5 hours and results in failing and timing-out tests.

bartlettroscoe commented 5 years ago

@trilinos/framework, is this done done? Has the parallel test level been reduced to 8 (or so) and have all the temp disables been removed?

bartlettroscoe commented 5 years ago

This has been done for a while. The CUDA PR build looks to be one of the most robust PR builds being used. Closing as complete.

nmhamster commented 5 years ago

Hooray!