trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Set up a CUDA build for an auto PR build #2464

Closed bartlettroscoe closed 5 years ago

bartlettroscoe commented 6 years ago

CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott, @nmhamster

Description

This Issue is to scope out and track efforts to set up a CUDA build of Trilinos to be used as an auto PR build as described in https://github.com/trilinos/Trilinos/issues/2317#issuecomment-376551457.

For this build it was agreed to use the ATDM build on white that is currently running and submitting to CDash. Questions about how to extend this build to be used as an auto PR build include:

Tasks:

  1. Clean up the existing CUDA build on white until it is 100% clean [Done]
  2. Set up an all-at-once nightly build that enables all PT packages and submits to CDash "Specialized" [Done]
  3. Clean up the all-at-once nightly build for all PT packages (disable whatever should be disabled) ...
  4. ???

Related Issues:

bartlettroscoe commented 6 years ago

FYI: We asked @nmhamster about using the rhel7F nodes on white for auto PR builds and he said we could try this.

Note that the target for this build should be the Trilinos-atdm-white-ride-cuda-debug build and not the Trilinos-atdm-white-ride-cuda-opt build due to the large number of segfaulting tests on the latter build described in #2454. I will focus on cleaning up the cuda-debug build as there are just a few failing tests at this point.

Other than setting up the Jenkins job and cleaning up any Trilinos failures with this setup, my biggest concern is the stability of the Jenkins jobs on 'white'. For example, if you look at the history of the nightly build Trilinos-atdm-white-ride-cuda-debug on white shown at:

it only gets through all 25 of the packages (using the package-by-package method) about half the time. That is not great reliability for an auto PR build.

Perhaps the all-at-once configure, build, test, and submit will be more robust? Our nightly build will tell that story.

mhoemmen commented 6 years ago

I would like a CUDA build mainly to prevent people from breaking CUDA and walking away. On the other hand, I would rather have PR testing without CUDA than no PR testing :-).

bartlettroscoe commented 6 years ago

I would like a CUDA build mainly to prevent people from breaking CUDA and walking away. On the other hand, I would rather have PR testing without CUDA than no PR testing :-).

As we discussed at the meeting last Thursday, getting the CUDA build set up will not block moving to the new auto PR system as the way to push to Trilinos. The only build that has to be right is the GCC 4.8.4 to replace and protect the current CI build (see #2462). I may work to set up this CUDA build as a post-push CI build until running jobs on 'white' is stable enough for an auto PR build. But the problem is, when the post-push CUDA CI build breaks, who is going to make sure it gets cleaned up ASAP? Right now, that looks to be me so I am super motivated to get a CUDA build running as part of auto PR testing.

We need to write up the transition plan for moving to the auto PR system so there is no confusion about things like this.

mhoemmen commented 6 years ago

I will also invest time in Tpetra-related CUDA issues and other issues that my ATDM and Sierra customers care about.

bartlettroscoe commented 6 years ago

NOTE: The cuda-debug failures that occurred in the ATDM builds of Trilinos described in #2471 when KOKKOS_ENABLE_DEBUG=ON was set are more motivation for this auto PR CUDA build to be a cuda-debug build and not a cuda-opt build. That is, we should have debug-mode checking enabled. This is not a performance build of Trilinos but a correctness build.

bartlettroscoe commented 6 years ago

Note that with #2471 now resolved, the only impediment to using the ATDM Trilinos CUDA-debug build on 'white' as an auto PR build is to get the bsub command to stop terminating early on 'white'. I am meeting with Nathan G. on the Test Bed team today to discuss this problem.

bartlettroscoe commented 6 years ago

Status update ...

The Trilinos-atdm-white-ride-cuda-debug-all-at-once build is 100% clean and was promoted to the "ATDM" CDash Group/Track on 4/3/2018 and completed all 25 packages today.

I set up an all-at-once version of this build in:

and I fired it off to submit to CDash.

We will see how long an all-at-once build for this cuda-debug build takes.

bartlettroscoe commented 6 years ago

FYI: I set up an all-at-once cuda-debug build of Trilinos for all 53 Primary Tested packages on 'white' and 'ride' submitting to CDash as the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once as shown at:

This currently fails the configure of ShyLU_Node as shown at:

showing:

Processing enabled package: ShyLU_Node (Tacho, Tests, Examples)
CMake Error at packages/shylu/shylu_node/tacho/CMakeLists.txt:8 (MESSAGE):
  ShyLu/Tacho requires CUDA relocatable device code to be enabled if CUDA is
  enabled.  Set: Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON

The current ATDM build of Trilinos does not enable ShyLU so those builds are not showing that configure failure.

But note that SPARC does use ShyLU_Node (or at least some of its subpackages get enabled).

Therefore, for now, I would recommend that we disable ShyLU_Node in this initial PT CUDA build of Trilinos targeted for PR testing (but not disable ShyLU_Node or anything else in the other auto PR builds). Getting something up is better than nothing for auto PR testing.
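
For concreteness, a minimal sketch of the kind of configure-time disable being proposed (only the two relevant options are shown; the paths and the other options are placeholders):

$ cd <trilinos-build-dir>/
$ cmake \
    -D TPL_ENABLE_CUDA=ON \
    -D Trilinos_ENABLE_ShyLU_Node=OFF \
    <other-configure-options> \
    <trilinos-src-dir>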

@william76 and @jwillenbring, do you agree?

srajama1 commented 6 years ago

@bartlettroscoe : Nope. Tacho is the primary test case for Kokkos tasking. It has exposed a lot of subtle issues before. I wouldn't disable it, but I would set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON, which is a requirement set by Kokkos. I believe we need tasking for 1 out of the 2 ATDM applications.

ibaned commented 6 years ago

@srajama1 there is a bug in the NVIDIA linker which prevents many Trilinos packages from compiling with relocatable device code on (with tests enabled). This will probably limit how much of Trilinos we are able to test in that configuration.

srajama1 commented 6 years ago

@ibaned : I didn't know about this linker bug. Do you know which feature of Kokkos tasking requires relocatable device code?

ibaned commented 6 years ago

@srajama1 I have been told the entire system known as "Kokkos tasking" (e.g. the spawn-based system) requires relocatable device code. I don't know in detail what parts would break if we don't have it enabled. This is a difficult trade-off, but at the moment that is the situation. I think @rppawlo has built reasonable subsets of Trilinos with relocatable device code, but maybe not with tests enabled.

srajama1 commented 6 years ago

Ah, I believe most of the use cases that are needed by this PR (PR testing on GPUs for ATDM apps) would have been covered by @rppawlo . I believe we should be able to use relocatable device code in that case, assuming tests work.

bartlettroscoe commented 6 years ago

As of now, the EMPIRE configuration of Trilinos (which this current ATDM Trilinos configuration is matching) does NOT set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON (hence the error shown above).

But it looks like some of the SPARC configurations of Trilinos do set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON as shown by:

$ cd <sparc-tpl-base-dir>/
$ grep -nH Kokkos_ENABLE_Cuda_Relocatable_Device_Code *.sh
do-cmake_trilinos_cee-gpu_cuda_gcc_openmpi.sh:158:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=OFF \
do-cmake_trilinos_ride-gpu_gcc_cuda_openmpi.sh:147:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON \
do-cmake_trilinos_shiller-gpu_gcc_cuda_openmpi.sh:157:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON \

@micahahoward,

Is SPARC really using relocatable device code with Kokkos on 'ride'? Does this work with all of the Trilinos packages currently used by SPARC?

@rppawlo and @nmhamster,

Is it worth trying to set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON in this full CUDA configuration of Trilinos, which includes Phalanx, Panzer, and other packages (that are not used by SPARC), on 'white'/'ride'?

rppawlo commented 6 years ago

I have only built up to phalanx testing with relocatable device code enabled. It was for experimenting with the device DAG support for assembly in phalanx. I did not test panzer, Tpetra or the linear solver stack as this only involved assembly. We only came across one real build issue in sacado due to a static variable for the kokkos memory pool. We needed an ifdef to change the static declaration depending on whether RDC was enabled. Everything else seemed to work fine. I would give it a shot.

micahahoward commented 6 years ago

Short answer: no on using RDC with SPARC.

We have issues with RDC in SPARC. I've backed this off in our Trilinos config but haven't pushed those changes to our sparc/Trilinos repo.

bartlettroscoe commented 6 years ago

@micahahoward said:

Short answer: no on using RDC with SPARC.

We have issues with RDC in SPARC. I've backed this off in our Trilinos config but haven't pushed those changes to our sparc/Trilinos repo.

Okay, so SPARC and EMPIRE don't enable RDC for Trilinos. That means that this auto-PR CUDA build should not enable it either. (But someone is free to set up another CUDA build that does enable RDC if they wish.)

@micahahoward,

Can you point me to the current SPARC Trilinos configuration scripts offline so that I can see how the SPARC Trilinos configuration is enabling some ShyLU_Node code but is not enabling Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON?

mhoemmen commented 6 years ago

@bartlettroscoe SPARC doesn't actually set the Kokkos flag in question. I think it defaults to OFF. I will e-mail you an example script.

srajama1 commented 6 years ago

@bartlettroscoe : How do you work out the difference between the two app configurations currently?

bartlettroscoe commented 6 years ago

@srajama1 asked:

How do you work out the difference between the two app configurations currently?

The plan is to merge the configurations but allow different sets of packages to be enabled. For example, SPARC is currently using ROL and ShyLU but EMPIRE is not. And EMPIRE is currently using Phalanx and Panzer but SPARC is not. But if SPARC or EMPIRE enables Tpetra or MueLu, for example, then these package configurations will be identical. That is the goal. And that will allow us to have one set of nightly builds of Trilinos 'develop' instead of two. So far, I have not seen anything that should stop us from having a single set of shared configurations of Trilinos for SPARC and EMPIRE (and therefore a single set of configurations used by @nmhamster's performance team, etc.).

bartlettroscoe commented 6 years ago

After PR #2601 was merged, the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once which includes all of the 53 Primary Tested packages of Trilinos passed the configure but shows build and runtime failures for several packages as you can see at, for example:

This shows build failures for the packages:

And it shows test failures for the packages:

Since ATDM APPs don't currently use these packages, I don't know if it is worth the time to get these cleaned up before getting a CUDA auto-PR build running. I am happy to keep this build running so that someone else can clean up these failures (in another GitHub Issue) but I don't think this should be blocker for getting an initial auto PR CUDA build running.

Therefore, I would like to propose that we first set up an auto PR build that only allows the building of the ATDM packages of Trilinos plus the other PT packages that are currently clean (because why not).

Also, we should switch this build to run only faster BASIC tests and not NIGHTLY tests.
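
As a sketch, and assuming the standard TriBITS test-category cache variable, switching to only the faster BASIC tests would just be a configure option like:

$ cmake \
    -D Trilinos_TEST_CATEGORIES=BASIC \
    <other-configure-options> \
    <trilinos-src-dir>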

Any objections to that plan?

ibaned commented 6 years ago

@bartlettroscoe I don't mind disabling these packages for a CUDA build, although developers should maybe be made aware that their packages are broken on CUDA. Also, we should mirror this in the testing of Trilinos that we do for Kokkos releases, in particular we should only test the packages that you enable in the PR CUDA build.

bartlettroscoe commented 6 years ago

@ibaned said:

@bartlettroscoe I don't mind disabling these packages for a CUDA build, although developers should maybe be made aware that their packages are broken on CUDA.

I created the issue #2620 and @mentioned all of the package teams or people associated with these packages. If someone wants to pursue cleaning up the failures for those packages, then I think that is in everyone's best interest. It is just that we have limited ATDM funds to set up these builds, and getting these packages cleaned up in CUDA builds is not an ATDM priority.

Also, we should mirror this in the testing of Trilinos that we do for Kokkos releases, in particular we should only test the packages that you enable in the PR CUDA build.

It would be great if the Kokkos updates could be tested with the ATDM builds of Trilinos, which are documented at:

In fact, the checkin-test-atdm.sh script can be run on each machine to test every package downstream from Kokkos.
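
A hedged example of what that could look like (the build-name argument and the pass-through option here are assumptions and should be checked against the script's documentation):

$ cd <some-scratch-build-dir>/
$ ln -s <trilinos-src-dir>/cmake/std/atdm/checkin-test-atdm.sh .
$ ./checkin-test-atdm.sh cuda-debug --local-do-all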

I can provide more details to help with this. Not all of these builds are 100% clean yet (and the 'opt' builds on 'white' and 'ride' have a bunch of segfaulting tests as noted in #2454). But selecting the subset of builds that are being kept clean is as easy as looking on CDash.

bartlettroscoe commented 6 years ago

FYI: Having an auto PR build that just builds the code and does not run tests would not have caught #2650 that just occurred yesterday.

bartlettroscoe commented 6 years ago

CC: @trilinos/framework

If we focus on just the all-at-once build with the 25 packages that ATDM is currently testing on 'white', you can see in the query:

that it completes the tests more often than not. In fact, that query today shows history for 23 builds and, out of those, 18 completed the tests. That is a completion ratio of 0.78.

It just occurred to me yesterday that, with the new split ctest -S driver support in the TRIBITS_CTEST_DRIVER() function documented here, we could put a loop around a second ctest -S invocation that just runs the tests. If the bsub command aborted before it finished running the tests, we would run it again. With a 0.78 probability of the bsub command completing and getting back test results, that is a 1-(1-0.78)^2 = 0.95 probability of two attempts giving back test results. That sounds like good odds to me. If it did not complete the second time, then we give up and just call the CUDA build passed if the first ctest -S invocation (the one that does the build) passed. Or we could make just one attempt to run the tests and, if that failed, fall back to the build result. Given that the actual build completes in under 2:20, the extra 50m run of the tests puts the average total time at just around 3:10.
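
Roughly, the idea looks like this (a sketch only: the driver-script name is a placeholder and the CTEST_DO_* option spellings should be checked against the TRIBITS_CTEST_DRIVER() documentation):

# First ctest -S invocation: configure + build (+ submit), but no tests.
env CTEST_DO_TEST=FALSE \
  ctest -V -S <atdm-ctest-driver>.cmake

# Second, test-only invocation, retried once if bsub kills the first attempt.
for attempt in 1 2 ; do
  env CTEST_DO_CONFIGURE=FALSE CTEST_DO_BUILD=FALSE CTEST_DO_TEST=TRUE \
    ctest -V -S <atdm-ctest-driver>.cmake \
  && break
done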

The current PR build that tests all of the packages takes around 3:00, so this ATDM CUDA build is right in the ball park.

Anyway, that is an idea of a way to allow us to use 'white' for a fairly reliable PR build given what we have right now. Should we just go ahead and do this?

bartlettroscoe commented 6 years ago

Given that we have not been able to get tests to run and submit to CDash reliably on 'white' due to the LSF bsub command randomly crashing (see TRIL-198), we need to look into other options for a way to get an auto PR CUDA build set up.

There is a new Intel Broadwell + GPU (P100s) cluster getting set up on the SRN called Doom. This machine has 30 nodes with two sockets per node and 18 cores per socket (i.e. 36 total cores per node). It has 4 NVIDIA P100s (14K GPU cores per node) and 512 GB of RAM per node. It is using SLURM for job scheduling. It is currently in early testing mode. The plan is for it to be available to run jobs using the ascic-jenkins site (not sure of the status of that yet). I don't think anyone has done a CUDA build of Trilinos there yet, but this is a basic Linux + GPU machine so it should not be very hard to set up a working Trilinos CUDA build there.

We have been told that we can try using Doom to set up and run an auto PR CUDA build of Trilinos. That machine currently has plenty of capacity to run auto PR CUDA builds for Trilinos.

The only issue with using Doom is that it is an SRN system and therefore Sandia Trilinos developers who can't access the SRN (e.g. non-US postdocs) would not be able to log onto that system and reproduce builds and tests and fix failures. I talked with @crtrott on 4/2/2018 and he said that he could give accounts to these few non-US developers on his x86 + GPU SON machine 'apollo' and the SON machine 'kokkos-dev' in order to reproduce CUDA failures (if they can't on their own machines, or other x86 + GPU SON machines). @crtrott said that about 95% of failures produced on this Doom machine could likely be reproduced on these other x86 + GPU SON machines.

The longer-term plan that @crtrott mentioned is that Kokkos could purchase and set up a new 'kokkos-dev2' machine and then the old 'kokkos-dev' machine could be used as an official resource on the SON in order to help developers reproduce CUDA builds and even do some pre-push development and testing for these non-US Trilinos developers.

@crtrott, is the above consistent with your memory of our conversation?

bartlettroscoe commented 6 years ago

Some follow up ...

It looks like the SRN Doom cluster is not a viable option right now. This is a very new system and there does not seem to be a plan to put a Jenkins client on it soon. Therefore, we could not use it for auto PR testing.

Another option that has been presented to us is some x86 + GPU nodes on the ascic-jenkins build farm which are used for Sierra testing. We have been allocated 2 nodes to use for Trilinos testing. @jwillenbring, if you can give me access to create Jenkins jobs on ascic-jenkins, I can give a CUDA build based on the ATDM Trilinos configuration a try.

bartlettroscoe commented 6 years ago

FYI: The LSF system on 'white' stopped processing jobs on 5/12/2018. I pinged 'white-help' and they restarted them. But there were three days worth of jobs in the queue. I killed the later jobs for 5/13 and 5/14 and now they are submitting to CDash.

One more negative to trying to run an auto PR tester on 'white'.

bartlettroscoe commented 6 years ago

FYI: Some data on testing

Some of our struggles in setting up the ATDM builds of Trilinos come from getting several tests to run at the same time and use up the available cores on the node (this is done to reduce the wall-clock time to run the tests). But this has caused challenges that some of the application codes (like SPARC) don't have, since they just run their tests one at a time with ctest (which is equivalent to ctest -j1). For the Trilinos test suite, though, it would be very expensive to run the tests one at a time. As an example, consider the CUDA tests for the build Trilinos-atdm-white-ride-cuda-debug-all-at-once shown at:

That shows a wall-clock time of 50m 17s for running the tests with a modest parallel level of ctest -j8.

The 1791 tests run are shown at:

If you add up all of the raw test times, it gives 10,851.92 seconds, which is just over 3 hours.

Therefore, if you ran this test suite with ctest -j1, then the test time could go up from 50 min to 3 hours!

We just can't afford to run Trilinos tests sequentially.
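
In ctest terms, the comparison above is simply between these two invocations from the build directory:

$ ctest -j1   # serial: roughly the ~3 hours of summed raw test time
$ ctest -j8   # modest parallelism: the observed 50m 17s of wall-clock time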

bartlettroscoe commented 6 years ago

FYI:

There was a discussion in issue #2827 suggesting that we needed to run tests with CUDA with ctest -j1 since that would avoid random failures and timeouts. I ran some timing experiments comparing ctest -j1 and ctest -j8 for the CUDA 8.0 builds on 'hansen'/'shiller' and 'white'/'ride' detailed in https://github.com/trilinos/Trilinos/issues/2827#issuecomment-393541395 and https://github.com/trilinos/Trilinos/issues/2827#issuecomment-394032043. The conclusion of the analysis of those experiments is that while running the tests in parallel with ctest -j8 does increase the wall-clock time for some CUDA-bound tests, the increase in runtimes was not that significant (between 18 and 50% on average) while the decrease in overall wall-clock time was significant. Below is the table of the wall-clock times for running the ATDM Trilinos test suites with ctest -j1 vs. ctest -j8 for two different cuda-debug builds on 'white' and 'hansen' (two very different machines).

Build                                          ctest -j1    ctest -j8
Trilinos-atdm-white-ride-cuda-debug            3h 25m       1h 14m
Trilinos-atdm-hansen-shiller-cuda-8.0-debug    6h 6m 24s    1h 44m 32s

We might be able to tolerate an auto PR build that takes 1h 44m to run tests, but we likely can't tolerate one that takes 3h 25m and certainly can't tolerate one that takes 6h to run. The CUDA build times alone vary from 2h 15m on 'white' to 5h 58m on 'hansen'.

This also shows that this cuda-debug build on 'white' with a build time of 2h 15m and a test run time of 1h 14m is well within the times for the other auto PR builds. We can tolerate a < 4 h build and test just fine because the new Intel PR builds take upwards of 4.5 hours to build and run tests as shown, for example, at:

Darn it, if we could just get the 'bsub' command to stop randomly crashing on 'white', we would have a great CUDA PR build right now! And we could use it too, because if this white cuda-debug build had been in the set of auto PR builds, we could have avoided the problems with the last Kokkos update detailed in #2827.

bartlettroscoe commented 6 years ago

FYI: The recent defect issue #2921 caused runtime failures in all of the ATDM CUDA builds. Therefore, any CUDA auto PR build that runs the tests would have caught this and stopped it from being merged to 'develop' until it got fixed first.

I am going to go ahead and set up a post-push CI build on 'white' that posts to the "Continuous" CDash group and sends out emails. At least that way we will catch these CUDA breakages before we go into nightly testing.

bartlettroscoe commented 6 years ago

I am going to go ahead and set up a post-push CI build on 'white' that posts to the "Continuous" CDash group and sends out emails. At least that way we will catch these CUDA breakages before we go into nightly testing.

But before I can do that, we need to address all of the current failures impacting the cuda-debug build on 'white' in the open issues #2921, #2920 and #2827.

bartlettroscoe commented 6 years ago

@prwolfe,

@jwillenbring said you are actively working on a CUDA build for auto PR testing. If this includes all of the Primary Tested packages in Trilinos, you are likely going to see the build failures shown at:

It would be good to create Trilinos GitHub issues for these build failures so that developers have time to clean up the issues before the auto PR build goes live (and to get that build to go live you will have to disable several packages and tests).

bartlettroscoe commented 6 years ago

FYI: Good news, by some unknown process, it looks like the full CUDA 8.0 build of all of the primary-tested packages in Trilinos is getting slightly cleaner. As shown in this query, the number of build errors has been reduced from 11 to 3. The most recent build today shown here shows just one build error each for the packages Zoltan, Stokhos, and ROL. We should create Trilinos GitHub issues for these failures so we can get these cleaned up so we can promote this build.

kddevin commented 6 years ago

I am having difficulty reproducing this error on white. Can you point me to the instructions for reproducing this build (which modules loaded, etc.)? Thanks.

bartlettroscoe commented 6 years ago

@kddevin said:

I am having difficulty reproducing this error on white. Can you point me to the instructions for reproducing this build (which modules loaded, etc.)? Thanks.

Information about the ATDM Trilinos builds, including reproducibility info, can be found starting at:

I will create a proper Trilinos GitHub Issue with exact commands to reproduce according to the process described at:

bartlettroscoe commented 6 years ago

@kddevin said:

I am having difficulty reproducing this error on white. Can you point me to the instructions for reproducing this build (which modules loaded, etc.)? Thanks.

I created issue #3065 that should provide exact reproducibility instructions on 'white' and 'ride' (I reproduced the build error myself on 'white' using those commands).

kddevin commented 6 years ago

Thanks, @bartlettroscoe . I'll give it a try and update these issues when I have some info.

kddevin commented 6 years ago

#3065 is fixed and Zoltan is working.

bartlettroscoe commented 5 years ago

FYI: After the upgrade to CUDA 9.2 and a new LSF system on 'white', we now see no 'bsub' crashes while running tests, but we do see 'bsub' crashing while running the build about 10% of the time. You can see this in this query for the build Trilinos-atdm-white-ride-cuda-9.2-debug-pt on 'white' running since 8/10/2018. Out of 43 days, it failed to return build and test results 4 times. I would like to argue that that is likely good enough for a Trilinos PR build if we make the Trilinos PR system ignore the CUDA build when it reports configure results but not test results. Having the CUDA build run and check branches 90% of the time is a 10x improvement in our ability to catch PR branches that break the CUDA build before merging to 'develop'. Yes, that build takes upwards of 4.25 hours of wall-clock time, but having a CUDA build of all of the Primary Tested packages in Trilinos would be huge.

bartlettroscoe commented 5 years ago

FYI: I talked with @nmhamster in detail about the prospect of using 'white' as a Trilinos PR build machine for CUDA. He said that it is actually a very good idea to use a Power system to test CUDA since that is more useful for ATDM than testing on a generic x86 system. Also, he mentioned that the sister machine 'ride' on the SRN is mostly unloaded now since SPARC moved off of 'ride'. Therefore, we could actually run Trilinos CUDA PR builds on 'ride' and then developers could reproduce builds and tests on 'white' (since it is identical).

Now we just need to solve the problem of 'bsub' crashing 10% of the time during the builds. @nmhamster suggested backing off the build level from the current ninja -j128 to ninja -j64. He says that we are likely not going to see any speedup in building above 64 build processes. I will make that change and see what impact this has on build times and the robustness of the build (i.e., does it eliminate 'bsub' crashes during builds?).
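
In terms of the driver settings, the change amounts to lowering the parallel build flags handed to Ninja; the variable below is the one reported in the ctest -S driver output (where exactly it gets set in the ATDM scripts is not shown here):

# Previously -j128; per @nmhamster's suggestion, drop to -j64:
CTEST_BUILD_FLAGS="-j64 -k 999999"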

jwillenbring commented 5 years ago

@bartlettroscoe

This sounds like potentially a good path-forward. I will mention this at the stand-up this morning. Even if we had to back off to -j32, I still think this is a good solution.

jwillenbring commented 5 years ago

I would like to argue that that is likely good enough for a Trilinos PR build if we make the Trilinos PR system ignore the CUDA build when it reports configure results but not test results

@bartlettroscoe

I would be concerned about that because then if a CUDA issue snuck through, it would block people from pushing 90% of the time. I think failing and retrying would be better.

bartlettroscoe commented 5 years ago

I would be concerned about that because then if a CUDA issue snuck through, it would block people from pushing 90% of the time. I think failing and retrying would be better.

@jwillenbring, that is a concern, but if that happens we just need to be diligent and quickly revert the merge commit that caused the breakage. That is easy to detect by running a post-push CI build like we have been doing for the GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build for many years. We know within about 30 minutes when someone has broken the build on the 'develop' branch (because the post-push CI build does an incremental rebuild which is generally very fast compared to starting from scratch).

But note that it will be very rare that we have a PR that both breaks the CUDA build and where the CUDA PR build does not return test results. If we assume that a PR breaks the CUDA build once a week, then with a 90% success rate, on average we would only allow a PR build that breaks the CUDA PR build to skip through once every 2 months. Currently, every PR that breaks the CUDA build gets through! Let's not let perfect be the enemy of 10x better.

So now the next task is to clean up the current Trilinos-atdm-white-ride-cuda-9.2-debug-pt build until we get 100% passing tests. Once we do that, we can set up a post-push CI build that runs on 'ride' as the next step. That way we can catch merged PRs that break the CUDA build ASAP and back them out immediately before they impact the nightly builds and ATDM customers.

bartlettroscoe commented 5 years ago

FYI: The PR #3506 that changes from ninja -j128 to ninja -j64 on 'white' and 'ride' was merged last night and was used in the ATDM Trilinos builds on 'white' and 'ride' today.

You can see that in the build on 'white':

which showed:

00:22:18 -- CTEST_BUILD_FLAGS='-j64 -k 999999'

and you can see this on the build on 'ride':

which showed:

22:17:13 -- CTEST_BUILD_FLAGS='-j64 -k 999999'

And looking at this query, as @nmhamster predicted, we see almost no meaningful slowdown in the build times for the build Trilinos-atdm-white-ride-cuda-9.2-debug-pt on 'white' and 'ride' (the build time is right around 3h 5m).

Now we need to get this build cleaned up ASAP so that we can set up a post-push CI build posting to CDash. Actually, we could set up a post-push CI build that posts to "Experimental" for now and then use that to get fast feedback on fixing the failures throughout the day.

bartlettroscoe commented 5 years ago

CC: @trilinos/framework, @trilinos/shylu, @trilinos/stokhos, @trilinos/rol, @trilinos/stk, @trilinos/trilinoscouplings, @trilinos/piro

FYI: I have created new Trilinos GitHub issues for all of the failures in the current Trilinos-atdm-white-ride-cuda-9.2-debug-pt build on 'white' and 'ride' today shown here which are:

Once this build is 100% cleaned up (by fixing or disabling things), then I will set up a post-push CI build ASAP so that we can start to get this clean. Then the @trilinos/framework team can add this as a Trilinos PR build and we can finally protect CUDA builds for all Primary Tested packages before the merge to 'develop'!

However, one major concern that I have with making this exact build a Trilinos PR build is that we would likely need to change the PR testing logic to trigger the testing of all Trilinos packages when the ATDM Trilinos configuration changes. We worked very hard in #3133 to update the system so that changes under Trilinos/cmake/std/atdm/ would not trigger a global enable and testing of all Trilinos packages. My main concern here is not the time that it takes to run these builds but the instability of the auto PR tester described in part in #3276. For example, just today, the PRs #3546 and #3549, which just changed files under Trilinos/cmake/std/atdm/, failed several times due to various failures with the auto PR testing system that had nothing to do with the PR branches. But because, currently, no packages are enabled or tested, I could just babysit these PRs, adding AT: RETEST over and over, until they passed and got merged. But if these changes had triggered the build and testing of all of the Trilinos PT packages, then it would have taken all day and into tomorrow (or more) to get them merged to 'develop' (like what happened with other PRs in the past before #3133 was complete). That would destroy productivity in developing the ATDM Trilinos configuration settings.

Therefore, once the Trilinos failures in this build are all cleaned up, I would like to propose that we copy out the settings for this CUDA build from Trilinos/cmake/std/atdm/ into files under Trilinos/cmake/std/ and then have the Trilinos auto PR build use those and still ignore any changes under Trilinos/cmake/std/atdm/. I think the risks of breaking the ATDM Trilinos CUDA builds due to divergence of these two configurations are outweighed by avoiding the loss of productivity in the development and integration of the official ATDM Trilinos configurations.
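
A hypothetical illustration of that proposal (the settings file name below is invented for this sketch): the PR driver would pre-load a snapshot of the cuda-debug options kept directly under Trilinos/cmake/std/ rather than reading anything under Trilinos/cmake/std/atdm/:

$ cmake \
    -C <trilinos-src-dir>/cmake/std/PullRequestCuda9.2DebugSettings.cmake \
    <other-configure-options> \
    <trilinos-src-dir>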

Sound good? Concerns?

dridzal commented 5 years ago

@bartlettroscoe , are there instructions for getting accounts on white and ride, which are necessary to fix these issues?

etphipp commented 5 years ago

Request in WebCARS.

dridzal commented 5 years ago

Apparently, it takes a while for the accounts to activate after they're approved.

bartlettroscoe commented 5 years ago

@dridzal asked:

@bartlettroscoe , are there instructions for getting accounts on white and ride, which are necessary to fix these issues?

It is just WebCARS. The machine 'white' is on the SON, so any SNL employee should be able to get access. And the machine 'ride' is on the SRN, so green-card employees should use 'ride' since that machine has less traffic these days.