trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.21k stars 566 forks source link

Select set of builds for initial mandatory auto PR testing process #2317

Closed bartlettroscoe closed 6 years ago

bartlettroscoe commented 6 years ago

CC: @trilinos/framework

Next Action Status

This ship has sailed on "Initial" a long time ago. The only remaining build in the CUDA PR build and that is being tracked in #2464.

Description

This story is to involve interested members of the Trilinos team to collaborate to select the best build configurations in order to allow the new Trilinos auto PR testing and merging tool and process (#1155) to become the mandatory way to test and push to the main Trilinos ‘develop’ branch (#2312). This selection must be done considering that there are limited computing resources (on the Jenkins build farms) to run automated builds. So while we would like to run many different useful builds as part of pre-push automated PR testing, we have to be strategic about what builds we run where to not overwhelm current capacity. As more build machines are added to the Jenkins build farm, more builds can be added to the auto PR testing process.

This story is a follow on from the action item:

in the 2018-02-26 Trilinos Planning Meeting.

The set of Trilinos team members interested in being part of this discussion (and meetings) include:

Since this selection of these builds will impact every Trilinos developer and every close customer and collaboration of Trilinos, it is important that we get input from many different people in making this selection.

Definition of Done

Related Issues

Task

  1. Find initial selection of team members interested to discuss this topic [DONE]
  2. Set up and have meeting with working group [DONE]
  3. Create initial set of builds in meeting [DONE]
  4. Create concrete *.cmake files for each proposed configuration and set up nightly builds sumitting to "Specialized" CDash Track/Group: a. GCC 4.8.4: See #2462 b. Intel 17.x: See #2463 c. CUDA: See #2464
  5. ???
bartlettroscoe commented 6 years ago

I sent the following email. I will give it until the week of 3/12/2-18 after the SIAM PP conference to have this meeting. That should be urgent enough.


To: sandia-trilinos-developers@software.sandia.gov Cc: trilinos-framework@software.sandia.gov Subject: [Trilinos-Framework] Interested in selection of builds used in mandatory auto PR testing and merge process?

Hello Sandia Trilinos Developers,

Following up on the action item I was given at the Trilinos Developers Planning meeting on Monday with notes listed at:

https://docs.google.com/document/d/1JClLSR3n79XJT_yLPPoQv_e3rJToHxl8fUzRYojtY5Y

I am going to organize a discussion on the selection of the set of builds the auto PR testing process should be using before it becomes the required way to test and push to the Trilinos ‘develop’ branch.

Therefore, if you are interested in being part of this discussion, please add your name in the first comment in the existing list under the “Description” section in the new issue:

https://github.com/trilinos/Trilinos/issues/2317

and then I will try to organize a meeting of all the interested parties.

Since this selection of build will impact every Trilinos developer and every close customer and collaboration of Trilinos, it is important that we get input from many different people in making this selection.

Thanks,

-Ross

csiefer2 commented 6 years ago

Any machine used for PR testing must be accessible (either directly or by an identical system) by all developers so they can fix issues that PR testing exposes.

mhoemmen commented 6 years ago

I'm with @csiefer2 -- for example, if we intend to support Windows builds, the Windows Dashboard build should be a VM that we can access (either log into, or download and run on our own) and use for testing.

bartlettroscoe commented 6 years ago

I don't think there is no way that we would make a Windows build a auto PR build or even a CI build anytime soon. There are so many other builds that more important than that.

But I think the rule should be that before any build can enter the "Nightly", "Clean", or "Auto PR" category, that it must be easy for any Sandia staff member to access a machine where the build can be exactly reproduced. I agree with @csiefer2 and @mhoemmen on this point.

bartlettroscoe commented 6 years ago

FYI: @trilinos/muelu

It appears that the current auto PR testing GCC 4.8.4 posting to CDash never runs any MueLu tests as shown at:

Not sure why that is because the standard CI build set up for last 1.5 has been running MueLu tests without any issues (except for real failures) as shown in recent history at:

bartlettroscoe commented 6 years ago

FYI: @trilinos/anasazi

It also appears that the auto PR builds are not running any Anazazi tests either as shown by:

bartlettroscoe commented 6 years ago

FYI: @trilinos/ifpack2

It also appears that the auto PR builds are not running any Ifpack2 tests as shown by:

jwillenbring commented 6 years ago

@bartlettroscoe Thank you for the reminder about the MueLu, Ifpack2, and Anasazi tests. These packages had test failures a while back and were disabled for that reason (needed 100% passing). Some of these issues I believe have been resolved. I am testing that now.

bartlettroscoe commented 6 years ago

Thank you for the reminder about the MueLu, Ifpack2, and Anasazi tests. These packages had test failures a while back and were disabled for that reason (needed 100% passing). Some of these issues I believe have been resolved. I am testing that now.

@jwillenbring,

Just the failing tests should be disabled, not all the tests for a package. You write a GitHub issue for the failing test(s), then disable just those tests. Generally one wants to disable with a scapulae, not a machete.

jwillenbring commented 6 years ago

Just the failing tests should be disabled, not all the tests for a package. You write a GitHub issue for the failing test(s), then disable just those tests. Generally one wants to disable with a scapulae, not a machete.

Agreed. At the time I didn't want the issues to be lost and not notice that the tests were disabled. Much of this has to do with ETI.

bartlettroscoe commented 6 years ago

At the time I didn't want the issues to be lost and not notice that the tests were disabled. Much of this has to do with ETI.

If the developers of those packages don't care to fix the broken tests for GitHub issues associated with their packages then that is on them. And you only disable the tests just for that particular build, not for all builds. Look at the git log file for cmake/std/BasicCiTestingSettings.cmake to see how to do this. These tests will still show up failing in "Clean", "Nightly" and other builds but at least they will not mess up other developers PR testing. That way, no one will forget these failed tests. But if they let the tests fail for more than a day or two, then these tests need to be disabled in "Clean" and "Nightly" as well.

It is important to fix this quickly because developers might be thinking that the auto PR builds are testing Ifpack2, Anasasi, and MueLu but they are not. This means that the auto PR builds have even a higher chance of breaking tests in these packages.

mhoemmen commented 6 years ago

Wait, what?!? Ifpack2 and MueLu tests were totally disabled?!?

bartlettroscoe commented 6 years ago

Hello, it is after SIAM PP. Can we set up a meeting on this?

bartlettroscoe commented 6 years ago

@jwillenbring and/or @bmpersc,

Do you guys want to be included in this meeting? If so, please add your GitHub IDs to the list in the above Description field. I will then try to set up a meeting using people's SNL calendars to select a time that works for everyone. (Or if I can't find a time, I will set up a dreaded doodle.com poll.)

bartlettroscoe commented 6 years ago

One issue to consider is that according to:

GCC 4.8.4 should implement OpenMP 3.1 while GCC 4.9.3 should implement OpenMP 4.0.

Is that an issue for testing OpenMP code with Trilinos? How much difference is there in the Kokkos implementation of threading of OpenMP 3.1 vs OpenMP 4.0? Or is this not a concern because updates to Kokkos do testing with many different compilers before merging to Trilinos 'develop'?

If we are not worried about the OpenMP 3.1 vs. OpenMP 4.0 issue, then a single GCC 4.8.4 build for auto PR testing would seem to be okay.

nmhamster commented 6 years ago

@bartlettroscoe - OpenMP3.1 would be good for now because we have a range of platforms where this is the most up to date complete supported standard.

bartlettroscoe commented 6 years ago

@trilinos/shylu developers,

Why are there no tests being run for any of the ShyLU packages in the existing auto PR and CI builds:

?

These packages are listed a Primary Tested (PT) packages but they don't have any tests?

ndellingwood commented 6 years ago

@bartlettroscoe each package in each of ShyLU's subpackages is marked either ST or EX:

shylu_dd

shylu_node

I'm not sure why they are being listed as PT packages, the use of

TRIBITS_PACKAGE_DEFINE_DEPENDENCIES(
  SUBPACKAGES_DIRS_CLASSIFICATIONS_OPTREQS
...

in the shylu_node and shylu_dd subpackages seems to fit the example from tribits dev guide example. Does anything stand out that clearly needs to be changed?

Opening a new issue after speaking with @srajama1 to get packages migrated to PT testing, #2375

Edit: Added reference to ShyLU issue.

bartlettroscoe commented 6 years ago

NOTE: Defects like #2391 show why we need to be enabling OpenMP in our auto PR build.

bartlettroscoe commented 6 years ago

CC: @trilinos/zoltan2

I was just reminded today by #2397 that the auto PR builds based on SEMS should also enable the Scotch TPL. That CI build (and the builds derived from it) are the only automated builds of Trilinos that enable Scotch and therefore are the only builds that run these tests that depend on Scotch. See #2065.

bartlettroscoe commented 6 years ago

Got feedback from @bathmatt and @micahahoward on the builds they would like to see to help protect ATDM APP SPARC and EMPIRE usage of Trilinos. The below email thread provides a converssation. In short, they would like to see following builds tested in the auto PR tester (in ATDM Trilinos build keywords terms):

@bathmatt and @micahahoward,

Would it be just a good if gnu and intel switched and we added debug-mode checking to the CMAKE_BUILD_TYPE=RELEASE to catch more undefined memory usage problems with:

?

That would allow people to run with debug-mode checking and OpenMP on machines that did not have an Intel compiler, or there was an Intel license sever problem (which happens from time to time). Is there something better about testing OpenMP with Intel than with GCC?

Also, is testing with a serial node (i.e. no Kokkos threading) really an important use case to test pre-merge to 'develop'? Is this a debugging tool in case something goes wrong with the OpenMP Kokkos node, even with OMP_NUM_THREADS=1? If OpenMP works, why would you not use that build for development, debugging, or running problems?

Detailed email thread (Click to expand) From: Howard, Micah Sent: Wednesday, March 21, 2018 2:17 PM To: Bartlett, Roscoe A Subject: Re: mandatory PR testing.. Yes that's okay with me. ---- From: "Bartlett, Roscoe A" Date: Wednesday, March 21, 2018 at 11:58 AM To: "Howard, Micah", "Bettencourt, Matthew" Subject: RE: mandatory PR testing.. Micah and Matt, The nightly ATDM builds of Trilinos we are setting up will cover all of these (and will completely cover them for ATDM once we include the stuff needed for SPARC as well). The issue is what can we afford to run for PR testing prior to merge. Once that becomes the mandatory way to test and push to Trilinos 'develop', there will be a lot of active PR branches at a time and that is going to likely require 10-20 builds on some days. We can manage that on the current Ted Bed or other hardware for that many different build configurations. The reduced list of three builds below may be doable for PR testing before updating the 'develop' branch. But there may not be enough computational resources to run even these three builds. I will ask the various ATDM DevOps supporting teams where we can find stable reliable resources to run these builds. (FYI: 'white' and 'ride' seem pretty unstable right now for running larger jobs with 'bsub'.) But no one seems to know how loaded these machines really are and there are no tools set up to really tell you that. After cleaning up a little, is okay If I archive this email chain in: https://github.com/trilinos/Trilinos/issues/2317 ? I think it would be good to get you guys on record with what you think will be valuable for ATDM. Thanks, -Ross --- From: Howard, Micah Sent: Wednesday, March 21, 2018 1:50 PM To: Bettencourt, Matthew ; Bartlett, Roscoe A Subject: Re: mandatory PR testing.. Sounds reasonable. For PR acceptance, note that: DEBUG-gcc-serial -> is my (4) with DEBUG REL-icc-openmp -> is my (2) with REL rel-cuda -> is my (3) with REL And arguably my (1) and (4) are redundant. --- From: "Bettencourt, Matthew" Date: Wednesday, March 21, 2018 at 11:45 AM To: "Howard, Micah", "Bartlett, Roscoe A" Subject: Re: mandatory PR testing.. I think we may want this level of testing for dev->master, my only concern with this level of testing is that it will be a long time between PR and acceptance into develop. My ideal situation is that we have a more rapid push into develop so garbage doesn't end up there, and an automated nightly upgrade to master based on Micah's suggestion. With this I'd use master for most everything, develop for co-develop work or short term bug fix testing (like the bug Roger found today) on my desktop and we could have rapid turn around. Therefore anything that blocks promotion from devel->master would be all hands on deck moment, things that break PR should be handled at that level. ---- On 03/21/2018 11:40 AM, Howard, Micah wrote: I sent this to Ross when I declined his meeting for tomorrow (I have a conflict too) 1) A surrogate for Trinity/HSW : intel-17.0.4, mpich2-3.2, Kokkos/Serial 2) A surrogate for Trinity/KNL: intel-17.04, mpich2-3.2, Kokkos/OpenMP 3) A surrogate for Sierra(ATS-2): gcc-7.2.0, openmpi-2.1 (I think), Kokkos/CUDA-9.x 3b) I'd like to have coverage for IBM's XL compiler and MPI along with CUDA-9.x but this dev stack is not stable yet 4) A surrogate for a 'standard' Linux env (this encompasses CEE, TLCC2 & CTS-1 systems): gcc or clang, openmpi, Kokkos/Serial, specific versions of gcc or clang and openmpi TBD So very similar to what Matt suggested. I'd add that REL and DEBUG are both useful for these cases. Complete testing with DEBUG for all cases might be asking a bit much but ensure DEBUG builds aren't broken is useful. Micah --- On 3/21/18, 11:26 AM, "Bettencourt, Matthew" wrote: Ross, I've got a conflict for you meeting on thur, I'd suggest DEBUG-gcc-serial REL-icc-openmp rel-cuda At least 1+3 since they are very different,
nmhamster commented 6 years ago

@bartlettroscoe - running with OpenMP enabled but intended to do MPI-only (i.e. OMP_NUM_THREADS=1) has been show to add fairly serious performance overhead because of reduced compiler optimization and lower efficiency (the OpenMP runtime is outlining functions which are not needed in the MPI only runs). For systems like ATS1/Haswell partition, we would strongly recommend MPI only (i.e. Serial) as the fastest configuration.

@bartlettroscoe - in general if we use a sufficiently new GCC, the difference in supported OpenMP features is fairly low. Using older GCC means that Intel would be a better choice.

bartlettroscoe commented 6 years ago

@nmhamster,

For systems like ATS1/Haswell partition, we would strongly recommend MPI only (i.e. Serial) as the fastest configuration.

Okay, so testing with a serial non-threaded Kokkos node is important to test. Is the compiler in that case always going to be some variant of GNU?

in general if we use a sufficiently new GCC, the difference in supported OpenMP features is fairly low.

I think GCC used for PR testing needs to be GCC 4.8.4, because that is the minimum required version of GCC for Kokkos. Is that considered "old" from an OpenMP perspective compared to say Intel 15.0..2 (which I think is the minimum version of Intel that Kokkos currently supports).

So which is better:

or

from this ATDM perspective?

nmhamster commented 6 years ago

@bartlettroscoe

Okay, so testing with a serial non-threaded Kokkos node is important to test. Is the compiler in that case always going to be some variant of GNU?

More likely to be Intel on ATS1 or CTS platforms. Rarely use GNU in production environments.

I think GCC used for PR testing needs to be GCC 4.8.4, because that is the minimum required version of GCC for Kokkos. Is that considered "old" from an OpenMP perspective compared to say Intel 15.0..2 (which I think is the minimum version of Intel that Kokkos currently supports).

Both are quite old but they have a fairly consistent set of OpenMP features. The more useful case is Intel 17.4 (with GCC 4.9.3 header files) because this is the current configuration used for production builds. Obviously, over time this will change but updates are infrequent.

nmhamster commented 6 years ago

@bartlettroscoe - the Intel comment relates to ATS1, for ATS2 then GCC is more likely. So its quite complicated as we move forward and have to support many platforms. Sorry to be a pain.

bartlettroscoe commented 6 years ago

@nmhamster,

More likely to be Intel on ATS1 or CTS platforms. Rarely use GNU in production environments.

So that means that intel-opt-serial is a more useful build to help support ATDM than gnu-opt-serial, right?

Remember, this is just the auto PR builds, not all of the builds. We are setting up automated ATDM-focused builds of Trilinos on every system and with every configuration that is important to ATDM. It is up to us in ATDM to decide what those builds should be.

nmhamster commented 6 years ago

@bartlettroscoe

So that means that intel-opt-serial is a more useful build to help support ATDM than gnu-opt-serial, right?

Yes, I think this would be more useful.

Remember, this is just the auto PR builds, not all of the builds. We are setting up automated ATDM-focused builds of Trilinos on every system and with every configuration that is important to ATDM. It is up to us in ATDM to decide what those builds should be.

Understood.

micahahoward commented 6 years ago

Not sure I can offer much that hasn't already been said. Let me know if you need anything from me.

mhoemmen commented 6 years ago

@bartlettroscoe The only issue with CUDA debug testing is that it's really REALLY slow (e.g., every Kokkos::View array access gets checked on the GPU), so you may need to increase the time-out.

bartlettroscoe commented 6 years ago

@bartlettroscoe The only issue with CUDA debug testing is that it's really REALLY slow (e.g., every Kokkos::View array access gets checked on the GPU), so you may need to increase the time-out.

@mhoemmen, or just perhaps only run the faster BASIC test suite? But I think the CUDA build should be cuda-opt, not cuda-debug. ATDM is running full cuda-debug and there are just a few issues with Panzer tests timing out (see #2318).

mhoemmen commented 6 years ago

@bartlettroscoe That's cool -- just note that the slowness comes from bounds checking and other Kokkos debug features, not from -g -O0. Thus, "release build with bounds checking on" will be nearly as slow as "debug build with bounds checking on."

bartlettroscoe commented 6 years ago

We had a meeting last Thursday to discuss the selection of the auto PR builds which involved the people @bartlettroscoe, @mhoemmen, @ibaned, @bmpersc, @crtrott, @jwillenbring, @rppawlo, @agsalin, and @william76. The builds we selected are shown in the Google Doc page:

In sujmary, we selected the following builds:

All of the builds have the common options:

I will create some new GitHub issues for each of these target builds to outline what needs to be done to get them ready to plug into the auto PR tester.

csiefer2 commented 6 years ago

I am uncomfortable testing with these options enabled:

Xpetra_ENABLE_Experimental=ON MueLu_ENABLE_Experimental=ON

We're still waiting on an explanation of why the apps want those...

bartlettroscoe commented 6 years ago

I am uncomfortable testing with these options enabled:

Xpetra_ENABLE_Experimental=ON MueLu_ENABLE_Experimental=ON

We're still waiting on an explanation of why the apps want those...

That is a negotiation that needs to take place between MueLu developers, ATDM APP developers and management. But as long as these are getting enabled in ATDM builds of Trilinos, we need to enable them in these builds as well to protect ATDM customers.

csiefer2 commented 6 years ago

Well consider my objection logged then :)

mhoemmen commented 6 years ago

@csiefer2 If it helps, I objected strenuously (alas, I don't think it will help, this sounds like one of those "SURPRISE WE'RE USING THIS NOW YOU'D BETTER SUPPORT IT" things).

csiefer2 commented 6 years ago

@mhoemmen Awww, thanks!

tawiesn commented 6 years ago

@csiefer2 @mhoemmen To make you feel more comfortable: it helps to grep through the Xpetra and MueLu source code for the Experimental flag.

Long in short: i want to encourage you to make the Experimental flag an Experimental flag, again ;-)

Happy easter holidays and greetings from Europe!

csiefer2 commented 6 years ago

@tawiesn : Matlab is not experimental. That's TPL guaded.

tawiesn commented 6 years ago

@csiefer2 That's right. It's marked to be experimental only in the ReleaseNotes.txt file. Even better!

csiefer2 commented 6 years ago

documentationisalwaysuptodate

mhoemmen commented 6 years ago

@tawiesn Thanks and happy Easter to you too! I'm not sure then why ATDM needs that flag. I would feel better if the ATDM build did not enable it, or if we could identify exactly what they need and make just that not experimental.

tawiesn commented 6 years ago

@mhoemmen

if we could identify exactly what they need and make just that not experimental.

I think that would be the right thing to do. And i am very sure it's not much work. But you have definitely more time/resources and the ability to run the tests than i have with my little tiny 8 year old computer.

If i have to make a guess: it's probably something in the MueLu Stratimikos interface which we use for years without problems and therefore can safely mark as non-experimental.

tawiesn commented 6 years ago

@mhoemmen @csiefer2

documentationisalwaysuptodate

Clean code principle: Don’t document your code. Code your documentation.

Believe it or not: that really works.

bartlettroscoe commented 6 years ago

FYI: @micahahoward last week told me that he things that using OpenMP with the Intel 17.x build would be better than with the GCC 4.8.4 build from an ATDM perspective.

Does anyone have an objection to making that change to the plan for auto PR builds?

bartlettroscoe commented 6 years ago

The EMPIRE build of Trilinos has removed the experimental Xpetra and MueLu enables and these have been removed from the ATDM and the CI build (see commits 7481c76 and 71c51f2). I updated the document:

to remove these options.

mhoemmen commented 6 years ago

@prwolfe @micahahoward Hooray! :D

bartlettroscoe commented 6 years ago

@crtrott and @nmhamster,

Since we can't turn on OpenMP in our auto PR builds yet due to the problem of pinning to the same cores as described in #2422, would a short-term alternative be to try to use Pthreads for threading with Kokkos? That is, use Pthread for the Kokkos host threading model?

Also, why does Kokkos not automatically use Pthread for threading when the TPL Pthread is enabled and no other threading model is specified (and OpenMP, CUDA, etc.) are not enabled or specified? You ca see that simply enabling the Pthread TPL does not turn on Kokkos usage of Pthread as shown in the recent auto PR build:

which shows:

Explicitly enabled TPLs on input (by user):  Pthread MPI BLAS LAPACK Boost ParMETIS Zlib HDF5 Netcdf SuperLU BoostLib DLlib 12

...

-- Setting KokkosCore_ENABLE_Pthread=ON since TPL_ENABLE_Pthread=ON

...

-- ****************** Kokkos Settings ******************
-- Execution Spaces
--   Device Parallel: None
--     Host Parallel: None
--       Host Serial: Serial
-- 
-- Architectures:
--     None
-- 
-- Enabled options
--   KOKKOS_ENABLE_SERIAL
--   KOKKOS_ENABLE_DEBUG
--   KOKKOS_ENABLE_PROFILING
--   KOKKOS_ENABLE_DEPRECATED_CODE

...

Therefore, enabling the Pthread TPL in this build does nothing, right?

mhoemmen commented 6 years ago

@bartlettroscoe wrote:

would a short-term alternative be to try to use Pthreads for threading with Kokkos?

Kokkos folks have been telling everyone not to use the Pthreads back-end for quite a while.

Also, why does Kokkos not automatically use Pthread for threading when the TPL Pthread is enabled and no other threading model is specified (and OpenMP, CUDA, etc.) are not enabled or specified?

The Pthread TPL gets automatically detected and enabled by default. (There are reasons to enable it that have nothing to do with thread parallelism.) Users found it unpleasantly surprising to get thread parallelism when they didn't ask for it explicitly. Thus, Kokkos and Tpetra historically did not enable the Pthreads back-end unless explicitly requested. Tpetra treats Kokkos::Threads as a last resort in terms of defaults.

More importantly, OpenMP is the only back-end that promises polite sharing of hardware with other libraries that use OpenMP. The Pthreads back-end does not share hardware resources nicely with other libraries that use threads (even Pthreads).

mhoemmen commented 6 years ago

@bartlettroscoe There is a bit of a fun history. "Kokkos 1.0" (Chris Baker's thread-parallel programming model) had the following back-ends originally:

The "OpenMPNode" came later and was actually an external contribution. Historically TBBNode didn't get exercised so much, because it required a TPL. CUDA was very poorly tested. This made TPINode a common thing for people to try. None of the Nodes performed particularly well in Tpetra until I specialized some of the Serial kernels. We eventually replaced Kokkos 1.0 with Kokkos 2.0 in Tpetra; the fact that Nodes still exist is mainly because there is no schedule for Trilinos 13.0 (we've given up & have plans to get rid of them by making them unnecessary).