Address basic stability of Trilinos 'develop' branch short-term

bartlettroscoe commented 7 years ago

Related to: #1362

Next Action Status:

The auto PR testing process (#1155) is deployed and is working fairly well to stabilize 'develop' (at least as good or better than the checkin-test-sems.sh script did). Further improvements will be worked in other issues.

Description:

This story is to discuss and decide how to address stability problems of the Trilinos 'develop' branch in the short term. I know there is a long-term plan to use a PR model (see #1155) but since there are no updates or ETA on that, we need to address stability issues faster than that.

Currently there have been a good deal of stability problems of the Trilinos 'develop' branch, even with the basic CI build linked to from:

https://github.com/trilinos/Trilinos/wiki/Policies--%7C-Testing#post_push_ci_testing

and the "Clean" builds shown here:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&display=project&filtercount=2&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Clean&field2=buildstarttime&compare2=84&value2=now

The "Clean" builds have never been clean in the entire history of the track.

Some very recent examples of failures causing this are described in #1290 and #1301. These have broken the standard CI build and the "Clean" builds continuously since May 4 (and it is still broken as I type this).

We need a strategy to improve stability right now. I have been helping people set up to use the checkin-test-sems.sh script to test and push their changes. I would estimate that a large percentage of the failures (and 100% of the CI failures) seen on CDash would be avoided by usage of the checkin-test-sems.sh script.

CC: @trilinos/framework

jhux2 commented 7 years ago

Some very recent examples of failures causing this are described in #1290 and #1301. These have broken the standard CI build and the "Clean" builds continuously since May 4 (and it is still broken as I type this).

Just to clarify, issue #1290 is fixed, but I am leaving the report open until the dashboard is clean. #1301 was opened on May 8.

bartlettroscoe commented 7 years ago

Just to clarify, issue #1290 is fixed, but I am leaving the report open until the dashboard is clean. #1301 was opened on May 8.

These were just recent examples. There are other older significant breakages that we can reference here as well.

lucbv commented 7 years ago

@bartlettroscoe, it seems fine to me to use the checkin-test-sems.sh script to do my tests and push but I have two remarks regarding it: 1) so far I have not been able to pass arguments properly with it. For instance I had tests timing out for a while and passing --ctest-timeout=600 as argument or setting it in local-checkin-test-defaults.py did not change the timeout constraint (it remained 300s). As result the only way I could pass these test was to run with -j 1 (which suprisingly works) making everything even slower... 2) even though I use the checkin script and caught a few mistakes before pushing, I was not able to test all configurations and still broke a few things on CDash...

Is there an option to automatically quarantine commits until they no longer break things on CDash?

bartlettroscoe commented 7 years ago

so far I have not been able to pass arguments properly with it. For instance I had tests timing out for a while and passing --ctest-timeout=600 as argument or setting it in local-checkin-test-defaults.py did not change the timeout constraint (it remained 300s). As result the only way I could pass these test was to run with -j 1 (which suprisingly works) making everything even slower...

Okay, we need to figure this out. First, having to increase the timeout to 600s or change to -j1 is a bad sign. What is it about your machine or your env that would have you need to do this? I can contact you offline and we can work through those issues and discuss your situation and find a good way to address your personal situation. It should not take long.

even though I use the checkin script and caught a few mistakes before pushing, I was not able to test all configurations and still broke a few things on CDash

Of course a single build configuration is not going to catch everything. If it did, we would not need the other builds :-)

But I suspect that, historically, 50% or more of the new failures that show up on CDash would have been caught by the single CI build. And of the major Trilinos failures reported by ATDM customers and highlighted by them in open meetings over the past several months, most of these would have been caught by checkin-test-sems.sh. We can eliminate these embarrassing failures by just using checkin-test-sems.sh.

My experience over 15+ years has been that if your C++ code builds with high warning levels and with no warnings, and have good tests, and all tests pass with a recent GCC version, then you have few hard portability problems for other compilers on other platforms. NGA machines are a different issue of courses. For that, we really need to test some of these (like CUDA) as part of a PR testing model that should be getting set up in #1155. But given that Kokkos is supposed to create a portable abstraction for porting to these different NGA machines and envs, then there should be less porting problems for downstream packages.

Anyway, I will contact you to talk through your situation.

bartlettroscoe commented 7 years ago

FYI:

We finally have a 100% clean from-scratch CI build for Trilinos this morning shown here:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2887359&filtercount=3&showfilters=1&field1=groupname&compare1=61&value1=Continuous&field2=buildstarttime&compare2=84&value2=now&filtercombine=and

We had not had that since the morning of May 4 shown here:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2875809&filtercount=3&showfilters=1&field1=groupname&compare1=61&value1=Continuous&field2=buildstarttime&compare2=84&value2=now&filtercombine=and

That is 6 days. That is what we need to improve going forward in the short-term.

I really wish we could follow SST's lead and implement #1155 and force all pushes to Trilinos to use PRs and pass 100% tests before being merged. But since that story is not even started yet (and we have been discussing that for 9+months now), we need to make the most of what we have, and that is the checkin-test-sems.sh script.

mhoemmen commented 7 years ago

@lucbv wrote:

As result the only way I could pass these test was to run with -j 1 (which suprisingly works) making everything even slower...

Sometimes it helps to set OMP_NUM_THREADS to some reasonable value. I like 4 or 8 -- enough to make threads work, but not enough to hinder tests. Adjust as appropriate for your system.

bartlettroscoe commented 7 years ago

As result the only way I could pass these test was to run with -j 1 (which suprisingly works) making everything even slower...

Sometimes it helps to set OMP_NUM_THREADS to some reasonable value. I like 4 or 8 -- enough to make threads work, but not enough to hinder tests. Adjust as appropriate for your system.

If that is what is happening, then we likely need to add some code to checkin-test-sems.sh to set OMP_NUM_THREADS to a reasonable value that will be uniform for everyone. For a standard CI build, uniformity is a primary concern.

mhoemmen commented 7 years ago

If that is what is happening, then we likely need to add some code to checkin-test-sems.sh to set OMP_NUM_THREADS to a reasonable value that will be uniform for everyone.

I'm OK with that, as long as (a) it respects the test hardware, and (b) is > 1 by default (again, respecting the hardware). The latter will become more and more important as more of Trilinos gets thread parallelized. I'm debugging some thread-parallel code I want to add to Trilinos, so it's on my mind ;-) .

bartlettroscoe commented 7 years ago

If that is what is happening, then we likely need to add some code to checkin-test-sems.sh to set OMP_NUM_THREADS to a reasonable value that will be uniform for everyone.

I'm OK with that, as long as (a) it respects the test hardware, and (b) is > 1 by default (again, respecting the hardware). The latter will become more and more important as more of Trilinos gets thread parallelized. I'm debugging some thread-parallel code I want to add to Trilinos, so it's on my mind ;-) .

Actually, OpenMP is NOT enabled for the Standard CI build. You can see that on CDash, for example, at:

http://testing.sandia.gov/cdash/viewNotes.php?buildid=2887360##note2

which shows:

Trilinos_ENABLE_OpenMP:BOOL=OFF

What is enabled is Pthreads as shown, for example, at:

http://testing.sandia.gov/cdash/viewConfigure.php?buildid=2887360

which shows:

Processing enabled TPL: Pthread (enabled explicitly, disable with -DTPL_ENABLE_Pthread=OFF)

Does Kokkos use Pthreads for threading? If so, what controls how many thread Pthreads uses?

Should the standard CI build enable OpenMP instead? If so, how many threads should we set for OMP_NUM_THREADS for the standard CI build?

bartlettroscoe commented 7 years ago

And the Trilinos build is broken again:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2889905&filtercount=3&showfilters=1&field1=groupname&compare1=61&value1=Continuous&field2=buildstarttime&compare2=84&value2=now&filtercombine=and

That did not take long. I am looking at who pushed the breaking commits now ...

bartlettroscoe commented 7 years ago

@etphipp, looks like the latest CI build failures:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2889905&filtercount=3&showfilters=1&field1=groupname&compare1=61&value1=Continuous&field2=buildstarttime&compare2=84&value2=now&filtercombine=and

are due to your commit:

ad18536:  Sacado:  Replace if_then with if_then_else and add Fad overloads.
Author: Eric Phipps <etphipp@sandia.gov>
Date:   Thu May 11 10:37:00 2017 -0600

M   packages/sacado/src/Sacado_Fad_Ops.hpp
M   packages/sacado/src/Sacado_cmath.hpp

as shown at:

http://testing.sandia.gov/cdash/viewNotes.php?buildid=2889906##note1

Can you please fix or back out soon?

etphipp commented 7 years ago

Yes I am well aware of this and have been thinking of a fix. This is due to the inability of gcc 4.7.2 to properly compile c++11 code.

bartlettroscoe commented 7 years ago

Yes I am well aware of this and have been thinking of a fix. This is due to the inability of gcc 4.7.2 to properly compile c++11 code.

We will be switching the CI build to GCC 4.9.3 shortly here (see #1002). Can you test with GCC 4.9.3 before you push?

But note that there will still be "Clean" builds as shown at:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&date=2017-05-11

that use GCC 4.7.2. That is because SIERRA may require support for C++11 and GCC 4.7.2 for some time yet.

I guess keeping compatibility with these older compilers is really being driven by our major customers at Sandia outside of ATDM. So we have to keep these compilers working as well. Testing with the oldest GCC compiler that we need to continue to support before we push would seem like the best process.

etphipp commented 7 years ago

Yes I did test with gcc 4.9.3 before I pushed. I did not test with gcc 4.7 before pushing, and I won't in the future.

etphipp commented 7 years ago

The build error I introduced with gcc 4.7 in sacado is fixed.

bartlettroscoe commented 7 years ago

Yes I did test with gcc 4.9.3 before I pushed. I did not test with gcc 4.7 before pushing, and I won't in the future.

Then good thing I am switching the CI build to GCC 4.9.3 :-)

bartlettroscoe commented 7 years ago

@lucbv and I met last Friday. He should be all set up on his RHEL 7 Linux machine to run the checkin-test-sems.sh script to test and push using checkin-test-sems.sh --do-all --push. He was able to run tests with -j14 with success. But if he sees timeouts like he reported above, then he will let me know and we will work through this.

bartlettroscoe commented 7 years ago

Now we have 3 new failing ROL tests in the CI build as shown by:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2899117

which are:

ROL_example_burgers-control_example_04_MPI_1 Failed 1s 840ms Completed (Failed)
ROL_example_PDE-OPT_topo-opt_elasticity_example_01_MPI_4 Failed 10m 90ms Completed (Timeout)
ROL_test_sol_solSROMGenerator_MPI_1 Failed 10m 40ms Completed (Timeout)

@trilinos/rol

bartlettroscoe commented 7 years ago

And this morning we have a new failing Teko test:

Teko_DiagonallyScaledPreconditioner_MPI_1 Failed 1s 410ms Completed (Failed)

as shown by:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2899180

CC: @trilinos/teko

bartlettroscoe commented 7 years ago

Following on from above ...

Looking at the filing test output for Teko_DiagonallyScaledPreconditioner_MPI_1 Failed at:

http://testing.sandia.gov/cdash/testDetails.php?test=38800087&build=2899195

and at the commits pushed in this CI iteration, it looks like the breaking commit is likely:

commit 56bf6b4cb07c20ed6900407a51d0f9bf6805ce75
Author: Heidi K. Thornquist <hkthorn@sandia.gov>
Date:   Mon May 15 16:08:33 2017 -0600

    Update options for example to include pseudo-block GMRES and ortho options

    The example is more flexible with regards to the GMRES implementation and
    the orthogonalization method being used by GMRES.

 100.0% packages/belos/epetra/example/BlockGmres/

@hkthorn, can you please take a look at this test and see about fixing it?

eric-c-cyr commented 7 years ago

It seems that the tolerances are just being missed. My WAG is that the defaults used by stratimikos in Belos changed in some way.

bartlettroscoe commented 7 years ago

It seems that the tolerances are just being missed. My WAG is that the defaults used by stratimikos in Belos changed in some way.

Right. The tolerance test is just barely failing:


  *** Entering LinearOpTester<double,double>::compare(op1,op2,...) ...

  describe op1:
   Teko::PreconditionerLinearOp<double>{rangeDim=50,domainDim=50}
    [Operator] = "(absRowSum(  )))*(ANYM)": Thyra::DefaultMultipliedLinearOp<double>{numOps=2,rangeDim=50,domainDim=50}
      Constituent LinearOpBase objects for M = Op[0]*...*Op[numOps-1]:
       Op[0] = "absRowSum(  ))": Thyra::DefaultDiagonalLinearOp<double>
       Op[1] = Thyra::DefaultInverseLinearOp<double>{rangeDim=50,domainDim=50}:
         lows = Thyra::BelosLinearOpWithSolve<double>{iterativeSolver='"Belos::PseudoBlockGmresSolMgr": {Num Blocks: 300, Maximum Iterations: 1000, Maximum Restarts: 20, Convergence Tolerance: 1e-08}',fwdOp='Thyra::TpetraLinearOp<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >{rangeDim=50,domainDim=50}'}

  describe op2:
   Thyra::DefaultInverseLinearOp<double>{rangeDim=50,domainDim=50}:
    lows = Thyra::BelosLinearOpWithSolve<double>{iterativeSolver='"Belos::PseudoBlockGmresSolMgr": {Num Blocks: 300, Maximum Iterations: 1000, Maximum Restarts: 20, Convergence Tolerance: 1e-08}',fwdOp='Thyra::TpetraLinearOp<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >{rangeDim=50,domainDim=50}'}

  Checking that range and domain spaces are compatible ... 
  op1.domain()->isCompatible(*op2.domain()) ? passed

  op1.range()->isCompatible(*op2.range()) ? passed

  Checking that op1 == op2 ... 
  Checking that op1 and op2 produce the same results:

    0.5*op1*v1 == 0.5*op2*v1
    \________/    \________/
        v2            v3

     norm(v2-v3) ~= 0

  Random vector tests = 1

   Testing relative error between vectors v2.col[0] and v3.col[0]:
    ||v2.col[0]|| = 78.8876
    ||v3.col[0]|| = 78.8876
    Check: rel_err(v2.col[0],v3.col[0]) = 6.59022e-14 <= linear_properties_error_tol() = 5e-14 : FAILED

  Oh no, these two LinearOpBase objects seem to be different (see above failures)!

Should we just loosen the tolerance a little more?

hkthorn commented 7 years ago

@bartlettroscoe @eric-c-cyr The addition of a test is not causing this error. The DGKS orthogonalization, which is the default, was inappropriately and inefficiently applying ICGS2 orthogonalization.

bartlettroscoe commented 7 years ago

@hkthorn:

The addition of a test is not causing this error

What do you mean by "addition of a test"? That was an existing test that was passing in the previous CI iteration that was passing.

eric-c-cyr commented 7 years ago

I'm fine with relaxing the tolerance.

bartlettroscoe commented 7 years ago

@eric-c-cyr:

I'm fine with relaxing the tolerance.

Who wants to update this?

eric-c-cyr commented 7 years ago

I pushed something; however, I couldn't recreate the bug. Hopefully this works.

bartlettroscoe commented 7 years ago

I pushed something; however, I couldn't recreate the bug. Hopefully this works.

@eric-c-cyr, thanks for doing that. I will be pushing with the checkin-test-sems.sh script this morning so I will test if the issue is resolved as well. But with floating point diffs like this, I think we can see some variability between different Linux machines even if they are all using the exact same SEMS env.

bartlettroscoe commented 7 years ago

I will be pushing with the checkin-test-sems.sh script this morning so I will test if the issue is resolved as well.

My push went through and this Teko test did not diff and that Teko test is shown newly passing in the current CI iteration:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2902912

hkthorn commented 7 years ago

@eric-c-cyr @bartlettroscoe Thanks for loosening the tolerance. I could not recreate the error yesterday, but the Belos solver is converging to the specified tolerance of 1e-8, while the Teko test is using an absolute tolerance. The most reasonable path forward is loosening the Teko tolerance.

bartlettroscoe commented 7 years ago

@hkthorn:

I could not recreate the error yesterday

Just curious, but did you try to reproduce this on your Linux RHEL machine with the checkin-test-sem.sh script (e.g. using ./checkin-test-sems.sh --no-enable-fwd-packages --enable-packages=Teko --local-do-all) or did you use some other approach? It would be nice to know if the exact same SEMS env on different Linux machines results in this level off difference in floating point errors.

hkthorn commented 7 years ago

@bartlettroscoe Sorry, I was using my Mac laptop and a testing script.

bartlettroscoe commented 7 years ago

FYI: Last week, an SNL manager in charge of the Trilinos Framework effort said it could be a year before the pull-request testing for Trilinos described in #1155 gets implemented. Therefore, I think we have no choice but to continue to move forward with the usage of the checkin-test.py script to help stabilize the Trilinos 'develop' branch.

Related to this, I sat down with the ROL developers on Friday to show them how to run the checkin-test-sems.sh script and helped @dpkouri set up to run the remote pull/test/push process. (One of the side benefits of this was I showed them how to use SSH keys to access github to avoid having to type in HTTPS passwords and speed up git transfer operations.) I look forward to the opportunity to help other developers set this up as well.

Given that we may not have a PR implementation for some time and have to rely on the checkin-test.py script for the next year or so, I will put some time into making the checkin-test.py script a little nicer to run (e.g. by reducing the amount of STDOUT output) and improve the --help documentation (like I did for gitdist in TriBITSPub/TriBITS#132 a while back). That, together with setting up targeted Trilinos builds for ATDM in https://github.com/trilinos/Trilinos/milestone/45 will have to be good enough for the next year or so until the Trilinos Framework team can get a good PR workflow implemented.

eric-c-cyr commented 7 years ago

From an application perspective, what will be the stable branch we should use for the next year or so?

bartlettroscoe commented 7 years ago

From an application perspective, what will be the stable branch we should use for the next year or so?

In my opinion, that depends on the funding program, the application code, and the type of testing done for Trilinos w.r.t. that developer and customer sub-community. We need to have an in-depth conversation about how ATDM codes and apps, for example, should be interacting with the main Trilinos 'develop' and 'master' branches. I will bring this up at upcoming SART and ATDM meetings.

bartlettroscoe commented 7 years ago

Now Panzer is failing in the CI build. See #1374.

bartlettroscoe commented 7 years ago

Looks like the Panzer failure in #1374 would have been caught by using ./checkin-test-sems.sh --do-all --push to test and push (see https://github.com/trilinos/Trilinos/issues/1374#issuecomment-305332817 for details).

bartlettroscoe commented 7 years ago

FYI: I pushed f690932 so that no-one will have their pushes stopped due to this. See #1374 for more details.

bartlettroscoe commented 7 years ago

FYI: The CI build is once again clean (see https://github.com/trilinos/Trilinos/issues/1374#issuecomment-305806496).

NOTE: The fixing commits were pushed with the standard CI build enabling all downstream packages so success in cleaning up the post-push CI build was about guaranteed.

bartlettroscoe commented 7 years ago

FYI: I now have my personal CDash account set up so that I will get an email right away when the CI server reports a failure. That way, I can respond much faster where there is a CI failure due to a bad push and I can put in the needed pre-push disables so that this does not trip up others trying to use the checkin-test-sems.sh script to test and push.

bartlettroscoe commented 7 years ago

It seems that there is a view by some that upstream package developers should not be expected to test downstream packages like Panzer before pushing changes to the Trilinos 'develop' branch. If that is true, then what is the justification for doing this? Is it because of the long build times for Trilinos packages? Or is it because upstream package developers might get their push stopped because of failing downstream tests that they did not break? Is there some other reason?

The work that we are going through to improve the adoption of the checkin-test-sems.sh script fixes the second problem. And we are seeing a cleaner CI build and we are taking action more quickly to disable already failing downstream package tests to not block pushing changes to upstream packages.

That brings us to the issue of long build times. I would argue that if long build times are causing developers to disable downstream packages in their testing, then the right response is to put resources on improving the build times by working the stories in:

https://github.com/trilinos/Trilinos/milestone/22

and helping downstream package developers to use upstream packages better to reduce build times (e.g. use factories instead of pulling in raw header files like may be the problem for long build times of ROL). That could have a huge impact and Trilinos users are greatly hurting because of long build times (and the reputation of Trilinos is taking a beating because of this).

The wrong response is to simply disable downstream packages. That just further destabilizes Trilinos and causes problems for downstream customers that rely on those packages (like ATDM on Panzer).
Just disabling downstream packages provides the wrong intensive for upstream package developers to reduce the cost of the build of their packages. They need to feel some of the pain to help encourage the right behavior (or to help downstream developers use their packages better to reduce build times).

Computers are cheap compared to people's time and a full from-scratch checkin-test-sems.sh --do-all --push on a standard CEE LAN machine takes less than 2 hours. And broken code is not just $$ cost, it is time cost which is even worse in many cases.

bartlettroscoe commented 7 years ago

Someone made a changes to the apache configuration on testing.sandia.gov yesterday and broke all the submissions of test results to the CDash site. Not clear who did this but our handy Kitware contractor fixed the issue (after I suggested it might be due to a change in http to https mapping that changed yesterday that broke links for the http:// address).

Therefore, I have restarted the CI server running on ceerws1113 and it is already producing results shown at:

https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2939367

This just shows that you need to keep a careful eye on CDash throughout the day.

bartlettroscoe commented 7 years ago

FYI: There was a build CI failure in Teko this morning addressed in #1449. This has already been resolved and should not block anyone's pushes using checkin-test-sems.sh --do-all --push.

NOTE: I set up my own CDash account on testing.sandia.gov/cdash/ and signed up to get all emails from all CDash builds. I then created an Outlooks email filter to send all of those emails to a folder except for this standard CI build. Because of that, got the following email this morning and I was able to triage this failure and resolve it in less than an hour. (This is what the @trilinos/framework on-call team members need to do as well. I will create a GitHub issue explaining this approach soon.)

From: CDash [mailto:trilinos-regression@sandia.gov] Sent: Thursday, June 22, 2017 8:14 AM To: Bartlett, Roscoe A rabartl@sandia.gov Subject: FAILED (b=1): Trilinos/Teko - Linux-GCC-4.9.3- MPI_RELEASE_DEBUG_SHARED_PT_CI - Continuous

A submission to CDash for the project Trilinos has build errors. You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.

Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=2961631

Project: Trilinos SubProject: Teko Site: ceerws1113 Build Name: Linux-GCC-4.9.3-MPI_RELEASE_DEBUG_SHARED_PT_CI Build Time: 2017-06-22T12:01:53 UTC Type: Continuous Errors: 1

Error Error copying file "/scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/teko/tests/data/l sc_B_2.mm" to "/scratch/rabartl/Trilinos.base/SEMSCIBuild/BUILD/packages/teko/tests/data/ls c_B_2.mm".

-CDash on testing.sandia.gov

bartlettroscoe commented 7 years ago

One issue related to this story is the instability of the CI build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED that the Trilinos Framework team runs on the machine sadl30906.srn.sandia.gov as shown at:

https://testing.sandia.gov/cdash/index.php?project=Trilinos&display=project&filtercount=2&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Continuous&field2=buildstarttime&compare2=84&value2=now

While the CI build ceerws1113 that I run on machine ceerws1113 has never failed in the last 6 weeks, the CI build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED run on sadl30906.srn.sandia.gov has failed 23 times. A total of 22 of these failed CI builds are compiler crashes trying to build some ROL examples and one is a crash of MPI.

So while the code itself it not bad, having 23 false failures in the last month is still very bad because it teaches the ROL developers (@trilinos/rol) to ignore CDash error notification emails and otherwise makes it hard to see real failures. We can't have this situation going forward.

The best solution is to make the ROL example using something like Stratimikos to cut down on the build and link times. But that will take time and buy-in by the @trilinos/rol team. Therefore, unless the ROL team is ready to do this work soon, we need to make these failures go away some other way.

Therefore, because of the problems with building ROL examples, I would like to suggest that ROL be disabled in the Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED build run on sadl30906.srn.sandia.gov, until ROL examples can be fixed to build with less memory (e.g. by using Stratimikos). But because ROL is already tested in the other CI build and in the pre-oush checkin-test-sems.sh script, we really not not loosing any testing if we disable ROL in the Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED build.

Therefore, unless there is a strong objection, I will remove ROL form the Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED build run on sadl30906.srn.sandia.gov.

bartlettroscoe commented 7 years ago

The standard CI build was broken yesterday with build failures in Ifpack2 shown at:

https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3018635

with the new commits pulled in this CI iteration shown at:

https://testing.sandia.gov/cdash/viewNotes.php?buildid=3018636##note2

which shows one commit:

b053e8f:  Ifpack2: Fixed unused typedef warnings
Author: Brian Kelley <bmkelle@sandia.gov>
Date:   Fri Jul 21 09:45:35 2017 -0600

M   packages/ifpack2/test/unit_tests/Ifpack2_UnitTestAmesos2solver.cpp

The push logger I set up (see #1362) showed this push:

Mon Jul 24 09:01:52 MDT 2017

commit b053e8f6d8a75d7d447fe649db54a01eaafe8294
Author:     Brian Kelley <bmkelle@sandia.gov>
AuthorDate: Fri Jul 21 09:45:35 2017 -0600
Commit:     Brian Kelley <bmkelle@sandia.gov>
CommitDate: Mon Jul 24 08:56:16 2017 -0600

    Ifpack2: Fixed unused typedef warnings

Commits pushed:
b053e8f Ifpack2: Fixed unused typedef warnings

As shown in that git log, there is no evidence that the checkin-test-sems.sh script was used for this push (which would explain the build failure).

Fortunately, the build errors were fixed in a later CI iteration yesterday shown at:

https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3018839

which showed the updated commits at:

https://testing.sandia.gov/cdash/viewNotes.php?buildid=3018840##note2

which showed the new commit pulled:

0ff86b6:  Ifpack2: Fixe build error from b053e8f
Author: Brian Kelley <bmkelle@sandia.gov>
Date:   Mon Jul 24 13:03:43 2017 -0600

M   packages/ifpack2/test/unit_tests/Ifpack2_UnitTestAmesos2solver.cpp

which corresponds to the push:

Mon Jul 24 13:04:22 MDT 2017

commit 0ff86b61f2d96050b74daee93f50fd266e9527bd
Author:     Brian Kelley <bmkelle@sandia.gov>
AuthorDate: Mon Jul 24 13:03:43 2017 -0600
Commit:     Brian Kelley <bmkelle@sandia.gov>
CommitDate: Mon Jul 24 13:04:04 2017 -0600

    Ifpack2: Fixe build error from b053e8f

Commits pushed:
0ff86b6 Ifpack2: Fixe build error from b053e8f

Fortunately, only the tests for Ifpack2 were broken, not the libraries. That allowed two pushes yesterday to downstream packages from Ifpack2 to go through which used the checkin-test-sems.sh script shown below:

Mon Jul 24 12:48:53 MDT 2017

commit 674270e7cb7aebcaa9992d93992ec84cac7943c8
Author:     Tobias Wiesner <tawiesn@sandia.gov>
AuthorDate: Mon Jul 24 09:58:26 2017 -0600
Commit:     Tobias Wiesner <tawiesn@sandia.gov>
CommitDate: Mon Jul 24 12:46:05 2017 -0600

    MueLu: add some more comments to MueLu_Test_ETI

    fix nightly tests regarding meaningless boolean success flag

    Build/Test Cases Summary
    Enabled Packages: MueLu
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=422,notpassed=0 (165.13 min)

Commits pushed:
674270e MueLu: add some more comments to MueLu_Test_ETI

Mon Jul 24 09:31:48 MDT 2017

commit 64136ca3911999f600d184084efe61fefb6ca10c
Merge: b053e8f 24c2d08
Author:     Curtis C. Ober <ccober@sandia.gov>
AuthorDate: Mon Jul 24 09:23:52 2017 -0600
Commit:     Curtis C. Ober <ccober@sandia.gov>
CommitDate: Mon Jul 24 09:30:46 2017 -0600

    Merge remote branch 'intermediate-repo/develop' into develop

    Build/Test Cases Summary
    Enabled Packages: Tempus
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=151,notpassed=0 (6.75 min)
    Other local commits for this build/test group: 24c2d08, aac23ee

Commits pushed:
64136ca Merge remote branch 'intermediate-repo/develop' into develop
24c2d08 Tempus: Reduce runtime to avoid "Timeout" in debug.
aac23ee Tempus: 2nd and 3rd order IMEX-RK is working!

However, if someone would have tried to push a change to one of the upstream packages to Ifpack2 which are:

Final set of enabled packages: Kokkos Teuchos KokkosKernels RTOp Epetra Zoltan Triutils Tpetra TrilinosSS EpetraExt Thyra Xpetra Isorropia AztecOO Galeri Amesos Pamgen Zoltan2 Ifpack ML Belos Amesos2 Ifpack2 23

then their push would have been blocked due to this breaking test build.

Therefore, we got lucky here in that:

Only the tests, not the libs, for Ifpack2 were broken
People who were using the checkin-test-sems.sh script to push while Ifpack2 tests were broken just happened to be testing packages downstream from Ifpack2 (and therefore did not enable the Ifpack2 tests)
The breaking build was fixed very quickly (only about 4 hours later)

bartlettroscoe commented 7 years ago

NOTE: We may not see any more ROL example build failures on Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED build run on sadl30906.srn.sandia.gov described above because ROL examples now using ETI (see #1514). We will watch for another month or so and see.

bartlettroscoe commented 7 years ago

@dridzal, did the change you mention in #1514 get into the Trilinos 'develop' branch? We saw another internal compiler error for ROL just earlier today:

https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3020624

mhoemmen commented 7 years ago

Does ROL still require GCC 4.7.2? Kokkos and Tpetra have since moved on to a minimum of GCC 4.9.3, if I remember right.

dridzal commented 7 years ago

@mhoemmen : ROL is at GCC 4.9.3. The core code will compile with GCC 4.4.x, but certain examples and adapters require later versions (nothing beyond 4.9.3).

@bartlettroscoe : The change in #1514 is in the ROL-Trilinos develop branch, however it hasn't been merged yet with the Trilinos develop branch. There was an additional change that reduced memory requirements even further. I hope that this will take care of sadl issues. We plan to merge later this week, or early next week. Do you need it sooner?

ibaned commented 7 years ago

I think the Trilinos CI build is actually using GCC 4.8.4, so that is the practical minimum version for important Trilinos packages.

trilinos / Trilinos

Address basic stability of Trilinos 'develop' branch short-term #1304