trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Failing MueLu tests in CI build starting 12/7/2016 #910

Closed bartlettroscoe closed 7 years ago

bartlettroscoe commented 7 years ago

CC: @trilinos/framework, @trilinos/muelu

Description:

A push last night broke two MueLu tests. The most current CI build still shows the failures:

No one can use the checkin-test-sems.sh script to push while these tests are failing (unless they disable these tests, which is what I did last night).

bartlettroscoe commented 7 years ago

@jhux2, do the MueLu developers see these CI failures, or are they hidden among the other CDash Nightly failure emails?

We need to work with Kitware to get a better CDash email notification system set up. Nightly failures should likely be sent out as a digest at the end of the day and CI failures should be sent out ASAP.

aprokop commented 7 years ago

@bartlettroscoe I'll fix this. Personally, I don't get CI emails.

bartlettroscoe commented 7 years ago

Personally, I don't get CI emails.

Can we get your ORNL email back on the muelu-regressions mail list?

aprokop commented 7 years ago

Hmm, apparently I'm not subscribed to quite a few mailing lists. I'll fix it.

aprokop commented 7 years ago

Fix pushed in cc153ef.

jhux2 commented 7 years ago

@jhux2, do the MueLu developers see these CI failures, or are they hidden among the other CDash Nightly failure emails?

We need to work with Kitware to get a better CDash email notification system set up. Nightly failures should likely be sent out as a digest at the end of the day and CI failures should be sent out ASAP.

If these are failures from crf450.srn.sandia.gov, then yes, those failures go to the muelu regression mailing list. I don't recall seeing this yesterday, but Thunderbird says one came in at 1:52pm PT on Dec. 7th. The bulk of the failures arrived after 5:30pm PT on 12/7.

@aprokop Thanks for fixing this.

bartlettroscoe commented 7 years ago

They still seem to be failing:

Where are the fixing commits? If you reference the issue ID (i.e. #910) in the commit message, then GitHub will provide the linkage.

I am reopening until we can confirm these are fixed. In the future, we should not close an issue for a CI failure until we see it passing in the CI build on CDash.

jhux2 commented 7 years ago

Taking this issue.

ambrad commented 7 years ago

@jhux2, actually I'm already looking at this. I'm almost positive this is a result of a change I made re: LSTS.

jhux2 commented 7 years ago

Ok. I think this should fix it:

file Ifpack2_RILUK_def.hpp:

1346   if (! L_solver_.is_null ()) os << ", " << L_solver_->description ();
1347   if (! U_solver_.is_null ()) os << ", " << U_solver_->description ();
1348 
1349   os << "}";
1350   return os.str ();
1351 }

jhux2 commented 7 years ago

@ambrad My checkin script is running, should I let it continue?

ambrad commented 7 years ago

@jhux2, thanks! Sorry about the trouble. I agree, that's the fix: I was also setting up my checkin run, but have now killed it.

jhux2 commented 7 years ago

@ambrad No trouble at all.

aprokop commented 7 years ago

@jhux2 So sorry. I could not run the checkin script, and forgot to add the Ifpack2 file.

bartlettroscoe commented 7 years ago

Looks like these MueLu tests are passing in the most recent CI build:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2648200

Do any MueLu developers not have access to a RHEL 6 machine with the SEMS Env where they can use checkin-test-sems.sh to safely push to Trilinos? If you do have access, please use:

$ ./checkin-test-sems.sh --do-all --push

or use the remote pull, test, and push process to safely test and push from a RHEL 6 machine.

jhux2 commented 7 years ago

Looks like these MueLu tests are passing in the most recent CI build:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2648200

Do any MueLu developers not have access to a RHEL 6 machine with the SEMS Env where they can use checkin-test-sems.sh to safely push to Trilinos? If you do have access, please use:

$ ./checkin-test-sems.sh --do-all --push

or use the remote pull, test, and push process to safely test and push from a RHEL 6 machine.

@bartlettroscoe The checkin script automatically started a complex build. I invoked it as you specified above. In this case, the complex build was overkill. Is there some way to disable certain builds?

aprokop commented 7 years ago

Do any MueLu developers not have access to a RHEL 6 machine with the SEMS Env where they can use checkin-test-sems.sh to safely push to Trilinos?

I did not have one yesterday (I do have one today). I tried running the checkin script from within a Docker container, but it failed because it could not find ParMetis (Docker containers based on standard repos like Fedora don't provide ParMetis).

bartlettroscoe commented 7 years ago

I tried running the checkin script from within a Docker container, but it failed because it could not find ParMetis (Docker containers based on standard repos like Fedora don't provide ParMetis).

@aprokop, that is a signal that you don't have the SEMS Env modules. What RHEL 6 machine is this? Is this an SNL machine? If so, someone with sudo should be able to get the SEMS env mounted. See:

If this is not an SNL RHEL 6 machine, I suspect that you might be able to rsync the contents over to your ORNL RHEL 6 machine. That might be an approach that could work for non-SNL Trilinos developers. If you are interested, let's talk offline about how you might do that.
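
As a very rough sketch of that rsync idea (the hostname and paths below are placeholders, not the actual SEMS locations, which would need to be confirmed first):

    # Placeholder hostname and paths only; confirm the real SEMS mount
    # point before trying this.
    $ rsync -av your-user@snl-machine:/path/to/sems/ /path/to/sems/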

aprokop commented 7 years ago

@bartlettroscoe This was not a RHEL 6 machine, but I set up access to an SNL RHEL 6 machine yesterday, so I'm good for the future.

bartlettroscoe commented 7 years ago

but I set up access to an SNL RHEL 6 machine yesterday, so I'm good for the future.

@aprokop, that is good. But it likely means that you will need to use the Alternative branch workflow involving the GitHub repo to move your commits from your ORNL machine to the SNL RHEL 6 machine where you will run the checkin-test-sems.sh script. Let me know if you have any problems with this approach or suggestions for making it better.
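
A minimal sketch of that branch workflow (the remote and branch names below are just placeholders):

    # On the ORNL machine: push the topic branch to the GitHub repo
    # (or a fork); 'origin' and 'my-topic-branch' are placeholders.
    $ git push origin my-topic-branch

    # On the SNL RHEL 6 machine: fetch and check out the same branch,
    # then run checkin-test-sems.sh from there.
    $ git fetch origin
    $ git checkout my-topic-branch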

bartlettroscoe commented 7 years ago

The checkin script automatically started a complex build. I invoked it as you specified above. In this case, the complex build was overkill. Is there some way to disable certain builds?

Hmm, that should not be. Can we get together over S4B and pair-program to see what is happening on your machine?

ambrad commented 7 years ago

@bartlettroscoe, just to be clear, this was not actually a MueLu problem. I broke a MueLu test with a commit I made to Ifpack2. I made the mistake of disabling forward packages in my checkin process. I didn't mean to do that; I had been figuring something out in the checkin build and forgot to remove that flag from my checkin invocation once I figured things out. Hence I mistakenly pushed to Ifpack2 without running anything other than the Ifpack2 tests.

jhux2 commented 7 years ago

Do any MueLu developers not have access to a RHEL 6 machine with the SEMS Env where they can use checkin-test-sems.sh to safely push to Trilinos? If you do have access, please use:

@bartlettroscoe So it seems the process for running the checkin script has changed slightly. Could you point me to some short documentation to make sure I have everything set up correctly?

bartlettroscoe commented 7 years ago

So it seems the process for running the checkin script has changed slightly. Could you point me to some short documentation to make sure I have everything set up correctly?

@jhux2, it should now be trivial to set up and use on a RHEL 6 machine that has the SEMS env. See:

https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing

Please give me feedback if you are willing.

As of now, this is the only viable approach for an effective CI system for Trilinos given current staff and tools.

jhux2 commented 7 years ago

So it seems the process for running the checkin script has changed slightly. Could you point me to some short documentation to make sure I have everything set up correctly?

@jhux2, it should now be trivial to set up and use on a RHEL 6 machine that has the SEMS env. See:

https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing

Please give me feedback if you are willing.

As of now, this is the only viable approach for an effective CI system for Trilinos given current staff and tools.

@bartlettroscoe Ah, the process has changed. So just to be clear, developers should not run Trilinos/checkin-test.py?

jhux2 commented 7 years ago

Information from @bartlettroscoe:

The new CI build is meant to be run with the checkin-test-sems.sh script:

   https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing

That guarantees that everyone is using the identical SEMS env. I set it up this way to allow for the rare case where someone without access to the SEMS env could install their own identical env (same GCC, OpenMPI, TPLs, etc.) and still use the checkin-test.py script (but that should be very rare).

If you want to run the straight checkin-test.py script and skip the complex build, you will need to run it with:

   --default-builds=MPI_RELEASE_DEBUG_SHARED_PT

The build MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX is there to allow people to run the complex build as well for more detailed testing. But to run it, they have to be explicit with:

   --default-builds=MPI_RELEASE_DEBUG_SHARED_PT,MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX
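
For illustration, here is a sketch of an invocation that tests and pushes while skipping the complex build, combining the options above (this assumes checkin-test-sems.sh passes extra options through to checkin-test.py):

   # Sketch only: enable just the primary build, then do the full
   # test-and-push cycle (option pass-through is an assumption).
   $ ./checkin-test-sems.sh \
       --default-builds=MPI_RELEASE_DEBUG_SHARED_PT \
       --do-all --push
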
mhoemmen commented 7 years ago

Just curious, why is it called "RELEASE_DEBUG"? Is it release, or is it debug?

bartlettroscoe commented 7 years ago

Ah, the process has changed. So just to be clear, developers should not run Trilinos/checkin-test.py?

@jhux2, good point. I will remove that script so that people don't use it by default anymore. I will also send out an email to the Trilinos developers pointing to the new checkin-test-sems.sh script for those who still want to use the checkin-test.py script for pushing to Trilinos (which is of course still optional, since many people have never used it).

Just curious, why is it called "RELEASE_DEBUG"? Is it release, or is it debug?

@mhoemmen, the build name MPI_RELEASE_DEBUG_SHARED_PT maps to the configure args (in order):

-DTPL_ENABLE_MPI=ON \
-DCMAKE_BUILD_TYPE=RELEASE \
-DTrilinos_ENABLE_DEBUG=ON \
-DBUILD_SHARED_LIBS=ON \
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF \

This naming scheme proved to be less confusing for people in the CASL project. Before, just calling this a DEBUG build made people assume that you could effectively run a debugger on this build. You can't. The DEBUG is for runtime debug checking (i.e. array bounds checking, ptr checking, etc.). But we want full compiler optimizations so that the tests run as fast as they can (subject to the runtime debug-mode checking).

Make sense?

aprokop commented 7 years ago

@bartlettroscoe I understand your point, but it's confusing to me as it mixes with CMake build types, and my assumption (wrong!) was that RELEASE_DEBUG would imply the RelWithDebInfo CMake build type.

bartlettroscoe commented 7 years ago

I understand your point, but it's confusing to me as it mixes with CMake build types, and my assumption (wrong!) was that RELEASE_DEBUG would imply the RelWithDebInfo CMake build type.

@aprokop, Sorry about that. But in that case the build would have been called something like MPI_RELWITHDEBINFO_SHARED_PT.

In the end, a name can't encode all the options that need to be set to pin down this build, so you just have to look at the full set of configure arguments (which you can see in the do-configure.base file that gets written by the checkin-test.py script).
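
As a rough illustration only (not the literal file contents), a do-configure.base fragment for the MPI_RELEASE_DEBUG_SHARED_PT build would be expected to contain roughly the options listed above:

   # Hypothetical sketch; the real do-configure.base written by
   # checkin-test.py is the authoritative list of options.
   cmake \
     -DTPL_ENABLE_MPI=ON \
     -DCMAKE_BUILD_TYPE=RELEASE \
     -DTrilinos_ENABLE_DEBUG=ON \
     -DBUILD_SHARED_LIBS=ON \
     -DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF \
     /path/to/Trilinos  # placeholder source directory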

jhux2 commented 7 years ago

@aprokop, Sorry about that. But in that case the build would have been called something like MPI_RELWITHDEBINFO_SHARED_PT.

In the end, a name can't encode all the options that need to be set to pin down this build, so you just have to look at the full set of configure arguments (which you can see in the do-configure.base file that gets written by the checkin-test.py script).

@bartlettroscoe Since you asked for feedback ... the (old) checkin script emits a ton of information, which is helpful when something goes wrong and is, I suppose, saved in the log file checkin-test.out.

What if, instead, the script had minimal screen output but summarized what's happening and the types of builds that are enabled?

Something like ...

Updating Trilinos ... worked!
Configurations enabled:  MPI_RELEASE_DEBUG_SHARED_PT  MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX

Configuration 1:   MPI_RELEASE_DEBUG_SHARED_PT

     -DTPL_ENABLE_MPI=ON
     -DCMAKE_BUILD_TYPE=RELEASE
     -DTrilinos_ENABLE_DEBUG=ON
     -DBUILD_SHARED_LIBS=ON
     -DTrilinos_ENABLE_SECONDARY_STABLE_CODE=OFF

running cmake ... worked!
compiling ... worked!
etc.

Just my two cents.

bartlettroscoe commented 7 years ago

Since you asked for feedback ... the (old) checkin script emits a ton of information, which is helpful when something goes wrong and is, I suppose, saved in the log file checkin-test.out.

The new one does as well (actually, the underlying checkin-test.py script never changed).

Updating Trilinos ... worked!
Configurations enabled:  MPI_RELEASE_DEBUG_SHARED_PT  MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX

Configuration 1:   MPI_RELEASE_DEBUG_SHARED_PT

     -DTPL_ENABLE_MPI=ON
     -DCMAKE_BUILD_TYPE=RELEASE
     -DTrilinos_ENABLE_DEBUG=ON
     -DBUILD_SHARED_LIBS=ON
     -DTrilinos_ENABLE_SECONDARY_STABLE_CODE=OFF

running cmake ... worked!
compiling ... worked!
etc.

I agree and I think we are on the same page. A while back I wrote the following backlog item for this:

that I listed here:

If interested, we can create a new TriBITS GitHub issue for this and flesh out the requirements.

But note that it may not make much sense to do a lot more development work on the checkin-test.py script. I am hearing that Trilinos is now going to be pushing for the SST model where everyone works on a topic branch, pushes branches to the GitHub repo, creates a PR, and then some automated system fully tests all of Trilinos for every change (even a change to a single comment in a single text file) before merging to the 'develop' branch (which is exactly what I argued for in this document a long time ago). But it is unclear who will do this and when this might occur.