@jhux2, do the MueLu developers see these CI failures, or are they hidden by the other CDash Nightly failure emails?
We need to work with Kitware to get a better CDash email notification system set up. Nightly failures should likely be sent out as a digest at the end of the day and CI failures should be sent out ASAP.
@bartlettroscoe I'll fix this. Personally, I don't get CI emails.
Personally, I don't get CI emails.
Can we get your ORNL email back on the muelu-regressions mail list?
Hmm, apparently I'm not subscribed to quite a few mailing lists. I'll fix it.
Fix pushed in cc153ef.
@jhux2, do the MueLu developers see these CI failures, or are they hidden by the other CDash Nightly failure emails?
We need to work with Kitware to get a better CDash email notification system set up. Nightly failures should likely be sent out as a digest at the end of the day and CI failures should be sent out ASAP.
If these are failures from crf450.srn.sandia.gov, then yes, those failures go to the muelu regression mailing list. I don't recall seeing this yesterday, but Thunderbird says one came in at 1:52pm PT on Dec. 7th. The bulk of the failures arrived after 5:30pm PT on 12/7.
@aprokop Thanks for fixing this.
They still seem to be failing:
Where are the fixing commits? If you reference the issue ID (i.e. #910) in the commit message, then GitHub will provide the linkage.
I am reopening until we can confirm these are fixed. In the future, we should not close an Issue for a CI failure until we see it passing in the CI build on CDash.
Taking this issue.
@jhux2, actually I'm already looking at this. Almost positive this is a result of a change I made re: LSTS.
Ok. I think this should fix it:
file Ifpack2_RILUK_def.hpp, around line 1346:

  if (! L_solver_.is_null ()) os << ", " << L_solver_->description ();
  if (! U_solver_.is_null ()) os << ", " << U_solver_->description ();

  os << "}";
  return os.str ();
}
@ambrad My checkin script is running, should I let it continue?
@jhux2, thanks! Sorry about the trouble. I agree, that's the fix: I was also setting up my checkin run, but have now killed it.
@ambrad No trouble at all.
@jhux2 So sorry. I could not run the checkin script, and forgot to add the Ifpack2 file.
Looks like these MueLu tests are passing in the most recent CI build:
http://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=2648200
Do any MueLu developers not have access to a RHEL 6 machine with the SEMS Env where they can use checkin-test-sems.sh to safely push to Trilinos? Otherwise, please use:
$ ./checkin-test-sems.sh --do-all --push
or use the remote pull, test, and push process to safely test and push from a RHEL 6 machine.
@bartlettroscoe The checkin script automatically started a complex build. I invoked it like you specify above. In this case, complex was overkill. Is there some way to disable certain builds?
Do any MueLu developers not have access to a RHEL 6 machine with the SEMS Env where they can use checkin-test-sems.sh to safely push to Trilinos?
I did not have one yesterday (do have one today). I tried running the checkin script from within a Docker container but it failed as it could not find ParMETIS (Docker containers based on standard repos like Fedora don't provide ParMETIS).
I tried running the checkin script from within a Docker container but it failed as it could not find ParMETIS (Docker containers based on standard repos like Fedora don't provide ParMETIS).
@aprokop, that is a signal that you don't have the SEMS Env modules. What RHEL 6 machine is this? Is this an SNL machine? If so, someone with sudo should be able to get the SEMS env mounted. See:
If this is not an SNL RHEL 6 machine, I suspect that you might be able to rsync the contents over to your ORNL RHEL 6 machine. That might be an approach that would work for non-SNL Trilinos developers. If you are interested, let's talk offline about how you might do that.
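Just as a rough sketch of what that mirroring might look like (the host name and mount path below are placeholders, not the real SEMS locations):

$ rsync -avz some-snl-rhel6-machine:/projects/sems/ /projects/sems/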
@bartlettroscoe This was not a RHEL6 machine, but I set up access to a SNL RHEL6 machine yesterday, so I'm good for the future.
but I set up access to a SNL RHEL6 machine yesterday, so I'm good for the future.
@aprokop, that is good. But it likely means that you will need to use the Alternative branch workflow involving the GitHub repo to move your commits from your ORNL machine to the SNL RHEL 6 machine where you will run the checkin-test-sems.sh script. Let me know if you have any problems with this approach or suggestions for making it better.
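For example, the hand-off might look roughly like this (a sketch only; the remote and branch names are placeholders):

# On the ORNL machine: push the topic branch up to the GitHub repo
$ git push origin my-topic-branch
# On the SNL RHEL 6 machine: fetch the branch, then run the checkin-test script from there
$ git fetch origin
$ git checkout my-topic-branch
$ ./checkin-test-sems.sh --do-all --push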
The checkin script automatically started a complex build. I invoked it like you specify above. In this case, complex was overkill. Is there some way to disable certain builds?
Hmm, that should not be. Can we get together over S4B and pair-program to see what is happening on your machine?
@bartlettroscoe, just to be clear, this was not actually a MueLu problem. I broke a MueLu test with a commit I made to Ifpack2. I made the mistake of disabling forward packages in my checkin process. I didn't mean to do that, but had been figuring something out in the checkin build and forgot to then remove that flag from my checkin invocation once I figured things out. Hence I mistakenly pushed to Ifpack2 without running anything other than Ifpack2 tests.
Do any MueLu developers not have access to a RHEL 6 machine with the SEMS Env where they can use checkin-test-sems.sh to safely push to Trilinos? Otherwise, please use:
@bartlettroscoe So it seems the process for running the checkin script has changed slightly. Could you point me to some short documentation to make sure I have everything set up correctly?
So it seems the process for running the checkin script has changed slightly. Could you point me to some short documentation to make sure I have everything set up correctly?
@jhux2, it should now be trivial to set up and use on a RHEL 6 machine that has the SEMS env. See:
https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing
Please give me feedback if you are willing.
As of now, this is the only viable approach for an effective CI system for Trilinos given current staff and tools.
@bartlettroscoe Ah, the process has changed. So just to be clear, developers should not run Trilinos/checkin-test.py?
Information from @bartlettroscoe:
The new CI build is meant to be run with the checkin-test-sems.sh script:
https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing
That guarantees that everyone is using the identical SEMS env. I set it up this way to allow for the very rare case where someone without access to the SEMS env could install their own identical env (same GCC, OpenMPI, TPLs, etc.) and still use the checkin-test.py script (but that should be very rare).
If you want to run the straight checkin-test.py script and skip the complex build, you will need to run it with:
--default-builds=MPI_RELEASE_DEBUG_SHARED_PT
The build MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX is there to allow people to run the complex build as well for more detailed testing. But to run it, they have to be explicit with:
--default-builds=MPI_RELEASE_DEBUG_SHARED_PT,MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX
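For example (just a sketch, assuming the script is invoked the same way checkin-test-sems.sh is invoked above), skipping the complex build would look like:

$ ./checkin-test.py --default-builds=MPI_RELEASE_DEBUG_SHARED_PT --do-all --push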
Just curious, why is it called "RELEASE_DEBUG"? Is it release, or is it debug?
Ah, the process has changed. So just to be clear, developers should not run Trilinos/checkin-test.py?
@jhux2, good point. I will remove that script so that people don't use it by default anymore. I will also send out an email to the Trilinos developers pointing to the new checkin-test-sems.sh script for those who still want to use the checkin-test.py script for pushing to Trilinos (which is still obviously very optional since many people have never used it).
Just curious, why is it called "RELEASE_DEBUG"? Is it release, or is it debug?
@mhoemmen, the build name MPI_RELEASE_DEBUG_SHARED_PT maps to the configure args (in order):
-DTPL_ENABLE_MPI=ON \
-DCMAKE_BUILD_TYPE=RELEASE \
-DTrilinos_ENABLE_DEBUG=ON \
-DBUILD_SHARED_LIBS=ON \
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF \
This naming scheme proved to be less confusing for people in the CASL project. Before, just calling this a DEBUG build made people assume that you could effectively run a debugger on this build. You can't. The DEBUG is for runtime debug checking (i.e. array bounds checking, ptr checking, etc.). But we want full compiler optimizations so that the tests run as fast as they can (subject to the runtime debug-mode checking).
Make sense?
@bartlettroscoe I understand your point, but it's confusing to me as it mixes with CMake build types, and my assumption (wrong!) was that RELEASE_DEBUG would imply the RelWithDebInfo CMake build type.
I understand your point, but it's confusing to me as it mixes with CMake build types, and my assumption (wrong!) was that RELEASE_DEBUG would imply the RelWithDebInfo CMake build type.
@aprokop, Sorry about that. But in that case the build would have been called something like MPI_RELWITHDEBINFO_SHARED_PT.
In the end, a name can't encode all the options that need to be set to pin down this build so you just have to look at the full set of configure arguments (which you can see in do-configure.base file that gets written by the checkin-test.py script).
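To spell out the distinction (a rough sketch; the compiler flags shown are just CMake's usual GCC defaults, not something the build sets explicitly):

CMAKE_BUILD_TYPE=RELEASE         -> roughly -O3 -DNDEBUG  (what this CI build uses)
CMAKE_BUILD_TYPE=RELWITHDEBINFO  -> roughly -O2 -g -DNDEBUG  (what the name suggested)
Trilinos_ENABLE_DEBUG=ON         -> runtime checks (array bounds, ptr checking), independent of the compiler flags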
@bartlettroscoe Since you asked for feedback ... the (old) checkin script emits a ton of information, which is helpful when something goes wrong and is, I suppose, saved in the log file checkin-test.out.
What if instead the script had minimal screen output, but summarized what's happening and the types of builds that are enabled?
Something like ...
Updating Trilinos ... worked!
Configurations enabled: MPI_RELEASE_DEBUG_SHARED_PT MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX
Configuration 1: MPI_RELEASE_DEBUG_SHARED_PT
-DTPL_ENABLE_MPI=ON
-DCMAKE_BUILD_TYPE=RELEASE
-DTrilinos_ENABLE_DEBUG=ON
-DBUILD_SHARED_LIBS=ON
-DTrilinos_ENABLE_SECONDARY_STABLE_CODE=OFF
running cmake ... worked!
compiling ... worked!
etc.
Just my two cents.
Since you asked for feedback ... the (old) checkin script emits a ton of information, which is helpful when something goes wrong and is, I suppose, saved in the log file checkin-test.out.
The new one does as well (actually, the underlying checkin-test.py script never changed).
Updating Trilinos ... worked!
Configurations enabled: MPI_RELEASE_DEBUG_SHARED_PT MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX
Configuration 1: MPI_RELEASE_DEBUG_SHARED_PT
-DTPL_ENABLE_MPI=ON
-DCMAKE_BUILD_TYPE=RELEASE
-DTrilinos_ENABLE_DEBUG=ON
-DBUILD_SHARED_LIBS=ON
-DTrilinos_ENABLE_SECONDARY_STABLE_CODE=OFF
running cmake ... worked!
compiling ... worked!
etc.
I agree and I think we are on the same page. A while back I wrote the following backlog item for this:
that I listed here:
If interested, we can create a new TriBITS GitHub issue for this and flesh out the requirements.
But note that it may not make much sense to do a lot more development work on the checkin-test.py script. I am hearing that Trilinos is now going to be pushing for the SST model where everyone works on a topic branch, pushes branches to the GitHub repo, creates a PR, and then some automated system tests all of Trilinos for every change fully (even a single comment in a single text file) before merging to the 'develop' branch (which is exactly what I argued for in this document a long time ago). But it is unclear who will do this and when it might occur.
CC: @trilinos/framework, @trilinos/muelu
Description:
A push last night broke two MueLu tests. The most current CI build still shows the failures:
No one can use the checkin-test-sems.sh script to push while these tests are failing (unless they disable these tests, which is what I did last night).
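For reference, one way to disable an individual test at configure time, if I recall correctly, is the TriBITS per-test disable option (the test name below is just a placeholder, not one of the actual failing tests):

-D<fullTestName>_DISABLE=ON    # e.g. -DMueLu_SomeFailingTest_MPI_4_DISABLE=ON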