
ROL MiniTensor examples/tests fail to build since 12/1/2016 push #899

Closed bartlettroscoe closed 7 years ago

bartlettroscoe commented 7 years ago

CC: @trilinos/framework, @trilinos/rol

Description:

ROL has been failing to build in every Trilinos CI build since yesterday.

This was very likely caused by these commits:

54c2f7d "Intrepid2 MiniTensor: Add dot product operators between matrices and tensors."
Author: Alejandro Mota <amota@sandia.gov>
Date:   Wed Nov 30 17:41:26 2016 -0800 (26 hours ago)

 100.0% packages/intrepid2/core/src/Shared/MiniTensor/

8d1e415 "Intrepid2 MiniTensor: Reworked linear solver and inverse algorithms to improve performance and generalize interface so that the linear solver accepts multiple RHSs in vect
Author: Alejandro Mota <amota@sandia.gov>
Date:   Wed Nov 30 14:36:44 2016 -0800 (26 hours ago)

  87.8% packages/intrepid2/core/src/Shared/MiniTensor/
  12.1% packages/intrepid2/core/test/Shared/MiniTensor/

937a154 "Intrepid2 MiniTensor: Improve performance of direct linear solver and introduce infrastructure for preconditioners. Add two simple instances of preconditioners for use in 
Author: Alejandro Mota <amota@sandia.gov>
Date:   Tue Nov 29 18:32:28 2016 -0800 (26 hours ago)

  96.7% packages/intrepid2/core/src/Shared/MiniTensor/
   3.2% packages/intrepid2/core/test/Shared/MiniTensor/

Can an ROL developer please confirm this failure using the checkin-test-sems.sh script? This is as easy as:

$ cd Trilinos/
$ mkdir CHECKIN/  # Or put this anywhere you want on Linux systems
$ cd CHECKIN/
$ ln -s ../cmake/std/sems/checkin-test-sems.sh .
$ ./checkin-test-sems.sh --enable-all-packages=off --no-enable-fwd-packages \
  --enable-packages=ROL --local-do-all
bartlettroscoe commented 7 years ago

We need to either back out the breaking commits or disable ROL in pre-push CI testing. Having it fail is just going to stop other developers from being able to push using the checkin-test-sems.sh script.

dridzal commented 7 years ago

What is the breaking commit?

bartlettroscoe commented 7 years ago

What is the breaking commit?

See the commits listed in the issue description above.

Is the ROL team getting CDash emails for these CI failures?

dridzal commented 7 years ago

We are getting some of the emails. Unfortunately, they are combined with timeout emails that we've been ignoring.

We haven't caught this because Boost needs to be built to enable this feature, and we don't use Boost. Also, this is an upstream dependency issue. There should really be a test for this in Intrepid2, i.e., ROL should not be the package to catch these problems.

Alejandro Mota, @lxmota, can probably resolve this fairly quickly; however, he may be out of the office. If we need this fixed today, please back out the commits and let Alejandro know.

dridzal commented 7 years ago

Also, can we confirm that Intrepid2 is not failing? Is there no test for this feature in Intrepid2? @mperego ?

lxmota commented 7 years ago

The MiniTensor tests pass. I just looked at the error. It says it's an internal compiler error.

Not sure what to do in this instance.

dridzal commented 7 years ago

I'm not sure either. @bartlettroscoe , how do we start debugging internal compiler errors? I guess we would need that particular compiler first ...

lxmota commented 7 years ago

In the meantime I'll remove the offending line and substitute it with something that hopefully won't choke that particular compiler.

bartlettroscoe commented 7 years ago

how do we start debugging internal compiler errors? I guess we would need that particular compiler first ...

I am afraid we are stuck with GCC 4.7.2 for the CI build for now because that is the default compiler for a very important Trilinos customer that I can't name here. But there is hope that they will be moving to GCC 4.9.x very soon, so this may get better shortly.

bartlettroscoe commented 7 years ago

We are getting some of the emails. Unfortunately, they are combined with timeout emails that we've been ignoring.

Then you need to create Issues for these and disable them until they can be fixed (or the timeouts increased).

We haven't caught this because Boost needs to be built to enable this feature, and we don't use Boost. Also, this is an upstream dependency issue. There should really be a test for this in Intrepid2, i.e., ROL should not be the package to catch these problems.

STK needs Boost, and a very important customer that I can't name here needs STK, MiniTensor, and ROL. Therefore, we need to turn Boost on whenever we test before we push to Trilinos.

Can the ROL developers working on Linux please try using the checkin-test-sems.sh script to push Trilinos? It just requires the one-time setup:

$ cd Trilinos/
$ mkdir CHECKIN/  # Or put this anywhere you want on Linux systems
$ cd CHECKIN/
$ ln -s ../cmake/std/sems/checkin-test-sems.sh .

then whenever you want to push, you just do:

$ cd Trilinos/CHECKIN/
$ ./checkin-test-sems.sh --do-all --push
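# (--do-all pulls, configures, builds, and tests; --push then pushes only if
# everything passes; see ./checkin-test-sems.sh --help for the exact behavior.)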

Since ROL is pretty far down in the dependency chain, you will only be building ROL tests and downstream packages that depend on ROL.

Can you guys please give this a try?

In the meantime I'll remove the offending line and substitute it with something that hopefully won't choke that particular compiler.

@lxmota, do you have access and time on a SNL RHEL 6 COE machine with the SEMS env mounted? If so, you can use the checkin-test-sems.sh script to safely push in the future. If not, I can give you an account on my SRN machine crf450 to allow you to push for now (it looks like you don't push all that often, and the default number of cores used is just 4).

lxmota commented 7 years ago

I normally use the checkin script to push, but because of the Intrepid2 refactor, there are some build errors with MueLu unless Intrepid2_KokkosDynamicView is enabled. But that won't build the MiniTensor tests.

I don't have access to a SNL RHEL 6 COE machine. Since I'm not in 1400 we have to pay for that and it's expensive.

bartlettroscoe commented 7 years ago

I don't have access to a SNL RHEL 6 COE machine. Since I'm not in 1400 we have to pay for that and it's expensive.

@lxmota, then if you have access to the SRN, I will give you an account on my SRN machine crf450. If you only use 4 cores to build and push (and don't mind waiting a while for the build and tests to run) then it would not really impact my usage of that machine too much. Let me know if you are interested in this. We can set up a checkin-test driver that will do this in one shot from your SRN machine at SNL/CA.

mperego commented 7 years ago

I normally use the checkin script to push, but because of the Intrepid2 refactor, there are some build errors with MueLu unless Intrepid2_KokkosDynamicView is enabled. But that won't build the MiniTensor tests.

Is this only a CMake issue? I do not see why the test should not build when Intrepid2_KokkosDynamicView is enabled.

mperego commented 7 years ago

There is a missing link to the MiniTensor test directory in one of the CMakeLists.txt files. I'll try to fix this.

dridzal commented 7 years ago

The ROL team will start using the checkin scripts soon. We have our own methods that take into account a variety of environments and compilers. Note that the culprit here is code in Intrepid2, so we really couldn't have done much to prevent this. But we could have diagnosed it.

These issues would be MUCH simpler to resolve if most Trilinos developers used a common development environment, possibly with several agreed-upon compiler options. I think that we could learn some lessons from our customers here.

lxmota commented 7 years ago

@bartlettroscoe Yes, I'm on the SRN. If you give me an account on your machine, I'll build, test and attempt to push from there.

mhoemmen commented 7 years ago

how do we start debugging internal compiler errors?

GCC 4.7.2 gives me some trouble with lambdas in Kokkos::parallel_* loops. I've been avoiding lambdas in Tpetra; it seems to help with that compiler.

mperego commented 7 years ago

@lxmota, I have enabled MiniTensor tests when Intrepid2_KokkosDynamicView is enabled (9d283db7578cd6)

lxmota commented 7 years ago

@mperego Thanks, MiniTensor tests are now enabled.

@dridzal But there is a timeout in one of ROL's tests that prevents me from pushing:
ROL_example_burgers-control_example_07 (Timeout)

dridzal commented 7 years ago

This is a known issue. You can either ignore the timeout and push manually, or increase the timeout time. I don't know how to do the latter ... @bartlettroscoe ?

lxmota commented 7 years ago

OK, in the interest of fixing the internal compiler error I'll go ahead and push manually.

mhoemmen commented 7 years ago

@dridzal wrote:

... or increase the timeout time. I don't know how to do the latter ...

Add the following command-line argument to the check-in test script, replacing 180 with the number of seconds that you want for the time-out:

--ctest-timeout=180
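
For instance, a minimal sketch of where the flag goes, combined with the push invocation shown earlier in this thread (300 is just an illustrative value):

$ cd Trilinos/CHECKIN/
$ ./checkin-test-sems.sh --do-all --push --ctest-timeout=300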

bartlettroscoe commented 7 years ago

You can either ignore the timeout and push manually, or increase the timeout time. I don't know how to do the latter ...

@lxmota, if you are getting timeouts locally with the checkin-test.py script, then you can override the --ctest-timeout=180 arg in your local-checkin-test-defaults.py file (a default gets written the first time you run checkin-test-sems.sh), or you can just pass in --ctest-timeout=300 or whatever when you run it.

Unfortunately, they are combined with timeout emails that we've been ignoring.

@dridzal, we need to make these timeouts go away so that ROL developers don't have to keep ignoring (or filtering) CDash error emails. We can either disable these tests for these specific builds by setting -D<testname>_DISABLE=ON in the CTest driver script for that build (I can show you which scripts need to be modified), or we can just increase the timeout for individual tests by setting the TIMEOUT <seconds> argument for TRIBITS_ADD_TEST().
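
As a minimal sketch, the per-test disable is just a CMake cache option for the affected build's configure, e.g. (the test name below is the timing-out ROL test mentioned above; exactly where this option gets set in the CTest driver script for a given build may differ):

$ cmake \
    -DROL_example_burgers-control_example_07_DISABLE=ON \
    <other configure options for that build> \
    <path-to-Trilinos-source>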

The CTest driver script can also scale all of the timeouts up for a slow machine.

We cannot tolerate perpetually failing CI and Nightly tests because they just teach people to ignore CDash emails (which seems to be what has happened with ROL developers). We need to get all of this cleaned up. I can help you do some of that.

@jwillenbring and @bmpersc, I will CC you on those emails and Issues so that you know how to do these types of tweaks as well, to help clean up the CDash builds and tests.

bartlettroscoe commented 7 years ago

The MiniTensor tests pass. I just looked at the error. It says it's an internal compiler error.

The build problem is not just with GCC 4.7.3; it also occurs with GCC 4.8.3.

I don't know more because ROL is only Nightly tested on one machine (muir.sandia.gov) and only with GCC 4.7.2 and GCC 4.8.4 :-(

dridzal commented 7 years ago

Ross, I agree with you. But let me clarify what's going on with those tests. A few weeks ago, we cleaned up everything in ROL and received no CDash errors for several days. Then, with minimal changes to ROL (that should have affected nothing), we started seeing these timeouts. OK, then we went and looked at the runtime history for these tests, and guess what -- it does not exist! Or at least it was reset in CDash somehow. Strange. So, how are we supposed to actually fix the problem? Disabling tests is not a solution.

Another note: in most cases, ROL tests fail because we are finding problems with other packages. For some of the test timeouts in question, I fear that there is a strange slowdown in Teuchos LAPACK, because we made no changes to the tests, and now they are running excruciatingly slowly. The main computational cost in those tests is due to Teuchos LAPACK. For the build failures yesterday, it was Intrepid2. A few months ago, some tests were failing because of Amesos2. We fixed those. At the same time, we ran into performance issues with MueLu and were forced to change or disable tests (these tests are still not using MueLu because of unresolved performance issues).

Anyway, my point is that for a package like ROL, which exercises the full linear algebra / solver stack, a failure that we can't resolve quickly almost always points to a deeper issue upstream. If the issues are due to timeouts, then we need a strategy to go back to the last well-performing build and start digging. Please advise.

bartlettroscoe commented 7 years ago

As of the CI iteration on Dec 03, 2016 - 00:41 UTC, this appears to be fixed.

Closing as complete.