trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Segfault in Tpetra's writeDenseFile() #90

Closed dridzal closed 8 years ago

dridzal commented 8 years ago

@trilinos/tpetra: The @trilinos/rol team is using MatrixMarket writers for the PDE-OPT application development kit. It appears that a recent change to Tpetra is causing segfaults in writeDenseFile(), in an MPI build, for example in the data.hpp file in /rol/example/PDE-OPT/poisson:

    void outputTpetraVector(const Teuchos::RCP<const Tpetra::MultiVector<> > &vec,
                            const std::string &filename) const {
      Tpetra::MatrixMarket::Writer<Tpetra::MultiVector<> > vecWriter;
      vecWriter.writeDenseFile(filename, vec);
    }

We are not sure which recent change in Tpetra caused the issue; it was a day or two after the ROL commit d5bcec2467f9087f0117ae7da1db921b3afc4d6d. If you check out this SHA-1, the example in the above directory will run without segfaults, in a parallel MPI build. The issue is not present in a serial build. We have not checked CrsMatrix writers.
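For context on what the writer is expected to produce: the Matrix Market dense ("array") format consists of a banner line, a line with the dimensions, and then the entries in column-major order, one per line. The following is a minimal standard-library sketch of that layout, assuming real-valued entries already gathered on one process. The function name formatDenseMatrixMarket is hypothetical; this is not Tpetra's implementation, which also has to handle distributed data across MPI processes.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Tpetra): format a column-major dense
// matrix or multivector as Matrix Market "array" text.
std::string formatDenseMatrixMarket(const std::vector<double>& values,
                                    std::size_t numRows,
                                    std::size_t numCols) {
  if (values.size() != numRows * numCols) {
    throw std::invalid_argument("values.size() must equal numRows * numCols");
  }
  std::ostringstream out;
  // Banner: object, format, field, and symmetry.
  out << "%%MatrixMarket matrix array real general\n";
  // Dimensions: number of rows, then number of columns.
  out << numRows << " " << numCols << "\n";
  // Entries in column-major order, one per line.
  out << std::scientific << std::setprecision(16);
  for (double v : values) {
    out << v << "\n";
  }
  return out.str();
}
```

A three-entry vector would be formatted with formatDenseMatrixMarket({1.0, 2.0, 3.0}, 3, 1); the result can be compared against the file written by the failing example when checking whether output is truncated or missing.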

mhoemmen commented 8 years ago

Hi Denis! Could you explain here exactly what test to run and how to run it? Does it need any TPLs or optional packages?

mhoemmen commented 8 years ago

Is this the "ROL_example_PDE-OPT_stefan-boltzmann_example_02" that's currently failing on the Dashboard? If so, it looks like the example only requires Intrepid, Epetra, Amesos, Amesos2, and Tpetra, per rol/example/PDE-OPT/stefan-boltzmann/CMakeLists.txt, lines 3-7.

mhoemmen commented 8 years ago

I was able to run the example in question. I get a bunch of normal-looking output, ending with the following, and then it hangs.

2: Check ScalarLinearEqualityConstraint
2: Step size          norm(Jac*vec)      norm(FD approx)    norm(abs error)
2: ---------          -------------      ---------------    ---------------
2: 1.00000000000e+00  6.74831509765e+00  6.74831509765e+00  1.77635683940e-15
2: 1.00000000000e-01  6.74831509765e+00  6.74831509765e+00  1.77635683940e-15
2: 1.00000000000e-02  6.74831509765e+00  6.74831509765e+00  2.75335310107e-14
2: 1.00000000000e-03  6.74831509765e+00  6.74831509765e+00  1.59872115546e-14
2: 1.00000000000e-04  6.74831509765e+00  6.74831509765e+00  7.93143328792e-13
2: 1.00000000000e-05  6.74831509765e+00  6.74831509767e+00  2.41069386675e-11
2: 1.00000000000e-06  6.74831509765e+00  6.74831509773e+00  7.96189780772e-11
2: 1.00000000000e-07  6.74831509765e+00  6.74831509673e+00  9.19581744085e-10
2: 1.00000000000e-08  6.74831509765e+00  6.74831511782e+00  2.01746557238e-08
2: 1.00000000000e-09  6.74831509765e+00  6.74831512892e+00  3.12768850819e-08
2: 1.00000000000e-10  6.74831509765e+00  6.74831635017e+00  1.25252221306e-06
2: 1.00000000000e-11  6.74831509765e+00  6.74832412173e+00  9.02408338543e-06
2: 1.00000000000e-12  6.74831509765e+00  6.74826861058e+00  4.64870678458e-05
2:
2: Test Consistency of Jacobian and its adjoint:
2: |<w,Jv> - <adj(J)w,v>| = 0.00000000e+00
2: |<w,Jv>| = 6.74831510e+00
2: Relative Error = 0.00000000e+00
2:
2: Augmented Lagrangian solver
2: iter fval cnorm gLnorm snorm penalty feasTol optTol #fval #grad #cval subIter
2: 0 3.418638e-04 1.110223e-16 4.200321e-04 1.00e+01 1.26e-01 4.20e-06

mhoemmen commented 8 years ago

Would you mind terribly much adding the following to the CMakeLists.txt file for the test, and rerunning?

ARGS "--globally-reduce-test-result --output-show-proc-rank --output-to-root-rank-only=-1"

Please see tpetra/core/test/ImportExport/CMakeLists.txt for examples. This ensures that all MPI processes get a chance to print. That way, we might be able to catch some exception throws. Also, please try a debug build with Teuchos_ENABLE_DEBUG and Kokkos_ENABLE_DEBUG ON.

Oh wait, oops, the ARGS thing won't work, because it's a stand-alone executable and doesn't use the Teuchos unit test framework.

dridzal commented 8 years ago

I believe this is the wrong example (although it may have a similar issue once it finishes; we just managed to introduce a timeout test failure). A better example to run is in /example/PDE-OPT/poisson. It requires the same Trilinos dependencies, which you can get by enabling all ROL dependencies.

The really strange thing is that it works fine if I use a clang compiler and MPI 1.8.7. It fails with gnu and the same MPI.


From: Mark Hoemmen notifications@github.com Sent: Friday, January 22, 2016 7:34 To: trilinos/Trilinos Cc: Ridzal, Denis Subject: [EXTERNAL] Re: [Trilinos] Segfault in Tpetra's writeDenseFile() (#90)


dridzal commented 8 years ago

We were able to generate the error by using -np 1, i.e., a single processor, so there should be no need for the args. In my previous email, I should have stated that I build with

    -D Trilinos_ENABLE_ROL:BOOL=ON \
    -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON \
    -D Trilinos_ENABLE_TESTS:BOOL=OFF \
    -D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
    -D ROL_ENABLE_TESTS:BOOL=ON \
    -D ROL_ENABLE_EXAMPLES:BOOL=ON \

This is the easiest way to build the examples in question. The best example is contained in

/Trilinos/packages/rol/example/PDE-OPT/poisson



mhoemmen commented 8 years ago

> The really strange thing is that it works fine if I use a clang compiler and MPI 1.8.7. It fails with gnu and the same MPI.

I am using Clang 3.7 and OpenMPI 1.10.1, and the Poisson test you mentioned passes for me. I ran the executable with 1, 2, 3, and 4 MPI processes, with the same result (test passes). I set Teuchos_ENABLE_DEBUG and Kokkos_ENABLE_DEBUG = ON, so it should be able to catch funny array stuff.

I looked on ROL's Dashboard, but the only test that's failing is ROL_example_PDE-OPT_stefan-boltzmann_example_02. It times out on one of the Dashboard machines (Muir). The Poisson test passes on all the other Dashboard machines, including some that run GCC. For example, here is a build that runs GCC 4.8.2, and the test is listed as having passed:

http://testing.sandia.gov/cdash/viewTest.php?onlypassed&buildid=2311782

I totally believe you btw :-) It's just that this makes it hard for me to replicate the issue you observed. Could you post some more details about your build? What version of GCC was it, and on what OS? Was it a debug or release build? Could you post your whole configuration script here?

dridzal commented 8 years ago

Well, now it works with the 4.8.2 and 4.8.4 compilers (I'm using the standard SEMS-provided compiler modules, with SEMS-provided Open MPI 1.8.7). And yes, the CDash errors are gone too. Just so you know that I'm not entirely crazy, here is the CDash failure that started it all:

http://testing.sandia.gov/cdash/testDetails.php?test=32403244&build=2308893

This was with 4.8.2. So, something must have gotten fixed somewhere (Kokkos, Tpetra, who knows). We haven't touched any ROL code that pertains to this previously failing test. Should we attribute the issue to Trilinos gremlins (Trimlins?) and close it?



mhoemmen commented 8 years ago

> This was with 4.8.2. So, something must have gotten fixed somewhere (Kokkos, Tpetra, who knows). We haven't touched any ROL code that pertains to this previously failing test.

It could be that funny MultiVector::getStride thing that was broken for a couple days a week or two ago, which somebody quickly found and fixed. Worked on Clang, didn't work elsewhere (that's how it got through my check-in tests).

> Should we attribute the issue to Trilinos gremlins (Trimlins?) and close it?

"Trimlins" -- I like that ;-P Somebody needs to cartoon up some specimens. I'll close it -- thanks!