
TpetraCore_cudaMpi_CanaryTest_MPI_2 test needs to check Behavior, not just CMake option (ATDM Trilinos 'vortex' builds) #6796

Closed mhoemmen closed 4 years ago

mhoemmen commented 4 years ago

Bug Report

@trilinos/tpetra

Description

The cudaMpi_CanaryTest test in tpetra/core/test/Comm builds and runs based only on the Tpetra_ASSUME_CUDA_AWARE_MPI CMake option. However, the CMake option only sets Tpetra's default behavior. Users (and the Dashboard tests) can always override that behavior by setting the TPETRA_ASSUME_CUDA_AWARE_MPI environment variable.
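
A minimal sketch of the missing run-time check, using the Tpetra::Details::Behavior query mentioned later in this thread (the header name is an assumption here):

```c++
#include "Tpetra_Details_Behavior.hpp"

// Query the effective run-time setting instead of relying only on the
// Tpetra_ASSUME_CUDA_AWARE_MPI CMake option baked in at configure time.
// Behavior reflects both the configure-time default and any
// TPETRA_ASSUME_CUDA_AWARE_MPI environment-variable override.
bool assumeCudaAwareMpiAtRunTime () {
  return Tpetra::Details::Behavior::assumeMpiIsCudaAware ();
}
```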

Steps to Reproduce

See instructions for building and running tests on ATS-2 machine 'vortex':

  1. Successfully finish building TpetraCore's tests with CUDA enabled and the CMake option Tpetra_ASSUME_CUDA_AWARE_MPI explicitly disabled (set to OFF). (It often gets enabled by default, since Tpetra attempts to query ompi_info if it exists.)

  2. Set the environment variable TPETRA_ASSUME_CUDA_AWARE_MPI=1 and run cudaMpi_CanaryTest. It should fail, since the test doesn't check Tpetra::Details::Behavior.

Thanks to @bartlettroscoe and @e10harvey for pointing this out :-).

bartlettroscoe commented 4 years ago

@mhoemmen,

To make this concrete, as shown in this query, the test:

is failing in the builds:

As shown in this query, the errors show:

[vortex2:26461] *** Process received signal ***
[vortex2:26461] Signal: Segmentation fault (11)
[vortex2:26461] Signal code: Address not mapped (1)
[vortex2:26461] Failing at address: 0x200073200000
<< Rank 0: Generating lwcore_cpu.79271_4.0 on vortex2 Tue Feb 18 05:15:29 MST 2020 (LLNL_COREDUMP_FORMAT_CPU=lwcore) >>
<< Rank 0:  Generated lwcore_cpu.79271_4.0 on vortex2 Tue Feb 18 05:15:29 MST 2020 in 0 secs >>
<< Rank 0: Waiting 60 secs before aborting task on vortex2 Tue Feb 18 05:15:29 MST 2020 (LLNL_COREDUMP_WAIT_FOR_OTHERS=60) >>
<< Rank 0:  Waited 60 secs -> now aborting task on vortex2 Tue Feb 18 05:16:29 MST 2020 (LLNL_COREDUMP_KILL=task) >>
[vortex2:26461] -----------------------
[vortex2:26461] -----------------------
[vortex2:26461] *** End of error message ***
mhoemmen commented 4 years ago

@trilinos/tpetra I probably won't be able to fix this today or tomorrow; could someone else please fix this?

bartlettroscoe commented 4 years ago

No big rush to fix. I have added entries for this to the TrilinosATDMStatus/*.csv files, and these failures are being sorted to the bottom of the list.

kddevin commented 4 years ago

I will take a look.

I don't understand @mhoemmen's scenario. If the CMake variable Tpetra_ASSUME_CUDA_AWARE_MPI=OFF, the test will not build. Setting the environment variable to ON should not change that; the test will still not build or run (even though maybe it should have). So it cannot fail in this case.

I also do not understand all of @bartlettroscoe's cases. Looking at the CDash page he cites, I see that the Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg build infers that CUDA-aware MPI is being used and that the Canary test is built (and then fails). (Grep for Canary at https://testing-dev.sandia.gov/cdash/buildSummary.php?buildid=5228779 .) Is the environment variable changed separately from CMake in this test? If so, where is that reported on the dashboard?

More strangely, the CMake output for the Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi build does not report that the Canary test is being built. (Grep for Canary at https://testing-dev.sandia.gov/cdash/buildSummary.php?buildid=5235053 produces no matches.) I do not understand why the Canary test is running for this configuration.

kddevin commented 4 years ago

@mhoemmen Why does Tpetra have both Details::assumeMpiIsCudaAware and Details::Behavior::assumeMpiIsCudaAware?

kddevin commented 4 years ago

The fix is in https://github.com/trilinos/Trilinos/tree/tpetra_6796. I would like to understand the test environment questions above (and complete a bit more testing) before PRing the fix.

bartlettroscoe commented 4 years ago

I would like to understand the test environment questions above (and complete a bit more testing) before PRing the fix.

@kddevin, most of this has been driven by the advice of @jjellio and @mhoemmen. They would know the details.

bartlettroscoe commented 4 years ago

More strangely, the cmake output for test Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi does not report that the Canary test is being built. (Grep for Canary at https://testing-dev.sandia.gov/cdash/buildSummary.php?buildid=5235053 produces no matches.) I do not understand why the Canary test is running for this configuration.

It is right here, no? CDash is kind of hard to navigate if you are not used to it. Kitware could improve it a lot to help people understand what view they are looking at on a given page.

jjellio commented 4 years ago

With the new ATS-2 testing, the framework is going to evaluate both TPETRA_ASSUME_CUDA_AWARE_MPI=0 and TPETRA_ASSUME_CUDA_AWARE_MPI=1.

Those should be showing up as different build IDs in CDash.

Specifically: the testing tool will build Trilinos with whatever Vortex reports for CUDA-aware MPI (currently true by default). That is, it will set the default value for Tpetra's variable to true.

When it runs the test suite, it will now make two passes. It will explicitly set TPETRA_ASSUME_CUDA_AWARE_MPI=0, run the tests, and report those results to the regular build ID for vortex.

It will then set TPETRA_ASSUME_CUDA_AWARE_MPI=1 and rerun all tests, then report those results to something with 'cuda-aware' in the name.

mhoemmen commented 4 years ago

@kddevin wrote:

If the CMake variable Tpetra_ASSUME_CUDA_AWARE_MPI=OFF, the test will not build. Setting the environment variable to ON should not change that; the test will still not build or run (even though maybe it should have). So it cannot fail in this case.

The problem is if the CMake variable Tpetra_ASSUME_CUDA_AWARE_MPI is set to ON, but the environment variable TPETRA_ASSUME_CUDA_AWARE_MPI is set to OFF. The test would then build and run, and (incorrectly) assume that MPI is CUDA aware. The fix is for the test also to check the environment variable at run time. If we do that in the test, then it's harmless for the test to build in any build with both CUDA and MPI enabled.
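
A sketch of what that run-time guard might look like; the function and stream names are hypothetical, not the test's actual code:

```c++
#include <iostream>
#include "Tpetra_Details_Behavior.hpp"

// Hypothetical early exit for the canary test: if the effective run-time
// setting says MPI is not CUDA aware, skip the CUDA-buffer communication
// instead of segfaulting on device pointers.
void runCanaryTestBody (std::ostream& out) {
  if (! Tpetra::Details::Behavior::assumeMpiIsCudaAware ()) {
    out << "CUDA-aware MPI not assumed at run time; skipping test body.\n";
    return;
  }
  // ... existing CUDA-buffer MPI exchange would go here ...
}
```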

mhoemmen commented 4 years ago

@kddevin wrote:

Why does Tpetra have both Details::assumeMpiIsCudaAware and Details::Behavior::assumeMpiIsCudaAware?

It shouldn't ;-). Tpetra should get these sorts of things from Details::Behavior.
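
One way to consolidate the two, sketched under the assumption that the free function's signature matches the Behavior method:

```c++
#include "Tpetra_Details_Behavior.hpp"

namespace Tpetra {
namespace Details {

// Hypothetical consolidation: forward the older free function to the
// Behavior query, so a single code path reads the CMake default and the
// TPETRA_ASSUME_CUDA_AWARE_MPI environment variable.
bool assumeMpiIsCudaAware () {
  return Behavior::assumeMpiIsCudaAware ();
}

} // namespace Details
} // namespace Tpetra
```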

jjellio commented 4 years ago

On Vortex currently, ompi_info is reporting that it is CUDA aware, so the CMake variable is getting set to true: Tpetra_ASSUME_CUDA_AWARE_MPI=true.

The testing framework will then attempt to run all tests with TPETRA_ASSUME_CUDA_AWARE_MPI=0 and TPETRA_ASSUME_CUDA_AWARE_MPI=1.

So it sounds like this test would need to be marked 'expected failure' for the non-CUDA-aware build. But this is an interesting case: when I put together the jsrun wrapper, I never thought of unit tests that could be intended only for CUDA-aware MPI.

The traditional Trilinos way of doing things would have been to do two full builds: one with CMake's Tpetra_ASSUME_CUDA_AWARE_MPI=false and one with Tpetra_ASSUME_CUDA_AWARE_MPI=true. But since CUDA awareness is just a runtime option, that would be extremely inefficient, so we (Ross, Evan, and myself) cooked up the system in place, which uses the same build to run the unit tests twice (saving another full build).

kddevin commented 4 years ago

Yes, I can and have already begun modifying the test to check the environment variable. But changing the environment variable separately from the CMake configuration is a risky idea. For example:

It is right here, no? CDash is kind of hard to navigate if you are not used to it. Kitware could improve it a lot to help people understand what view they are looking at on a given page.

@bartlettroscoe Actually, no, it is not there. Yes, your link shows the test failure. But no, the build page that is associated with your link (following the "Build" link on your page: https://testing-dev.sandia.gov/cdash/buildSummary.php?buildid=5235053) does not show the Canary test being built, so it should not have been run in this test. @jjellio implies we are reusing a build with a different environment in this test. That means the data on CDash is inaccurate.

Using a single build with different environment variables is flawed for several reasons.

The environment variables are meant for developer convenience, but they have the potential to create inconsistencies. For testing, we should have better archiving and more precise control of the environment. I'd vote for paying the price of an additional build with clear configuration options.

jjellio commented 4 years ago

Karen, the Tpetra CMake option only controls the default behavior; BOTH code paths are compiled regardless. That is why you want to do it this way: at runtime, ATDM evaluates both ENV=1 and ENV=0. All of this is a runtime change to MPI, not a compile-time one.

The pain point on the SNL testbeds (and on LLNL machines a while back) has been that different builds were setting different defaults, but at runtime the effective setting could differ from that default.

In some cases, you ended up with weird segfaults (the library's default was set to true, but the runtime setting was false). In other cases, you ended up with unexpected performance issues.

Compiling one way or the other won't save you; it will only change the default that Tpetra hardcodes. The point of the ATS-2 testing methodology is to make sure that we test both 'default = off, runtime = off' and 'default = on, runtime = on' (referring to CUDA awareness).
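
For illustration, a sketch of the default-plus-override resolution order described here (the accepted string values are assumptions; Behavior implements the real parsing):

```c++
#include <cstdlib>
#include <string>

// The CMake option Tpetra_ASSUME_CUDA_AWARE_MPI only sets the
// configure-time default; the TPETRA_ASSUME_CUDA_AWARE_MPI environment
// variable, if set, overrides it at run time.
bool effectiveCudaAwareMpi (const bool configureTimeDefault) {
  const char* env = std::getenv ("TPETRA_ASSUME_CUDA_AWARE_MPI");
  if (env == nullptr || env[0] == '\0') {
    return configureTimeDefault; // no run-time override
  }
  const std::string val (env);
  return val == "1" || val == "ON" || val == "TRUE" || val == "YES";
}
```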

EDIT: To add, you can see the state of the Tpetra ENV by looking at the test output (the first three lines). I recently added this exact feature because developers need to know what the ENV was. Simply load the ATS-2 ENV, then copy/paste the 'AFTER:' line, which shows the full jsrun command plus the ENV:

export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun  '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/rol/example/PDE-OPT/helmholtz/ROL_example_PDE-OPT_helmholtz_example_02.exe' 'PrintItAll'

Full output from CDash here:

WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0
BEFORE: jsrun  '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/rol/example/PDE-OPT/helmholtz/ROL_example_PDE-OPT_helmholtz_example_02.exe' 'PrintItAll'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun  '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/rol/example/PDE-OPT/helmholtz/ROL_example_PDE-OPT_helmholtz_example_02.exe' 'PrintItAll'
out_file=4dd1a321b4bcc5c1c294a7bea8279523.out
Warning: PAMI CUDA HOOK disabled
kddevin commented 4 years ago

@jjellio I don't see this kind of info anywhere on the CDash page that @bartlettroscoe referenced above. The command line there is

/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/Trilinos/cmake/std/atdm/ats2/trilinos_jsrun "-p" "2" "--rs_per_socket" "4" "/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Comm/TpetraCore_cudaMpi_CanaryTest.exe" 

Nothing about environment variables.

The CDash pages seem to be in flux. Now the Build link off @bartlettroscoe's page doesn't show any build information, so a person trying to diagnose a failure has no information. I guess that's better than showing incorrect information, but not much. Sigh.

I'll finish testing my changes and PR them. I'll also stick with my claim that if we're changing environment variables while testing, we cannot use the equivalent CMake option in any CMakeLists.txt, for fear of inconsistencies. And I'll argue that everything about a run (the build, the environment, etc.) needs to be archived with the test results for reproducibility.

jjellio commented 4 years ago

The build in Ross's post was from Feb 13th. My jsrun modification may not have made it into develop by then; there was a problem on Vortex around Feb 13-14 that blocked Ross from testing my change.

Anything posted this week should have the clear output showing what jsrun did.

mhoemmen commented 4 years ago

@kddevin wrote:

Allowing both CMake options and environment variables to control the same features can lead to inconsistent build-vs-run environments, as we have seen here.

I'd like to make sure we understand the reason why we have the environment variable.

  1. Tpetra attempts to detect this option at configure time, because users should get reasonable defaults. Every CMake option that users need to hard-code adds technical debt to applications' build systems.
  2. Tpetra has an option to override the configure-time choice via environment variable, because of cross compilation. For instance, ompi_info might give the wrong answer at configure time, because it's a different OpenMPI binary than you plan to use at run time.

If I had to choose between one or the other, then I would prefer to get rid of the CMake option and keep the environment variable. This is also valuable because even if MPI knows what to do with CUDA buffers, that doesn't necessarily mean that it's faster. We need to be able to do benchmarks with both options, and it's a lot easier to do that if we don't have to rebuild.

jjellio commented 4 years ago

Please keep the runtime option... this is a runtime option to MPI. There is no reason to build twice, since Tpetra can choose at initialization time which path it will follow. We (the apps) are testing with -M -gpu and without... we want the best performance.

kddevin commented 4 years ago

I would be fine with removing the CMake option and any CMake variable that one could use in CMakeLists.txt to control build behavior. That would prevent running in a state that contradicts the build, as was happening with this test.

This pattern would hold for any CMake option that is also an environment variable option, not only for Tpetra_ASSUME_CUDA_AWARE_MPI.

bartlettroscoe commented 4 years ago

FYI: As shown in this query and this query (click "Show Matching Output" in the upper right to see the failures), it looks like the test:

has a similar issue in the 'vortex' builds:

showing the error:

0. Behavior_Default_UnitTest ... 
 dbg==debug_default = 1 == true : passed
 verb==verbose_default = 1 == true : passed
 cuda_aware_mpi==cuda_aware_mpi_default = 0 == true : FAILED ==> /vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Behavior/Behavior_Default.cpp:82
 [FAILED]  (8.02e-05 sec) Behavior_Default_UnitTest
 Location: /vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Behavior/Behavior_Default.cpp:62

So I will lump this test failure under this issue as well.

kddevin commented 4 years ago

The Behavior_Default test is a different issue, though caused by the same risky design of allowing configuration parameters and environment variables to differ. I will create a new issue for it.

bartlettroscoe commented 4 years ago

But no, the build page that is associated with your link (following the "Build" link on your page: https://testing-dev.sandia.gov/cdash/buildSummary.php?buildid=5235053) does not show the Canary test being built, so it should not have been run in this test.

@kddevin, just to make sure our terminology is clear: CDash does not show any record of a target getting built unless there are build warnings or errors. By "does not show the Canary test being built," I will assume you mean the configure output showing which tests are being added or not? Otherwise, what do you mean? As for showing the configure output, we can redo the configure and build (which will be very fast, because nothing changes from the prior non-CUDA-aware build) so that you will see that info on CDash.

kddevin commented 4 years ago

Done

bartlettroscoe commented 4 years ago

@kddevin, the policy is to leave open until we have confirmation on CDash that the tests are fixed. Otherwise, what is the process to follow up to make sure that things are resolved?

Actually, Reed M. is going to be working on a process to automate this follow-up somewhat, as per #3887.

bartlettroscoe commented 4 years ago

As shown in this query, the test TpetraCore_cudaMpi_CanaryTest_MPI_2 was failing every day in the 'ats2' builds and now passes in every build today, with results shown below:

| Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Passed | 540ms | 1s 80ms | Completed | 2020-02-26T02:07:36 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Passed | 600ms | 1s 200ms | Completed | 2020-02-25T02:07:51 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 1s 630ms | 2m 3s 260ms | Completed (Failed) | 2020-02-24T02:07:06 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 2s 110ms | 2m 4s 220ms | Completed (Failed) | 2020-02-23T02:07:08 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 1s 620ms | 2m 3s 240ms | Completed (Failed) | 2020-02-22T02:07:08 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Passed | 2s 10ms | 4s 20ms | Completed | 2020-02-26T07:30:51 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 2s 100ms | 2m 4s 200ms | Completed (Failed) | 2020-02-24T05:28:32 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 2s 170ms | 2m 4s 340ms | Completed (Failed) | 2020-02-23T07:20:45 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 2s 110ms | 2m 4s 220ms | Completed (Failed) | 2020-02-22T05:29:54 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Passed | 700ms | 1s 400ms | Completed | 2020-02-25T02:03:40 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 1s 880ms | 2m 3s 760ms | Completed (Failed) | 2020-02-24T02:02:44 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 2s 80ms | 2m 4s 160ms | Completed (Failed) | 2020-02-23T02:02:45 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 2s 210ms | 2m 4s 420ms | Completed (Failed) | 2020-02-22T02:03:51 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Passed | 1s 30ms | 2s 60ms | Completed | 2020-02-26T07:56:51 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Passed | 560ms | 1s 120ms | Completed | 2020-02-25T06:02:04 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 1s 760ms | 2m 3s 520ms | Completed (Failed) | 2020-02-24T05:27:53 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 2s 50ms | 2m 4s 100ms | Completed (Failed) | 2020-02-23T05:20:52 MST | 2 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi | TpetraCore_cudaMpi_CanaryTest_MPI_2 | Failed | 1m 1s 650ms | 2m 3s 300ms | Completed (Failed) | 2020-02-22T05:29:05 MST | 2 |

Closing as fixed.

Thanks @kddevin!