trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

TpetraCore_BlockCrsMatrix_MPI_4 failing in ATDM cuda builds #4257

Closed fryeguy52 closed 5 years ago

fryeguy52 commented 5 years ago

CC: @trilinos/tpetra, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

With the merge of PR #4307 onto 'develop' on 2/4/2019, the test TpetraCore_BlockCrsMatrix_MPI_4 seems to be passing in all of the ATDM Trilinos builds as of 2/5/2019. Next: Get PR #4326 merged, which re-enables this test in the Trilinos CUDA PR build ...

Description

As shown in this query, the test TpetraCore_BlockCrsMatrix_MPI_4 is failing in several ATDM CUDA builds.

It is failing with the following output:

p=0: *** Caught standard std::exception of type 'std::logic_error' :

  /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp:2825:

  Throw number = 1

  Throw test that evaluated to true: numBytesOut != numBytes

  unpackRow: numBytesOut = 4 != numBytes = 156.
 [FAILED]  (0.0877 sec) BlockCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_write_UnitTest
 Location: /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Block/BlockCrsMatrix.cpp:859

[white23:102556] *** An error occurred in MPI_Allreduce
[white23:102556] *** reported by process [231079937,0]
[white23:102556] *** on communicator MPI_COMM_WORLD
[white23:102556] *** MPI_ERR_OTHER: known error not in list
[white23:102556] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white23:102556] ***    and potentially your MPI job)
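
The "Throw test that evaluated to true" output above appears to come from Teuchos' TEUCHOS_TEST_FOR_EXCEPTION macro. Below is a minimal sketch of that kind of row-size consistency check; it is not the exact code at Tpetra_Experimental_BlockCrsMatrix_def.hpp:2825, and the function name is hypothetical:

#include <Teuchos_TestForException.hpp>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the row-size check in unpackRow.
void checkUnpackedRowSize (const std::size_t numBytesOut, const std::size_t numBytes)
{
  // Throws std::logic_error with the message seen in the test output
  // whenever the unpacked byte count disagrees with the expected count.
  TEUCHOS_TEST_FOR_EXCEPTION
    (numBytesOut != numBytes, std::logic_error,
     "unpackRow: numBytesOut = " << numBytesOut
     << " != numBytes = " << numBytes << ".");
}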

@kyungjoo-kim can you see if one of these commits may have caused this?

47f9cbe:  Tpetra - fix failing test
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date:   Tue Jan 22 11:24:43 2019 -0700

M   packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp

3e26a55:  Tpetra - fix warning error from mismatched virtual functions
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date:   Mon Jan 21 11:48:32 2019 -0700

M   packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_decl.hpp
M   packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp

Current Status on CDash

The current status of these tests/builds for the current testing day can be found here

Steps to Reproduce

One should be able to reproduce this failure on ride or white as described in:

More specifically, the commands given for ride or white are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON \
 $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
kyungjoo-kim commented 5 years ago

@fryeguy52 I will look at this problem today.

kyungjoo-kim commented 5 years ago

@fryeguy52 I followed the instructions to reproduce the error, but I could not reproduce it on white. Could you double-check whether you can reproduce it? I am using

[kyukim @white11] master > git remote -v 
origin  https://github.com/trilinos/Trilinos.git (fetch)
origin  https://github.com/trilinos/Trilinos.git (push)
[kyukim @white11] master > git branch 
* develop
  master
[kyukim @white11] master > git log 
commit 01fb63caf88db53491f12afe0497c9d8f2cde09f
Merge: 4c4ecbf 3f3b1cc
Author: Mark Hoemmen <mhoemmen@users.noreply.github.com>
Date:   Mon Jan 21 13:12:15 2019 -0700

    Merge pull request #4224 from trilinos/Fix-4220

    MiniTensor: Attempt to fix #4220

This is output from white.

[kyukim @white11] atdm >  bsub -x -Is -q rhel7F -n 16 ctest -j16
***Forced exclusive execution
Job <43094> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on white22>>
Test project /ascldap/users/kyukim/Work/lib/trilinos/build/white/atdm
        Start   1: TpetraCore_Behavior_Default_MPI_4
        Start   2: TpetraCore_Behavior_Named_MPI_4
        Start   3: TpetraCore_Behavior_Off_MPI_4
        Start   4: TpetraCore_Behavior_On_MPI_4
  1/194 Test   #1: TpetraCore_Behavior_Default_MPI_4 ...........................................................   Passed    0.91 sec
        Start   5: TpetraCore_gemv_MPI_1
        Start   6: TpetraCore_gemm_m_eq_1_MPI_1
        Start   7: TpetraCore_gemm_m_eq_2_MPI_1
        Start   8: TpetraCore_gemm_m_eq_5_MPI_1
  2/194 Test   #2: TpetraCore_Behavior_Named_MPI_4 .............................................................   Passed    0.92 sec
        Start   9: TpetraCore_gemm_m_eq_13_MPI_1
        Start  11: TpetraCore_BlockMultiVector2_MPI_1
        Start  14: TpetraCore_BlockView_MPI_1
        Start  15: TpetraCore_BlockOps_MPI_1
  3/194 Test   #3: TpetraCore_Behavior_Off_MPI_4 ...............................................................   Passed    0.93 sec
        Start  10: TpetraCore_BlockMultiVector_MPI_4
  4/194 Test   #4: TpetraCore_Behavior_On_MPI_4 ................................................................   Passed    0.96 sec
        Start  12: TpetraCore_BlockCrsMatrix_MPI_4
  5/194 Test  #15: TpetraCore_BlockOps_MPI_1 ...................................................................   Passed    1.39 sec
        Start  16: TpetraCore_BlockExpNamespace_MPI_1
  6/194 Test  #14: TpetraCore_BlockView_MPI_1 ..................................................................   Passed    2.53 sec
        Start  31: TpetraCore_Core_initialize_where_tpetra_initializes_kokkos_MPI_1
  7/194 Test  #11: TpetraCore_BlockMultiVector2_MPI_1 ..........................................................   Passed    2.83 sec
        Start  32: TpetraCore_Core_ScopeGuard_where_tpetra_initializes_kokkos_MPI_1
  8/194 Test  #10: TpetraCore_BlockMultiVector_MPI_4 ...........................................................   Passed    2.82 sec
        Start  13: TpetraCore_BlockMap_MPI_4
  9/194 Test  #16: TpetraCore_BlockExpNamespace_MPI_1 ..........................................................   Passed    1.63 sec
        Start  33: TpetraCore_Core_initialize_where_user_initializes_kokkos_MPI_1
 10/194 Test  #31: TpetraCore_Core_initialize_where_tpetra_initializes_kokkos_MPI_1 ............................   Passed    0.81 sec
        Start  34: TpetraCore_Core_ScopeGuard_where_user_initializes_kokkos_MPI_1
 11/194 Test  #33: TpetraCore_Core_initialize_where_user_initializes_kokkos_MPI_1 ..............................   Passed    1.17 sec
        Start  39: TpetraCore_issue_434_already_initialized_MPI_1
 12/194 Test  #12: TpetraCore_BlockCrsMatrix_MPI_4 .............................................................   Passed    4.19 sec
...
100% tests passed, 0 tests failed out of 194

Subproject Time Summary:
Tpetra    = 1492.44 sec*proc (194 tests)

Total Test time (real) =  94.37 sec
fryeguy52 commented 5 years ago

@kyungjoo-kim Thanks for looking into this. I will try to reproduce it and watch what it does in tonight's testing.

bartlettroscoe commented 5 years ago

@kyungjoo-kim and @fryeguy52,

I just logged onto 'white' really quickly and pulled Trilinos 'develop' as of version 3ef91e9:

3ef91e9 "Merge Pull Request #4253 from trilinos/Trilinos/Fix-4234"
Author: trilinos-autotester <trilinos-autotester@trilinos.org>
Date:   Thu Jan 24 08:15:36 2019 -0700 (7 hours ago)

and following the instructions here I ran:

$ bsub -x -I -q rhel7F -n 16 \
  ./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-release-debug --enable-packages=TpetraCore --local-do-all 

and it returned:

FAILED (NOT READY TO PUSH): Trilinos: white26

Thu Jan 24 15:06:51 MST 2019

Enabled Packages: TpetraCore

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT_OPENMP => Test case MPI_RELEASE_DEBUG_SHARED_PT_OPENMP was not run! => Does not affect push readiness! (-1.00 min)
1) cuda-9.2-gnu-7.2.0-release-debug => FAILED: passed=193,notpassed=1 => Not ready to push! (16.65 min)

REQUESTED ACTIONS: FAILED

The detailed test results showed:

$ grep -A 100 "failed out of" cuda-9.2-gnu-7.2.0-release-debug/ctest.out 
99% tests passed, 1 tests failed out of 194

Subproject Time Summary:
Tpetra    = 1094.70 sec*proc (194 tests)

Total Test time (real) = 139.04 sec

The following tests FAILED:
         12 - TpetraCore_BlockCrsMatrix_MPI_4 (Failed)
Errors while running CTest
kyungjoo-kim commented 5 years ago

I pulled again and tested, but I cannot reproduce the error with commit 3ea64d1.

bartlettroscoe commented 5 years ago

@kyungjoo-kim said:

I pulled again and tested, but I cannot reproduce the error with commit 3ea64d1.

Let's wait and see if @fryeguy52 can reproduce this on 'white' in his own account and go from there.

kddevin commented 5 years ago

Note that #4293 disabled this test; we'll need to re-enable it when this work is complete. TpetraCore_BlockCrsMatrix_MPI_4_DISABLE

bartlettroscoe commented 5 years ago

@kddevin said:

Note that #4293 disabled this test; we'll need to re-enable it when this work is complete. TpetraCore_BlockCrsMatrix_MPI_4_DISABLE

PR #4293 only disables that test for the CUDA PR build, not the ATDM Trilinos builds. (There is no relationship between these two sets of builds and that is on purpose.)

The question is whether this failing test is something that should be fixed before the ATDM APPs get an updated version of Trilinos. Right now it is listed as ATDM Sev: Blocker. (But my guess is that EMPIRE is not being impacted by this, because we would have heard about it.) Is this a real defect in Tpetra, or just a problem with the test?

kyungjoo-kim commented 5 years ago

From the failed test message,

  unpackRow: numBytesOut = 4 != numBytes = 156.
 [FAILED]  (0.106 sec) BlockCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_write_UnitTest
 Location: /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Block/BlockCrsMatrix.cpp:859

[waterman3:115046] *** An error occurred in MPI_Allreduce
[waterman3:115046] *** reported by process [3797417985,0]
[waterman3:115046] *** on communicator MPI_COMM_WORLD
[waterman3:115046] *** MPI_ERR_OTHER: known error not in list
[waterman3:115046] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[waterman3:115046] ***    and potentially your MPI job)

I first examined the packing and unpacking routines for a mistake in the DualView sync, since the input array to the unpacking, imports, is all zeros. Then I saw the Allreduce error. I am not sure which error triggers which. It is possible that MPI_Allreduce hits an error and that corrupts the importer; the reverse is also possible: something is not synced from the device, and that causes the MPI error. In another BlockCrs test, creating a Map (which is not BlockCrs code itself; it just happens inside the BlockCrs test) also fails in MPI_Allreduce.
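
For context, the sync hazard described above looks roughly like the sketch below. This is a minimal, hypothetical illustration (not Tpetra's actual pack/unpack code), and it assumes the non-templated Kokkos::DualView methods modify_device(), sync_host(), view_device(), and view_host(): if the device copy is modified but never synced to host, a host-side consumer such as an MPI send sees stale, zero-initialized bytes.

#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main (int argc, char* argv[])
{
  Kokkos::ScopeGuard kokkosScope (argc, argv);
  {
    // A small export buffer, sized like the failing row above (156 bytes).
    Kokkos::DualView<char*> exports ("exports", 156);

    // Pack on device and mark the device copy as modified.
    exports.modify_device ();
    auto exports_d = exports.view_device ();
    Kokkos::parallel_for ("pack", exports_d.extent (0),
      KOKKOS_LAMBDA (const int i) {
        exports_d(i) = static_cast<char> (i);
      });

    // Without this sync, a host-side reader (e.g., handing the host
    // pointer to MPI) would still see the original zero-initialized
    // bytes: the symptom reported above, where 'imports' is all zeros.
    exports.sync_host ();

    auto exports_h = exports.view_host ();
    // ... exports_h.data() would be passed to MPI from here ...
  }
  return 0;
}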

kyungjoo-kim commented 5 years ago

@bartlettroscoe PR #4307 will fix the BlockCrs unit test failures. After the PR is merged, would you please re-enable the test?

bartlettroscoe commented 5 years ago

@kyungjoo-kim said:

@bartlettroscoe PR #4307 will fix the BlockCrs unit test failures. After the PR is merged, would you please re-enable the test?

Thanks for the fix!

Someone will need to revert PR #4293 after we confirm that these tests are fixed in the ATDM builds (where this test was never disabled; PR #4293 only disabled it in the Trilinos PR build controlled by the @trilinos/framework team).

mhoemmen commented 5 years ago

@kyungjoo-kim Can we reenable that test now?

bartlettroscoe commented 5 years ago

With the merge of PR #4307 onto 'develop' on 2/4/2019, the test TpetraCore_BlockCrsMatrix_MPI_4 seems to be passing in all of the ATDM Trilinos builds as of 2/5/2019. See the table below.

I will leave this open until the CUDA PR testing gets this test enabled again by reverting PR #4293.


Tests with issue trackers Passed: twip=6 (Testing day 2019-02-05)

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 17 | #4257 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-opt | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 17 | #4257 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-release-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 15 | #4257 |
| white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 20 | #4257 |
| white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 10 | 17 | #4257 |
| white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 11 | 18 | #4257 |
bartlettroscoe commented 5 years ago

FYI: I created the revert PR #4326 to re-enable this test in the Trilinos CUDA PR build. Just need someone to approve this PR and get it merged. Then we can close this issue.

mhoemmen commented 5 years ago

Thanks Ross! I just approved the PR.