trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.21k stars 564 forks source link

Test TpetraCore_ImportBug5430_MPI_4 randomly timing out in Trilinos-atdm-white-ride-cuda-opt build on 'white' and 'ride' #2681

Closed bartlettroscoe closed 6 years ago

bartlettroscoe commented 6 years ago

CC: @trilinos/tpetra, @fryeguy52

Next Action Status

Commit e376bd9 pushed on 5/9/2018 fixed this. The test passed with no segfault in the build Trilinos-atdm-white-ride-cuda-opt on 5/9/2018.

Description

The query:

seems to show that the test TpetraCore_ImportBug5430_MPI_4 is randomly hanging for in the build Trilinos-atdm-white-ride-cuda-opt. When the test passes, it finished in less than 2 sec. And then it times out every so often at 10 minutes. It timed out 8 times out of 44 days.

That suggests a race condition involving threads or something?

Steps to reproduce

Anyone with access to the machines 'white' or 'ride' should be able to build and run this test with as described at:

Specifically, one should be able to do:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-opt

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -R TpetraCore_ImportBug5430_MPI_4

However, given that this test is failing randomly, one may need to run it a bunch of times in hope of getting it to time out. Also, you might need to run the entire Tpetra test suite many times in order to reproduce this behavior.

bartlettroscoe commented 6 years ago

NOTE: Now that the issue with the mass of segfaulting tests in #2454 seems to have been resolved, this looks to be the only failing test for the build Trilinos-atdm-white-ride-cuda-opt as shown at:

Therefore, it would be nice to get this test fixed fairly quickly, or temporarily disable the test so that we can promote this build to the "ATDM" CDash Track/Group as described at:

mhoemmen commented 6 years ago

@bartlettroscoe I'm swamped with non-Trilinos things this weekend. I suspect the test is just slow because it UVMs all over the place. If you really need to disable it, please go ahead, but we @trilinos/tpetra folks really need to look at it. I'm worried about seeing the Tpetra tests get slower over time on CUDA, with no obvious reason why.

mhoemmen commented 6 years ago

@bartlettroscoe It may also make sense to increase the timeout temporarily on this platform, given the value of knowing whether the test passed.

bartlettroscoe commented 6 years ago

@mhoemmen, increasing the timeout likely will not help. As I tried to explain above, the history for this test shown in that CDash query shows that it either passes in under 2 sec, or it times out at 10 minutes. For example, today, that test just passed in 1s 40ms yesterday as shown :

*https://testing-vm.sandia.gov/cdash/testDetails.php?test=46193739&build=3484009

That suggests a race condition with a hang or something?

Therefore, the only way to avoid this problem is to fix the test or disable the test. It is okay if you can't fix this right away. We can just temporarily disable it and then someone can fix this later.

mhoemmen commented 6 years ago

@bartlettroscoe OK to disable for now -- we'll take a look.

bartlettroscoe commented 6 years ago

@bartlettroscoe OK to disable for now -- we'll take a look.

@mhoemmen, okay, I will. Thanks.

tjfulle commented 6 years ago

@mhoemmen , I have some time for Tpetra this week, do you want me to look at this?

mhoemmen commented 6 years ago

@tjfulle wrote:

@mhoemmen , I have some time for Tpetra this week, do you want me to look at this?

Please do -- it would be super helpful. Thanks! :-D

My guess is that it's some UVM-related issue, where the code nondeterministically thrashes UVM pages back and forth between host and device.

tjfulle commented 6 years ago

I created a brute-force script to investigate the timeout. It runs the test N times and tracks the number of failures and timeouts:

#!/bin/bash
F=TpetraCore_ImportBug5430_MPI_4
I=1
N=100
FAILED=0
TIMEDOUT=0
rm -rf Testing-*
while [[ $I -le $N ]]
do
  printf "\rRunning number $I"
  rm -rf Testing
  mkdir Testing
  timeout 30s bsub -x -Is -q rhel7F -n 16 ctest -R $F &> $F-$I.log
  returncode=$?
  if [[ $returncode == 124 ]]; then
    # Timed out!
    mv Testing Testing-$I.timedout && mv $F-$I.log Testing-$I.timedout
    TIMEDOUT=$(($TIMEDOUT+1))
  elif [[ $returncode != 0 ]]; then
    mv Testing Testing-$I.failed && mv $F-$I.log Testing-$I.failed
    FAILED=$(($FAILED+1))
  else
    mv Testing Testing-$I && mv $F-$I.log Testing-$I
  fi
  I=$(($I+1))
done
printf "\n${FAILED} failed and ${TIMEDOUT} timed out out of ${N}!\n"

The number of time outs is random, the last time I ran it 3 timed out out of 100. The time before, only 1 timed out. All of the tests the "pass" segfault:

$cat Testing/Temporary/LastTest.log
Start testing: May 08 10:00 MDT
----------------------------------------------------------
74/161 Testing: TpetraCore_ImportBug5430_MPI_4
74/161 Test: TpetraCore_ImportBug5430_MPI_4
Command: "/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpiexec" "--mca" "orte_abort_on_non_zero_status" "0" "-np" "4" "-map-by" "socket:PE=4" "<p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe"
Directory: <p>/packages/tpetra/core/test/ImportExport
"TpetraCore_ImportBug5430_MPI_4" start time: May 08 10:00 MDT
Output:
----------------------------------------------------------
End Result: TEST PASSED
[ride16:75760] *** Process received signal ***
[ride16:75760] Signal: Segmentation fault (11)
[ride16:75760] Signal code: Address not mapped (1)
[ride16:75760] Failing at address: 0x2007bc45700
[ride16:75760] [ 0] [0x100000050478]
[ride16:75760] [ 1] [0x3ffffbeb89e0]
[ride16:75760] [ 2] [ride16:75758] *** Process received signal ***
[ride16:75758] Signal: Segmentation fault (11)
[ride16:75758] Signal code: Address not mapped (1)
[ride16:75758] Failing at address: 0x2002ef02000
[ride16:75758] [ 0] [0x100000050478]
[ride16:75758] [ 1] [0x3fffd8977290]
[ride16:75758] [ 2] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75760] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75760] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75758] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75760] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75758] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75760] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75758] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75758] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75760] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75760] [ 8] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75758] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75760] [ 9] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75758] [ 8] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75760] [10] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75758] [ 9] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75760] *** End of error message ***
/lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75758] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75758] *** End of error message ***
[ride16:75759] *** Process received signal ***
[ride16:75759] Signal: Segmentation fault (11)
[ride16:75759] Signal code: Address not mapped (1)
[ride16:75759] Failing at address: 0x200357a3a00
[ride16:75759] [ 0] [0x100000050478]
[ride16:75759] [ 1] [0x3fffc22dd310]
[ride16:75759] [ 2] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75759] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75759] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75759] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75759] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75759] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75759] [ 8] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75759] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75759] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75759] *** End of error message ***
[ride16:75761] *** Process received signal ***
[ride16:75761] Signal: Segmentation fault (11)
[ride16:75761] Signal code: Address not mapped (1)
[ride16:75761] Failing at address: 0x2004190aa50
[ride16:75761] [ 0] [0x100000050478]
[ride16:75761] [ 1] [0x3fffe5acca10]
[ride16:75761] [ 2] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75761] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75761] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75761] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75761] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75761] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75761] [ 8] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75761] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75761] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75761] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 75758 on node ride16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
<end of output>
Test time =   1.95 sec
----------------------------------------------------------
Test Pass Reason:
Required regular expression found.Regex=[End Result: TEST PASSED
]
"TpetraCore_ImportBug5430_MPI_4" end time: May 08 10:00 MDT
"TpetraCore_ImportBug5430_MPI_4" time elapsed: 00:00:01
----------------------------------------------------------

End testing: May 08 10:00 MDT

Tpetra =   7.79 sec*proc

ctest does not report any of the tests as failed. Hundreds of core.[0-9]+ files were created in the directory where the test executable lives.

The tests that timeout don't seem to start up at all:

$ cat Testing-3.timedout/TpetraCore_ImportBug5430_MPI_4-3.log
***Forced exclusive execution
Job <318385> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on ride14>>
Test project /home/tjfulle/Software/VersionControl/github.com/tjfulle/Trilinos/develop/builds/gcc-5.4.0-openmpi-1.10.4-static-opt
    Start 74: TpetraCore_ImportBug5430_MPI_4
Terminated

and

$ cat Testing-3.timedout/Temporary/LastTest.log.tmp
Start testing: May 08 11:04 MDT
----------------------------------------------------------
tjfulle commented 6 years ago

If instead I run the test directly (not through ctest), all of the "passed" tests segfault as before (but the script recognizes them as failed since the return code comes from the executable and not ctest). The log file for the timed out tests also show the test as "passed":

cat cat TpetraCore_ImportBug5430_MPI_4-57.timedout
***Forced exclusive execution
Job <318546> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on ride11>>
End Result: TEST PASSED
[ride11:15404] *** Process received signal ***
[ride11:15404] Signal: Segmentation fault (11)
[ride11:15404] Signal code: Address not mapped (1)
[ride11:15404] Failing at address: 0x20039770230
[ride11:15404] [ 0] [0x100000050478]
[ride11:15404] [ 1] [0x3fffe8484660]
[ride11:15404] [ 2] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15404] [ 3] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15404] [ 4] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15404] [ 5] [ride11:15403] *** Process received signal ***
[ride11:15403] Signal: Segmentation fault (11)
[ride11:15403] Signal code: Address not mapped (1)
[ride11:15403] Failing at address: 0x20006f71560
./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15404] [ 6] [ride11:15403] [ 0] [0x100000050478]
[ride11:15403] [ 1] [0x3fffe20a9fc0]
[ride11:15403] [ 2] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15404] [ 7] [ride11:15400] *** Process received signal ***
./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15404] [ 8] [ride11:15400] Signal: Segmentation fault (11)
[ride11:15400] Signal code: Address not mapped (1)
[ride11:15400] Failing at address: 0x2000fbc4440
[ride11:15400] [ 0] [0x100000050478]
[ride11:15400] [ 1] [0x3fffeae87750]
[ride11:15400] [ 2] ./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15404] [ 9] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15403] [ 3] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15404] [10] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15403] [ 4] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15404] *** End of error message ***
[ride11:15402] *** Process received signal ***
[ride11:15402] Signal: Segmentation fault (11)
[ride11:15402] Signal code: Address not mapped (1)
[ride11:15402] Failing at address: 0x200136d0ee0
[ride11:15402] [ 0] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15400] [ 3] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15403] [ 5] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15400] [ 4] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15403] [ 6] [0x100000050478]
[ride11:15402] [ 1] [0x3fffd2474cf0]
[ride11:15402] [ 2] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15400] [ 5] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15402] [ 3] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15403] [ 7] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15400] [ 6] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15403] [ 8] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15402] [ 4] ./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15403] [ 9] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15400] [ 7] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15402] [ 5] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15400] [ 8] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15403] [10] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15402] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15403] *** End of error message ***
./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15400] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15400] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15400] *** End of error message ***
./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15402] [ 7] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15402] [ 8] ./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15402] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15402] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15402] *** End of error message ***
mpiexec: Forwarding signal 18 to job
mpiexec: Forwarding signal 18 to job
mpiexec: Forwarding signal 18 to job

There are a few dozen more lines with mpiexec: Forwarding signal 18 to job. The don't appear in the non-timed out log files, so I guess they come from timeout shutting down the job.

tjfulle commented 6 years ago

I haven't had time to look beyond this. If @bartlettroscoe or @mhoemmen see any smoking guns in the log files from the failed or timed out runs, I'm all ears!

mhoemmen commented 6 years ago

@tjfulle Is this related to https://github.com/trilinos/Trilinos/issues/2698 somehow?

mhoemmen commented 6 years ago

@tjfulle I am able to replicate the "PASSED" but segfault issue on my GPU workstation (K80).

tjfulle commented 6 years ago

@mhoemmen, weird, you'd think ctest would detect that.

How do you enable verbose mode to get the diagnostics shown in #2698?

mhoemmen commented 6 years ago

Ooo, this could be a destructor issue. Tests that run in main() may manifest this.

bartlettroscoe commented 6 years ago

Darn, you are right. The "passing" tests shown today, for example, at:

show a segfault:


End Result: TEST PASSED
[ride10:120546] *** Process received signal ***
[ride10:120546] Signal: Segmentation fault (11)
[ride10:120546] Signal code: Address not mapped (1)
[ride10:120546] Failing at address: 0x2005c2af3f0
[ride10:120546] [ 0] [0x100000050478]
[ride10:120546] [ 1] [0x3fffd1fb2230]
[ride10:120546] [ 2] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038c414]
[ride10:120546] [ 3] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f7740]
[ride10:120546] [ 4] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101f7ae0]
[ride10:120546] [ 5] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017a928]
[ride10:120546] [ 6] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a203c]
[ride10:120546] [ 7] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10183088]
[ride10:120546] [ 8] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x10167290]
[ride10:120546] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride10:120546] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride10:120546] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 120546 on node ride10 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Really this test should should be changed to use TRIBITS_ADD_ADVANCED_TEST() and then take advantage of grepping for End Result: TEST PASSED but also require a non-zero return code. That would show this a failing.

bartlettroscoe commented 6 years ago

It must be that every so often that one of the MPI ranks segfaults and that causes a hang in one of the other MPI ranks.

tjfulle commented 6 years ago

@bartlettroscoe , I think you are right. I've run the test hundreds of times today, hangs about 3% of the time.

tjfulle commented 6 years ago

I can run the executable through valgrind to see if anything useful comes up.

mhoemmen commented 6 years ago

@tjfulle That was literally the issue: The test lives at main() scope, which means that Kokkos::finalize might get called before some Kokkos stuff gets deallocated. When I moved the test out of main() scope, it passed without segfaults.

I'll open a PR soon.

mhoemmen commented 6 years ago

@bartlettroscoe @tjfulle PR #2702 proposes a fix for this issue. I tested it with CUDA.

mhoemmen commented 6 years ago

@tjfulle wrote:

How do you enable verbose mode to get the diagnostics shown in #2698?

Run ctest manually with the -V option. For example:

$ ctest -V -R ImportBug5430
mhoemmen commented 6 years ago

I'm closing this one; please feel free to reopen if this issue should pop up again. Thanks all!

bartlettroscoe commented 6 years ago

This test TpetraCore_ImportBug5430_MPI_4 passed today in the build Trilinos-atdm-white-ride-cuda-opt showing no segfault as shown at:

The only output shown was:

End Result: TEST PASSED

Since this is no longer segfaulting, we should not expect to see this timing out randomly anymore.

This issue is therefore resolved!

@mhoemmen, thanks for fixing this!

The one request I would have is that we should not close an Issue until we get confirmation on CDash that an issue is fixed. That way we have independent evidence that an issue is resolved. And in case the issue is not fixed, then we have a reminder to look into this.

mhoemmen commented 6 years ago

@bartlettroscoe Thanks! Sorry about closing too soon!