Closed bartlettroscoe closed 6 years ago
NOTE: Now that the issue with the mass of segfaulting tests in #2454 seems to have been resolved, this looks to be the only failing test for the build Trilinos-atdm-white-ride-cuda-opt
as shown at:
Therefore, it would be nice to get this test fixed fairly quickly, or temporarily disable the test so that we can promote this build to the "ATDM" CDash Track/Group as described at:
@bartlettroscoe I'm swamped with non-Trilinos things this weekend. I suspect the test is just slow because it UVMs all over the place. If you really need to disable it, please go ahead, but we @trilinos/tpetra folks really need to look at it. I'm worried about seeing the Tpetra tests get slower over time on CUDA, with no obvious reason why.
@bartlettroscoe It may also make sense to increase the timeout temporarily on this platform, given the value of knowing whether the test passed.
@mhoemmen, increasing the timeout likely will not help. As I tried to explain above, the history for this test shown in that CDash query shows that it either passes in under 2 sec, or it times out at 10 minutes. For example, today, that test just passed in 1s 40ms yesterday as shown :
*https://testing-vm.sandia.gov/cdash/testDetails.php?test=46193739&build=3484009
That suggests a race condition with a hang or something?
Therefore, the only way to avoid this problem is to fix the test or disable the test. It is okay if you can't fix this right away. We can just temporarily disable it and then someone can fix this later.
@bartlettroscoe OK to disable for now -- we'll take a look.
@bartlettroscoe OK to disable for now -- we'll take a look.
@mhoemmen, okay, I will. Thanks.
@mhoemmen , I have some time for Tpetra this week, do you want me to look at this?
@tjfulle wrote:
@mhoemmen , I have some time for Tpetra this week, do you want me to look at this?
Please do -- it would be super helpful. Thanks! :-D
My guess is that it's some UVM-related issue, where the code nondeterministically thrashes UVM pages back and forth between host and device.
I created a brute-force script to investigate the timeout. It runs the test N
times and tracks the number of failures and timeouts:
#!/bin/bash
F=TpetraCore_ImportBug5430_MPI_4
I=1
N=100
FAILED=0
TIMEDOUT=0
rm -rf Testing-*
while [[ $I -le $N ]]
do
printf "\rRunning number $I"
rm -rf Testing
mkdir Testing
timeout 30s bsub -x -Is -q rhel7F -n 16 ctest -R $F &> $F-$I.log
returncode=$?
if [[ $returncode == 124 ]]; then
# Timed out!
mv Testing Testing-$I.timedout && mv $F-$I.log Testing-$I.timedout
TIMEDOUT=$(($TIMEDOUT+1))
elif [[ $returncode != 0 ]]; then
mv Testing Testing-$I.failed && mv $F-$I.log Testing-$I.failed
FAILED=$(($FAILED+1))
else
mv Testing Testing-$I && mv $F-$I.log Testing-$I
fi
I=$(($I+1))
done
printf "\n${FAILED} failed and ${TIMEDOUT} timed out out of ${N}!\n"
The number of time outs is random, the last time I ran it 3 timed out out of 100. The time before, only 1 timed out. All of the tests the "pass" segfault:
$cat Testing/Temporary/LastTest.log
Start testing: May 08 10:00 MDT
----------------------------------------------------------
74/161 Testing: TpetraCore_ImportBug5430_MPI_4
74/161 Test: TpetraCore_ImportBug5430_MPI_4
Command: "/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpiexec" "--mca" "orte_abort_on_non_zero_status" "0" "-np" "4" "-map-by" "socket:PE=4" "<p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe"
Directory: <p>/packages/tpetra/core/test/ImportExport
"TpetraCore_ImportBug5430_MPI_4" start time: May 08 10:00 MDT
Output:
----------------------------------------------------------
End Result: TEST PASSED
[ride16:75760] *** Process received signal ***
[ride16:75760] Signal: Segmentation fault (11)
[ride16:75760] Signal code: Address not mapped (1)
[ride16:75760] Failing at address: 0x2007bc45700
[ride16:75760] [ 0] [0x100000050478]
[ride16:75760] [ 1] [0x3ffffbeb89e0]
[ride16:75760] [ 2] [ride16:75758] *** Process received signal ***
[ride16:75758] Signal: Segmentation fault (11)
[ride16:75758] Signal code: Address not mapped (1)
[ride16:75758] Failing at address: 0x2002ef02000
[ride16:75758] [ 0] [0x100000050478]
[ride16:75758] [ 1] [0x3fffd8977290]
[ride16:75758] [ 2] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75760] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75760] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75758] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75760] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75758] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75760] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75758] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75758] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75760] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75760] [ 8] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75758] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75760] [ 9] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75758] [ 8] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75760] [10] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75758] [ 9] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75760] *** End of error message ***
/lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75758] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75758] *** End of error message ***
[ride16:75759] *** Process received signal ***
[ride16:75759] Signal: Segmentation fault (11)
[ride16:75759] Signal code: Address not mapped (1)
[ride16:75759] Failing at address: 0x200357a3a00
[ride16:75759] [ 0] [0x100000050478]
[ride16:75759] [ 1] [0x3fffc22dd310]
[ride16:75759] [ 2] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75759] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75759] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75759] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75759] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75759] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75759] [ 8] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75759] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75759] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75759] *** End of error message ***
[ride16:75761] *** Process received signal ***
[ride16:75761] Signal: Segmentation fault (11)
[ride16:75761] Signal code: Address not mapped (1)
[ride16:75761] Failing at address: 0x2004190aa50
[ride16:75761] [ 0] [0x100000050478]
[ride16:75761] [ 1] [0x3fffe5acca10]
[ride16:75761] [ 2] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride16:75761] [ 3] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride16:75761] [ 4] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride16:75761] [ 5] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride16:75761] [ 6] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride16:75761] [ 7] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride16:75761] [ 8] <p>/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride16:75761] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride16:75761] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride16:75761] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 75758 on node ride16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
<end of output>
Test time = 1.95 sec
----------------------------------------------------------
Test Pass Reason:
Required regular expression found.Regex=[End Result: TEST PASSED
]
"TpetraCore_ImportBug5430_MPI_4" end time: May 08 10:00 MDT
"TpetraCore_ImportBug5430_MPI_4" time elapsed: 00:00:01
----------------------------------------------------------
End testing: May 08 10:00 MDT
Tpetra = 7.79 sec*proc
ctest
does not report any of the tests as failed. Hundreds of core.[0-9]+
files were created in the directory where the test executable lives.
The tests that timeout don't seem to start up at all:
$ cat Testing-3.timedout/TpetraCore_ImportBug5430_MPI_4-3.log
***Forced exclusive execution
Job <318385> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on ride14>>
Test project /home/tjfulle/Software/VersionControl/github.com/tjfulle/Trilinos/develop/builds/gcc-5.4.0-openmpi-1.10.4-static-opt
Start 74: TpetraCore_ImportBug5430_MPI_4
Terminated
and
$ cat Testing-3.timedout/Temporary/LastTest.log.tmp
Start testing: May 08 11:04 MDT
----------------------------------------------------------
If instead I run the test directly (not through ctest
), all of the "passed" tests segfault as before (but the script recognizes them as failed since the return code comes from the executable and not ctest
). The log file for the timed out tests also show the test as "passed":
cat cat TpetraCore_ImportBug5430_MPI_4-57.timedout
***Forced exclusive execution
Job <318546> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on ride11>>
End Result: TEST PASSED
[ride11:15404] *** Process received signal ***
[ride11:15404] Signal: Segmentation fault (11)
[ride11:15404] Signal code: Address not mapped (1)
[ride11:15404] Failing at address: 0x20039770230
[ride11:15404] [ 0] [0x100000050478]
[ride11:15404] [ 1] [0x3fffe8484660]
[ride11:15404] [ 2] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15404] [ 3] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15404] [ 4] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15404] [ 5] [ride11:15403] *** Process received signal ***
[ride11:15403] Signal: Segmentation fault (11)
[ride11:15403] Signal code: Address not mapped (1)
[ride11:15403] Failing at address: 0x20006f71560
./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15404] [ 6] [ride11:15403] [ 0] [0x100000050478]
[ride11:15403] [ 1] [0x3fffe20a9fc0]
[ride11:15403] [ 2] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15404] [ 7] [ride11:15400] *** Process received signal ***
./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15404] [ 8] [ride11:15400] Signal: Segmentation fault (11)
[ride11:15400] Signal code: Address not mapped (1)
[ride11:15400] Failing at address: 0x2000fbc4440
[ride11:15400] [ 0] [0x100000050478]
[ride11:15400] [ 1] [0x3fffeae87750]
[ride11:15400] [ 2] ./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15404] [ 9] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15403] [ 3] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15404] [10] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15403] [ 4] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15404] *** End of error message ***
[ride11:15402] *** Process received signal ***
[ride11:15402] Signal: Segmentation fault (11)
[ride11:15402] Signal code: Address not mapped (1)
[ride11:15402] Failing at address: 0x200136d0ee0
[ride11:15402] [ 0] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15400] [ 3] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15403] [ 5] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15400] [ 4] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15403] [ 6] [0x100000050478]
[ride11:15402] [ 1] [0x3fffd2474cf0]
[ride11:15402] [ 2] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15400] [ 5] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038e9b4]
[ride11:15402] [ 3] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15403] [ 7] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15400] [ 6] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15403] [ 8] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f9ce0]
[ride11:15402] [ 4] ./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15403] [ 9] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15400] [ 7] ./TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101fa080]
[ride11:15402] [ 5] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15400] [ 8] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15403] [10] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017ced8]
[ride11:15402] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15403] *** End of error message ***
./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15400] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15400] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15400] *** End of error message ***
./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a45dc]
[ride11:15402] [ 7] ./TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10185638]
[ride11:15402] [ 8] ./TpetraCore_ImportBug5430.exe(main+0xaa0)[0x101697a0]
[ride11:15402] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride11:15402] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride11:15402] *** End of error message ***
mpiexec: Forwarding signal 18 to job
mpiexec: Forwarding signal 18 to job
mpiexec: Forwarding signal 18 to job
There are a few dozen more lines with mpiexec: Forwarding signal 18 to job
. The don't appear in the non-timed out log files, so I guess they come from timeout
shutting down the job.
I haven't had time to look beyond this. If @bartlettroscoe or @mhoemmen see any smoking guns in the log files from the failed or timed out runs, I'm all ears!
@tjfulle Is this related to https://github.com/trilinos/Trilinos/issues/2698 somehow?
@tjfulle I am able to replicate the "PASSED" but segfault issue on my GPU workstation (K80).
@mhoemmen, weird, you'd think ctest
would detect that.
How do you enable verbose mode to get the diagnostics shown in #2698?
Ooo, this could be a destructor issue. Tests that run in main()
may manifest this.
Darn, you are right. The "passing" tests shown today, for example, at:
show a segfault:
End Result: TEST PASSED
[ride10:120546] *** Process received signal ***
[ride10:120546] Signal: Segmentation fault (11)
[ride10:120546] Signal code: Address not mapped (1)
[ride10:120546] Failing at address: 0x2005c2af3f0
[ride10:120546] [ 0] [0x100000050478]
[ride10:120546] [ 1] [0x3fffd1fb2230]
[ride10:120546] [ 2] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra7Details22teuchosCommIsAnMpiCommERKN7Teuchos4CommIiEE+0x34)[0x1038c414]
[ride10:120546] [ 3] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED1Ev+0x300)[0x101f7740]
[ride10:120546] [ 4] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEED0Ev+0x20)[0x101f7ae0]
[ride10:120546] [ 5] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos11RCPNodeTmplIN6Tpetra3MapIiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_4CudaENS3_12CudaUVMSpaceEEEEENS_13DeallocDeleteIS9_EEE10delete_objEv+0x78)[0x1017a928]
[ride10:120546] [ 6] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle15unbindOneStrongEv+0x2c)[0x105a203c]
[ride10:120546] [ 7] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(_ZN7Teuchos13RCPNodeHandle6unbindEv+0x78)[0x10183088]
[ride10:120546] [ 8] /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/ImportExport/TpetraCore_ImportBug5430.exe(main+0xaa0)[0x10167290]
[ride10:120546] [ 9] /lib64/libc.so.6(+0x24700)[0x10000bc04700]
[ride10:120546] [10] /lib64/libc.so.6(__libc_start_main+0xc4)[0x10000bc048f4]
[ride10:120546] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 120546 on node ride10 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Really this test should should be changed to use TRIBITS_ADD_ADVANCED_TEST()
and then take advantage of grepping for End Result: TEST PASSED
but also require a non-zero return code. That would show this a failing.
It must be that every so often that one of the MPI ranks segfaults and that causes a hang in one of the other MPI ranks.
@bartlettroscoe , I think you are right. I've run the test hundreds of times today, hangs about 3% of the time.
I can run the executable through valgrind to see if anything useful comes up.
@tjfulle That was literally the issue: The test lives at main() scope, which means that Kokkos::finalize might get called before some Kokkos stuff gets deallocated. When I moved the test out of main() scope, it passed without segfaults.
I'll open a PR soon.
@bartlettroscoe @tjfulle PR #2702 proposes a fix for this issue. I tested it with CUDA.
@tjfulle wrote:
How do you enable verbose mode to get the diagnostics shown in #2698?
Run ctest manually with the -V
option. For example:
$ ctest -V -R ImportBug5430
I'm closing this one; please feel free to reopen if this issue should pop up again. Thanks all!
This test TpetraCore_ImportBug5430_MPI_4
passed today in the build Trilinos-atdm-white-ride-cuda-opt
showing no segfault as shown at:
The only output shown was:
End Result: TEST PASSED
Since this is no longer segfaulting, we should not expect to see this timing out randomly anymore.
This issue is therefore resolved!
@mhoemmen, thanks for fixing this!
The one request I would have is that we should not close an Issue until we get confirmation on CDash that an issue is fixed. That way we have independent evidence that an issue is resolved. And in case the issue is not fixed, then we have a reminder to look into this.
@bartlettroscoe Thanks! Sorry about closing too soon!
CC: @trilinos/tpetra, @fryeguy52
Next Action Status
Commit e376bd9 pushed on 5/9/2018 fixed this. The test passed with no segfault in the build
Trilinos-atdm-white-ride-cuda-opt
on 5/9/2018.Description
The query:
seems to show that the test
TpetraCore_ImportBug5430_MPI_4
is randomly hanging for in the buildTrilinos-atdm-white-ride-cuda-opt
. When the test passes, it finished in less than 2 sec. And then it times out every so often at 10 minutes. It timed out 8 times out of 44 days.That suggests a race condition involving threads or something?
Steps to reproduce
Anyone with access to the machines 'white' or 'ride' should be able to build and run this test with as described at:
Specifically, one should be able to do:
However, given that this test is failing randomly, one may need to run it a bunch of times in hope of getting it to time out. Also, you might need to run the entire Tpetra test suite many times in order to reproduce this behavior.