trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

tpetra: broken unit tests with cuda 12.4 + h100 gpus #13399

Open vasylivy opened 2 months ago

vasylivy commented 2 months ago

Hi,

Reporting broken unit tests with CUDA 12.4 + H100 GPUs. The build is configuration 1 as reported in https://github.com/trilinos/Trilinos/issues/13397.

334:TpetraCore_CrsMatrix_MatvecFence_MPI_4

CrsMatrix_int_longlong_double_Tpetra_KokkosCompat_KokkosCudaWrapperNode_MatvecFence_UnitTest

FenceCounter::get_count_global(exec_space.name()) = 40 == expectedGlobalCount = 60

370:TpetraCore_AsyncTransfer_UnitTests_MPI_4

p=3 | The following tests FAILED:
p=3 |     13. AsyncReverseExport_double_int_longlong_LowerTriangularCrsMatrix_UnitTest ... 
p=3 |     21. TransferArrived_double_int_longlong_CrsMatrix_forwardImportTrue_UnitTest ... 
p=3 |     23. TransferArrived_double_int_longlong_CrsMatrix_forwardExportTrue_UnitTest ... 

390:TpetraCore_MatrixMarket_Tpetra_CrsMatrix_Dist_BinaryPerProcess_simple_MPI_3

Throw number = 1

Throw test that evaluated to true: npRows * npCols != np

nProcessorCols 3 * nProcessorRows 2 = 6 must equal nProcessors 3 for 2D distribution
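For reference, the last failure is a process-grid consistency check: the test asks for a 2 x 3 grid (6 ranks) but runs on 3 MPI ranks. Below is a minimal Python sketch of that check. It is not Tpetra's implementation; the names `check_grid` and `valid_grids` are hypothetical and only mirror the quantities printed in the message.

```python
def check_grid(n_proc_rows, n_proc_cols, n_procs):
    """Raise if the 2D process grid does not use exactly n_procs ranks."""
    product = n_proc_rows * n_proc_cols
    if product != n_procs:
        raise ValueError(
            f"nProcessorCols {n_proc_cols} * nProcessorRows {n_proc_rows} = "
            f"{product} must equal nProcessors {n_procs} for 2D distribution"
        )


def valid_grids(n_procs):
    """Return every (rows, cols) pair whose product is n_procs."""
    return [(r, n_procs // r) for r in range(1, n_procs + 1) if n_procs % r == 0]


if __name__ == "__main__":
    print(valid_grids(3))  # grids that 3 ranks can actually form
    try:
        check_grid(2, 3, 3)  # the grid from the failing test
    except ValueError as err:
        print(err)
```

With 3 ranks the only grids that pass are 1 x 3 and 3 x 1, so the failure looks like a mismatch between the requested grid and the rank count rather than a GPU problem, though I have not confirmed where the 2 x 3 request comes from.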

The following tests time out at 300 s; they were fine with a non-UVM config. I'll have to retry these later. If you have a recommended timeout, let me know.

367:TpetraCore_ImportExport2_UnitTests_Send_MPI_4
369:TpetraCore_ImportExport2_UnitTests_Alltoall_MPI_4
427:TpetraCore_MatrixMatrix_UnitTests_MPI_4
428:TpetraCore_FECrs_MatrixMatrix_UnitTests_MPI_4
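A possible way to retry just these four with a longer limit is sketched below in Python: it builds a `ctest` command using the standard `-R` (name regex) and `--timeout` (default per-test timeout in seconds) options. The helper name `ctest_rerun_command` is hypothetical, and the 1200 s value is an arbitrary placeholder, not a recommended timeout.

```python
import shlex

# Test names copied from the list above.
TIMED_OUT_TESTS = [
    "TpetraCore_ImportExport2_UnitTests_Send_MPI_4",
    "TpetraCore_ImportExport2_UnitTests_Alltoall_MPI_4",
    "TpetraCore_MatrixMatrix_UnitTests_MPI_4",
    "TpetraCore_FECrs_MatrixMatrix_UnitTests_MPI_4",
]


def ctest_rerun_command(test_names, timeout_seconds):
    """Build a ctest command that reruns only the named tests."""
    # Anchor each name so the regex matches whole test names only.
    pattern = "|".join(f"^{name}$" for name in test_names)
    return [
        "ctest",
        "-R", pattern,
        "--timeout", str(timeout_seconds),
        "--output-on-failure",
    ]


if __name__ == "__main__":
    # 1200 is a placeholder value, not a recommendation.
    cmd = ctest_rerun_command(TIMED_OUT_TESTS, 1200)
    print(shlex.join(cmd))  # paste the printed line into a shell in the build dir
```

Note that `--timeout` only sets the default: a test that already has its own TIMEOUT property keeps that value, in which case the limit would have to be changed at configure time instead.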

Thanks,

Yaro

csiefer2 commented 2 months ago

@vasylivy Relevant machine is down for upgrades. We will compare against our configuration and try to reproduce when it comes back up.

vasylivy commented 2 months ago

Tested config 1 with the following turned off:

-DKokkos_ENABLE_CUDA_UVM=OFF -DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=OFF -DTpetra_ALLOCATE_IN_SHARED_SPACE=OFF

With these off, the unit tests pass, so it would appear to be UVM related.

Yaro

csiefer2 commented 2 months ago

@vasylivy I built all the unit tests the way the perf tests build on Hops and they all pass.

The RDC build failed because evidently you need cuSPARSE enabled to build with RDC (why?). Will fix and report back when that finishes.

I can try a UVM one as well w/o RDC.

As an aside, I just got new MPI settings from @jjellio that I need to try.

csiefer2 commented 2 months ago

@vasylivy Yeah, it appears to be UVM, because RDC by itself has exactly 1 failing test.

csiefer2 commented 1 month ago

@vasylivy UVM tests on vortex passed. I'm going to try CEE A100s and H100s to see if this is machine-specific or accelerator-specific.

Edit: CEE V100 & A100 CUDA 12.4 tests all pass.

Second Edit: CEE H100 CUDA 12.4 has a number of failing tests. So our problem is not CUDA-version specific; it is hardware specific.

@vbrunini

vasylivy commented 1 month ago

@csiefer2 I had one failure in Tpetra on the Ada architecture w/ UVM, so this would indeed appear to be specific to Hopper.