Open vasylivy opened 2 months ago
@vasylivy Relevant machine is down for upgrades. We will compare against our configuration and try to reproduce when it comes back up.
Tested config 1 w/ the following turned off:
-DKokkos_ENABLE_CUDA_UVM=OFF -DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=OFF -DTpetra_ALLOCATE_IN_SHARED_SPACE=OFF
With these off the unit tests pass, so it would appear to be UVM-related.
Yaro
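For anyone reproducing the non-UVM variant described above, the three flags can be collected into a configure line roughly as follows. This is only a sketch: `TRILINOS_SRC` is a hypothetical placeholder path, the remaining config 1 options are not listed in this thread and would still need to be supplied, and the command is printed rather than executed so it can be reviewed first.

```shell
# Sketch only: TRILINOS_SRC is a hypothetical placeholder, not a path from this thread.
TRILINOS_SRC="${TRILINOS_SRC:-/path/to/Trilinos}"

# The three UVM-related options that were turned off in the passing run.
UVM_OFF_FLAGS="-DKokkos_ENABLE_CUDA_UVM=OFF -DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=OFF -DTpetra_ALLOCATE_IN_SHARED_SPACE=OFF"

# Print the configure command instead of running it; the rest of the
# config 1 options (not shown here) would go alongside these flags.
echo cmake $UVM_OFF_FLAGS "$TRILINOS_SRC"
```

Keeping the flags in one variable makes it easy to flip the same build between the UVM and non-UVM variants when comparing test results.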
@vasylivy I built all the unit tests the way the perf tests build on Hops and they all pass.
The RDC build failed because evidently you need cuSPARSE enabled to build with RDC (why?). I will fix that and report back when the build finishes.
I can try a UVM one as well w/o RDC.
As an aside, I just got new MPI settings from @jjellio that I need to try.
@vasylivy Yeah, it appears to be UVM, because RDC by itself has exactly 1 failing test.
@vasylivy UVM tests on vortex passed. I'm going to try CEE A100s and H100s to see if this is machine-specific or accelerator-specific.
Edit: CEE V100 & A100 cuda-12.4 tests all pass.
Second Edit: CEE H100 cuda-12.4 has a number of failing tests. So our problem is not specific to the CUDA version; it is specific to the hardware.
@vbrunini
@csiefer2 had one failure in Tpetra on Ada arch w/ UVM, so it would indeed appear to be specific to Hopper.
Hi,
Reporting broken unit tests with CUDA 12.4 + H100 GPUs. See configuration 1 reported here: https://github.com/trilinos/Trilinos/issues/13397.
The tests that time out at 300s were fine with the non-UVM config. I'll have to retry these later. If you have a recommended timeout, let me know.
Thanks,
Yaro