kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

HIP backend general issue #806

Closed lucbv closed 1 year ago

lucbv commented 4 years ago

This issue is meant to centralize the issues and work involved in integrating the HIP backend into Kokkos-Kernels. Ideally, separate issues would be opened for specific technical problems and then referenced here, so that users and developers know what the known issues are and who is working on them.

lucbv commented 4 years ago

Here is a list of the current issues observed while building with the HIP backend:

Now that the ETI and tests are merged (or are about to be), we can make a list of what still needs to be done to get the backend fully functional.

HIP spot-check enabled tests

HIP tests currently failing

Issues in batchedDLA

  1. batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex fails with a bunch of values == 0 which seems to indicate a memory issue with complex?
  2. batched_scalar_team_trsm_l_u_t_n_dcomplex_dcomplex aborts on Memory access fault by GPU
  3. batched_scalar_team_trsm_l_u_nt_n_dcomplex_double same as the dcomplex_dcomplex version
  4. batched_scalar_team_trsm_l_u_t_n_dcomplex_double same as the dcomplex_dcomplex version
  5. batched_scalar_teamvector_qr_with_columnpivoting_double aborts on Device::callbackQueue aborting with status: 0x29
  6. batched_scalar_teamvector_solve_utv_double aborts on Memory access fault by GPU
  7. batched_scalar_teamvector_solve_utv2_double aborts on Memory access fault by GPU after failing with values == 0
  8. batched_scalar_teamvector_utv_double aborts on Memory access fault by GPU

Issues in Graph (offset==int and offset==size_t fail in the same way)

  1. graph_graph_color_double_int_int_TestExecSpace aborts on Memory access fault by GPU
  2. graph_graph_color_distance2_double_int_int_TestExecSpace aborts on Memory access fault by GPU
  3. graph_graph_color_deterministic_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016

Issues in Sparse (offset==int and offset==size_t fail in the same way)

  1. sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace aborts on Memory access fault by GPU. Note: the same happens with rank2 and/or symmetric tests
  2. sparse_balloon_clustering_double_int_int_TestExecSpace aborts on Memory access fault by GPU. Note: happens randomly, so possibly related to a race condition?
  3. sparse_replaceSumIntoLonger_double_int_int_TestExecSpace fails with values == 0
  4. sparse_replaceSumIntoLonger_kokkos_complex_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016
  5. sparse_replaceSumInto_kokkos_complex_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016
  6. sparse_spgemm_jacobi_kokkos_complex_double_int_size_t_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x29
  7. sparse_spmv_kokkos_complex_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016
  8. sparse_spmv_mv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016

ndellingwood commented 4 years ago

@lucbv I'll add amd/caraway options for the testing scripts this week

lucbv commented 4 years ago

Thanks, I have shared my current configuration on the internal repo (see the Technical tips section on the homepage). One thing I still need to do is ask what extra flags Kokkos uses for AMD builds; for now I have removed all the warning/error flags, as Kokkos would not build otherwise.

brian-kelley commented 4 years ago

@lucbv I have a branch now that passes unit tests for CUDA, Serial, and OpenMP, and will (hopefully) also work on HIP once the unit tests are built for it. The only things still hardcoded for CUDA are those involving cusparse, cublas, graphs and streams. There are a couple of places where __CUDA_ARCH__ is used, but that is still defined for HIP so it should be OK.

lucbv commented 4 years ago

@brian-kelley thanks for looking at this, I am still waiting on rocm/3.8.0 tests to move with the ETI/tests PR as I feel it might fix quite a few things. Hopefully I can get that done next week but I'm not sure. If your PR is ready feel free to put me as a reviewer, I will finish my review of the coarsening PR this weekend.

lucbv commented 3 years ago

Using the latest rocm LLVM compiler the new list of failing tests is much shorter:

Graph

[ RUN ] hip.graph_graph_color_deterministic_double_int_int_TestExecSpace
:0:rocdevice.cpp :2325: 378970770383 us: Device::callbackQueue aborting with status: 0x1016
Aborted (core dumped)

[ RUN ] hip.graph_graph_color_double_int_size_t_TestExecSpace
:0:rocdevice.cpp :2325: 379268378835 us: Device::callbackQueue aborting with status: 0x1016
Aborted (core dumped)

Sparse

Some failures remain that are related to complex atomics; updates in Kokkos Core should resolve these issues.

brian-kelley commented 2 years ago

More things are working now - with rocm 4.5 and MI100 (on Caraway) all tests pass except for structured SpMV (hip.sparse_spmv_struct_double_int_size_t_TestExecSpace).

lucbv commented 1 year ago

At this point we are testing HIP in our CI, and everything is building correctly :)