Closed lucbv closed 1 year ago
Here is a list of the current issues observed while building with the HIP backend:
- long double specialization, see issue #807, PR #809 and PR #844
- Algo::Level3::Blocked::mb() is not defined, see issue #808 and PR #812
- cm_generate_makefile to support HIP builds, see PR #818
- test_all_sandia to allow spot_check on caraway (AMD/HIP platform), see PR #842
- execution_space=Kokkos::Experimental::HIP path, see PR #828 and PR #840

Now that the ETI and tests are merged (or are about to be), we can make a list of what still needs to be done to get the backend fully functional.
- Memory access fault by GPU
- dcomplex_dcomple version
- dcomplex_dcomple version
- Device::callbackQueue aborting with status: 0x29
- Memory access fault by GPU
- Memory access fault by GPU after failing with values == 0
- Memory access fault by GPU
- Memory access fault by GPU
- Memory access fault by GPU
- Device::callbackQueue aborting with status: 0x1016
- Memory access fault by GPU, Note: same happens with rank2 and/or symmetric tests
- Memory access fault by GPU, Note: happens randomly so quick possibly related to race condition?
- values == 0
- Device::callbackQueue aborting with status: 0x1016
- Device::callbackQueue aborting with status: 0x1016
- Device::callbackQueue aborting with status: 0x29
- Device::callbackQueue aborting with status: 0x1016
- Device::callbackQueue aborting with status: 0x1016
@lucbv I'll add amd/caraway options for the testing scripts this week
Thanks, I have shared my current configuration on the internal repo (see the Technical tips section on the homepage). One thing that I need to do is ask what extra flags are used by Kokkos for AMD builds; currently I have removed all the warning/error flags, as Kokkos would not build otherwise.
@lucbv I have a branch now that passes unit tests for CUDA, Serial, OpenMP but will (hopefully) also work on HIP when the unit tests are built for it. The only things still hardcoded for CUDA are things involving cusparse, cublas, graphs and streams. There are a couple of places where __CUDA_ARCH__ is used, but that is still defined for HIP so it should be OK.
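As a side note on the `__CUDA_ARCH__` point, here is a minimal sketch of a device-pass guard that does not depend on the HIP compiler also defining the CUDA macro. It assumes the `__HIP_DEVICE_COMPILE__` macro provided by the HIP toolchain; the macro name `EXAMPLE_DEVICE_PASS` and the helper `in_device_pass` are hypothetical and not part of Kokkos-Kernels.

```cpp
// Hypothetical helper (not part of Kokkos-Kernels): reports whether the
// current compilation pass targets a GPU device.
// __CUDA_ARCH__ is defined by the CUDA compiler during device passes;
// __HIP_DEVICE_COMPILE__ is defined by the HIP compiler during device passes.
#if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__)
#define EXAMPLE_DEVICE_PASS 1
#else
#define EXAMPLE_DEVICE_PASS 0
#endif

// Evaluates to false when built by a plain host compiler, since neither
// device macro is defined there.
constexpr bool in_device_pass() { return EXAMPLE_DEVICE_PASS == 1; }
```

Checking both macros keeps the guard working even if a given toolchain does not define `__CUDA_ARCH__` for HIP device code.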
@brian-kelley thanks for looking at this, I am still waiting on rocm/3.8.0 tests to move with the ETI/tests PR as I feel it might fix quite a few things. Hopefully I can get that done next week but I'm not sure.
If your PR is ready feel free to put me as a reviewer, I will finish my review of the coarsening PR this weekend.
Using the latest rocm LLVM compiler the new list of failing tests is much shorter:
[ RUN ] hip.graph_graph_color_deterministic_double_int_int_TestExecSpace
:0:rocdevice.cpp :2325: 378970770383 us: Device::callbackQueue aborting with status: 0x1016
Aborted (core dumped)
[ RUN ] hip.graph_graph_color_double_int_size_t_TestExecSpace
:0:rocdevice.cpp :2325: 379268378835 us: Device::callbackQueue aborting with status: 0x1016
Aborted (core dumped)
Some failures are related to complex atomics; updates in Kokkos Core should resolve these issues.
More things are working now - with rocm 4.5 and MI100 (on Caraway) all tests pass except for structured SpMV (hip.sparse_spmv_struct_double_int_size_t_TestExecSpace).
At this point we are testing HIP in our CI, and everything is building correctly :)
This issue is meant to centralize issues and work being done to integrate the HIP backend in Kokkos-Kernels. Ideally I would like separate issues to be opened for specific technical problems and then referenced here, so that users and developers know what the known issues are and who is working on them.