Closed pierotofy closed 8 months ago
@pierotofy Do you have a high-level suggestion on how to approach porting the CUDA code to AMD HIP? I'm planning to give it a quick try with ZLUDA. Any thoughts?
I personally would start from https://rocmdocs.amd.com/projects/HIPIFY/en/latest/ and https://rocm.docs.amd.com/projects/HIP/en/latest/
ZLUDA seems interesting too, but I haven't tried it.
Quick update: HIP is missing some functions that are used in gsplat's CUDA code. This will require some manual changes, as a fully automatic port does not seem feasible.
hipify-clang seems more robust than hipify-perl for porting CUDA code. I was able to run these 3 CUDA files through translation after carefully tweaking the CUDA, ROCm, and clang/LLVM versions:
os: ubuntu-22.04, torch-version: 2.2.1, cuda-version: 11.8.0, rocm-version: 5.6.1, llvm-version: 16
[HIPIFY] info: file './vendor/gsplat/backward.cu' statistics:
CONVERTED refs count: 24
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 22
TOTAL bytes: 18934
CHANGED lines of code: 2
TOTAL lines of code: 502
CODE CHANGED (in bytes) %: 0.1
CODE CHANGED (in lines) %: 0.4
TIME ELAPSED s: 1.63
[HIPIFY] info: CONVERTED refs by type:
device_function: 23
include: 1
[HIPIFY] info: CONVERTED refs by API:
CUDA RT API: 24
[HIPIFY] info: CONVERTED refs by names:
__expf: 2
atomicAdd: 16
cooperative_groups.h: 1
max: 1
min: 4
[HIPIFY] info: file './vendor/gsplat/bindings.cu' statistics:
CONVERTED refs count: 4
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 75
TOTAL bytes: 20985
CHANGED lines of code: 4
TOTAL lines of code: 624
CODE CHANGED (in bytes) %: 0.4
CODE CHANGED (in lines) %: 0.6
TIME ELAPSED s: 14.87
[HIPIFY] info: CONVERTED refs by type:
include: 2
include_cuda_main_header: 2
[HIPIFY] info: CONVERTED refs by API:
CUDA Driver API: 1
CUDA RT API: 3
[HIPIFY] info: CONVERTED refs by names:
cooperative_groups.h: 1
cuda.h: 1
cuda_runtime.h: 1
cuda_runtime_api.h: 1
[HIPIFY] info: file './vendor/gsplat/forward.cu' statistics:
CONVERTED refs count: 7
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 22
TOTAL bytes: 16216
CHANGED lines of code: 2
TOTAL lines of code: 463
CODE CHANGED (in bytes) %: 0.1
CODE CHANGED (in lines) %: 0.4
TIME ELAPSED s: 0.90
[HIPIFY] info: CONVERTED refs by type:
device_function: 6
include: 1
[HIPIFY] info: CONVERTED refs by API:
CUDA RT API: 7
[HIPIFY] info: CONVERTED refs by names:
__expf: 2
__syncthreads_count: 1
cooperative_groups.h: 1
min: 3
[HIPIFY] info: file 'GLOBAL' statistics:
CONVERTED refs count: 35
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 119
TOTAL bytes: 56135
CHANGED lines of code: 8
TOTAL lines of code: 1589
CODE CHANGED (in bytes) %: 0.2
CODE CHANGED (in lines) %: 0.5
TIME ELAPSED s: 17.39
[HIPIFY] info: CONVERTED refs by type:
device_function: 29
include: 4
include_cuda_main_header: 2
[HIPIFY] info: CONVERTED refs by API:
CUDA Driver API: 1
CUDA RT API: 34
[HIPIFY] info: CONVERTED refs by names:
__expf: 4
__syncthreads_count: 1
atomicAdd: 16
cooperative_groups.h: 3
cuda.h: 1
cuda_runtime.h: 1
cuda_runtime_api.h: 1
max: 1
min: 7
[HIPIFY] info: TOTAL statistics:
CONVERTED files: 3
PROCESSED files: 3
Let's see if we can compile them into an executable. To further test compatibility and performance, I'll probably need to find a way to hook a 2nd GPU (AMD Radeon RX 6700 XT) into my mini desktop.
I had trouble with the functions in cooperative_groups/reduce.h, which don't seem to have an equivalent in HIP (though I used hipify-perl). I managed to compile it after making some manual changes, but hit some crashes which I have yet to investigate.
Super experimental, proof of concept branch https://github.com/pierotofy/OpenSplat/tree/hippoc
Unfortunately, hipify-clang generates code very similar to what hipify-perl produced. But it seems like most functions map directly to HIP, except cg::reduce() and cg::this_thread_block(). There should be a way to map them manually:
auto block = hipThreadIdx_x + hipBlockDim_x * hipBlockIdx_x;
int32_t tile_id =
    hipBlockIdx_y * tile_bounds.x + hipBlockIdx_x;
unsigned i =
    hipBlockIdx_y * hipBlockDim_y + hipThreadIdx_y;
unsigned j =
    hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
inline __device__ void warpSum(float& val, const int& tile){
    for (int offset = tile / 2; offset > 0; offset /= 2) {
        // Note: HIP has no mask-taking __shfl_down_sync; on the HIP side
        // this becomes __shfl_down(val, offset).
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
}
BTW, the ROCm HIP build pipeline is ready. What's the best way to integrate it with your working branch (hippoc)?
@pierotofy Based on your current work (hippoc), I created a new PR#36 to consolidate our updates. It would be great to run an actual test. On my end, it may take some time to find a way to hook up an extra AMD GPU to my desktop: it already has an NVIDIA 3080 Ti up and running, and chassis space is very tight right now.
BTW, if you can find some AMD/NVIDIA GPU resources, setting up a self-hosted GPU runner could be a good option for daily build & benchmark tasks.
I don't have an AMD GPU either, but I was thinking of using https://github.com/ROCm/HIP-CPU to test.
It's a pretty interesting project and could also become useful for a Mac port. Thank you for pointing it out. If we want to benchmark OpenSplat performance on datacenter GPUs, the AMD Accelerator Cloud (AAC) might be able to provide the required resources: https://github.com/amddcgpuce/AMDAcceleratorCloudGuides
Found that HIP-CPU does not ship the cooperative_groups.h header, so the entire cg namespace is missing. That's problematic, since there are a lot of references to it in the gsplat implementation.
For CPU support I would either have to refactor gsplat to not use cooperative groups, or rewrite the entire thing for CPU (or wait and hope for HIP-CPU support: https://github.com/ROCm/HIP-CPU/issues/60).
Finally, swapped Nvidia with AMD GPU and almost ready for the battle test 😅
AMD GPU is now working 🚀
It should be possible to port the CUDA code to HIP and run this on AMD GPUs as well.