Closed pierotofy closed 8 months ago
@pierotofy Do you have a high-level suggestion on how to approach porting the CUDA code to AMD HIP? I'm planning to give it a quick try with ZLUDA. Any thoughts?
I personally would start from https://rocmdocs.amd.com/projects/HIPIFY/en/latest/ and https://rocm.docs.amd.com/projects/HIP/en/latest/
ZLUDA seems interesting too, but I haven't tried it.
Quick update: HIP is missing some functions that are used in gsplat's CUDA code. This will require some manual changes, as a fully automatic port does not seem feasible.
hipify-clang seems more robust than hipify-perl for porting CUDA code. I was able to run these 3 CUDA files through translation after carefully tweaking the CUDA, ROCm, and clang/LLVM versions:
os: ubuntu-22.04, torch-version: 2.2.1, cuda-version: 11.8.0, rocm-version: 5.6.1, llvm-version: 16
[HIPIFY] info: file './vendor/gsplat/backward.cu' statistics:
CONVERTED refs count: 24
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 22
TOTAL bytes: 18934
CHANGED lines of code: 2
TOTAL lines of code: 502
CODE CHANGED (in bytes) %: 0.1
CODE CHANGED (in lines) %: 0.4
TIME ELAPSED s: 1.63
[HIPIFY] info: CONVERTED refs by type:
device_function: 23
include: 1
[HIPIFY] info: CONVERTED refs by API:
CUDA RT API: 24
[HIPIFY] info: CONVERTED refs by names:
__expf: 2
atomicAdd: 16
cooperative_groups.h: 1
max: 1
min: 4
[HIPIFY] info: file './vendor/gsplat/bindings.cu' statistics:
CONVERTED refs count: 4
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 75
TOTAL bytes: 20985
CHANGED lines of code: 4
TOTAL lines of code: 624
CODE CHANGED (in bytes) %: 0.4
CODE CHANGED (in lines) %: 0.6
TIME ELAPSED s: 14.87
[HIPIFY] info: CONVERTED refs by type:
include: 2
include_cuda_main_header: 2
[HIPIFY] info: CONVERTED refs by API:
CUDA Driver API: 1
CUDA RT API: 3
[HIPIFY] info: CONVERTED refs by names:
cooperative_groups.h: 1
cuda.h: 1
cuda_runtime.h: 1
cuda_runtime_api.h: 1
[HIPIFY] info: file './vendor/gsplat/forward.cu' statistics:
CONVERTED refs count: 7
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 22
TOTAL bytes: 16216
CHANGED lines of code: 2
TOTAL lines of code: 463
CODE CHANGED (in bytes) %: 0.1
CODE CHANGED (in lines) %: 0.4
TIME ELAPSED s: 0.90
[HIPIFY] info: CONVERTED refs by type:
device_function: 6
include: 1
[HIPIFY] info: CONVERTED refs by API:
CUDA RT API: 7
[HIPIFY] info: CONVERTED refs by names:
__expf: 2
__syncthreads_count: 1
cooperative_groups.h: 1
min: 3
[HIPIFY] info: file 'GLOBAL' statistics:
CONVERTED refs count: 35
UNCONVERTED refs count: 0
CONVERSION %: 100.0
REPLACED bytes: 119
TOTAL bytes: 56135
CHANGED lines of code: 8
TOTAL lines of code: 1589
CODE CHANGED (in bytes) %: 0.2
CODE CHANGED (in lines) %: 0.5
TIME ELAPSED s: 17.39
[HIPIFY] info: CONVERTED refs by type:
device_function: 29
include: 4
include_cuda_main_header: 2
[HIPIFY] info: CONVERTED refs by API:
CUDA Driver API: 1
CUDA RT API: 34
[HIPIFY] info: CONVERTED refs by names:
__expf: 4
__syncthreads_count: 1
atomicAdd: 16
cooperative_groups.h: 3
cuda.h: 1
cuda_runtime.h: 1
cuda_runtime_api.h: 1
max: 1
min: 7
[HIPIFY] info: TOTAL statistics:
CONVERTED files: 3
PROCESSED files: 3
Let's see if we can compile them into an executable. To further test compatibility and performance, I'll probably need to find a way to hook a 2nd GPU (AMD Radeon RX 6700 XT) into my mini desktop.
I had trouble with the functions in cooperative_groups/reduce.h, which don't seem to have an equivalent in HIP (though I used hipify-perl). I managed to compile it after making some manual changes, but hit some crashes which I have yet to investigate.
Super experimental, proof of concept branch https://github.com/pierotofy/OpenSplat/tree/hippoc
Unfortunately, hipify-clang generates code very similar to what hipify-perl produced. But it seems like most functions map directly to HIP, except cg::reduce() and cg::this_thread_block(). There should be a way to map them manually:
auto block = hipThreadIdx_x + hipBlockDim_x * hipBlockIdx_x;
int32_t tile_id =
    hipBlockIdx_y * tile_bounds.x + hipBlockIdx_x;
unsigned i =
    hipBlockIdx_y * hipBlockDim_y + hipThreadIdx_y;
unsigned j =
    hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
inline __device__ void warpSum(float& val, const int& tile){
    for (int offset = tile / 2; offset > 0; offset /= 2) {
        // Note: HIP has no mask-taking __shfl_down_sync; on the HIP side
        // this becomes __shfl_down(val, offset).
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
}
BTW, the ROCm HIP build pipeline is ready. What's the best way to integrate it with your working branch (hippoc)?
@pierotofy Based on your current work (hippoc), I created a new PR#36 to consolidate our updates. It would be great to run an actual test. On my end, it may take some time to find a way to hook up an extra AMD GPU to my desktop: it already has an NVIDIA 3080 Ti up and running, and chassis space is very tight right now.
BTW, if you can find some AMD/NVIDIA GPU resources, setting up a self-hosted GPU runner could be a good option for daily build & benchmark tasks.
I don't have an AMD GPU either, but I was thinking of using https://github.com/ROCm/HIP-CPU to test.
It's a pretty interesting project and could also become useful for a Mac port. Thank you for pointing it out. If we want to benchmark OpenSplat performance on datacenter GPUs, the AMD Accelerator Cloud (AAC) might be able to provide the required resources: https://github.com/amddcgpuce/AMDAcceleratorCloudGuides
Found that HIP-CPU does not ship the cooperative_groups.h header, so the entire cg namespace is missing. That's problematic, since there are a lot of references to it in the gsplat implementation.
For CPU support I would either have to refactor gsplat to not use cooperative groups, or rewrite the entire thing for CPU (or wait and hope for HIP-CPU support: https://github.com/ROCm/HIP-CPU/issues/60).
Finally, swapped Nvidia with AMD GPU and almost ready for the battle test 😅
AMD GPU is now working 🚀
It should be possible to port the CUDA code to HIP and run this on AMD GPUs as well.