AMDGPU backend - Githubissues

masahi commented 4 years ago

Hi, I think it'd be cool to add support for AMDGPU using LLVM's AMDGPU backend and ROCm software stack (their CUDA equivalent). If you already have NVPTX backend working, I don't think it is too hard to add support for AMDGPU as well. And the graphics on AMDGPU is fantastic.

In fact, the apache project TVM has AMDGPU codegen support which was almost copy-pasted from NVPTX backend. You also need a runtime, but again TVM's implementation can help.

I helped bringing up their support for AMD more than two years ago and that was how I started contributing to TVM. As a computer graphics enthusiast I'm also very interested in this project and am looking for opportunities to contribute. If you are interested in this topic, I can start take a look @yuanming-hu

masahi commented 4 years ago

Disclaimer: I don't work for AMD and have no relationship with them whatsoever. I just like my GPU.

yuanming-hu commented 4 years ago

I think it's a great idea! One question I have:

Is supporting AMD GPUs via AMDGCN better than via OpenCL?

masahi commented 4 years ago

yeah good question if you already had supporting OpenCL in mind. I think there are many metrics for "being better" (performance, tooling, ease of development) etc. For AMD it is a question of supporting OpenCL or ROCm, so let me try giving pros/cons on each:

OpenCL pros:

Cross platform, can support other backends (I can think of Adreno and Mali, but if I remember correctly Apple ditched OpenCL and Android is also moving away from OpenCL in favor of Vulkan)
AMD has OpenCL drivers for both Win and Linux
Can share codegen implementation with CUDA (at least that is how TVM implements CUDA and OpenCL codegen)

OpenCL cons:

From what I see OpenCL on AMD is more like in "maintenance mode". They are stuck with OpenCL 2.0 forever. But I would say this applies to industry in general (Apple, Android etc)

ROCm pros:

AMD is actively pushing software stack built on ROCm (actively developed libraries for math, DL etc only work on ROCm environment).
Has more potential for optimizing for AMDGPU specifically (even inline asm is also possible) than "generic" OpenCL backend. We at TVM compared performance on OpenCL vs ROCm after auto tuning, and we saw better performance on ROCm.
Can share codegen implementation with NVPTX (in principle just needs to replace NV specific intrinsic for thread id, barrier instruction etc with AMD ones)
AMD's open source OpenCL driver is built on top of ROCm stack and this is the direction they are going.

ROCm cons:

For now it is Linux only.
Software stack itself is not mature compared to CUDA (but I'd say it is match better than OpenCL on AMD)

KLozes commented 4 years ago

I don't know much about ROCm, but there doesn't seem to be a way to allocate virtual memory yet. In C++ and CUDA this is done with mmap and cudaMallocManged, respectively. Taichi relies on Virtual memory and page faulting mechanisms in its memory management design (atleast for sparse data structures). So you may need to wait until ROCm supports these features before implementing a backend for it.

masahi commented 4 years ago

It's that unified memory thing right? It does seem to be supported by ROCm. See https://github.com/ROCm-Developer-Tools/HIP/search?q=cudaMallocManaged&unscoped_q=cudaMallocManaged

KLozes commented 4 years ago

Yes it is the unified memory thing. Maybe hipMallocManaged would work too, but it would need some testing to see if it actually behaves like mmap. That is, can it be used to reserve a huge amount of virtual memory without allocating physical memory until pages are touched? This is a relatively new feature of cudaMallocManaged and it isn't really advertised. I wouldn't be super suprised if hip doesnt support this specific thing yet. But it would be cool if it does!

yuanming-hu commented 4 years ago

@masahi @KLozes thanks for the discussions! I learned a lot from you guys regarding AMDGPU :-)

Based on what I learned, it seems to me that AMDGPU is a better option than OpenCL for Taichi, because

According to @masahi, AMDGPU can share codegen with NVPTX and it not too much work (not sure about the messy things such as linking against libdevice.bc though)
The CUDA backend is obsolete in Taichi now, since its compilation speed is too slow compared to LLVM NVPTX codegen. (Taichi used to take > 1 minute to compile all CUDA kernels in a program...) Also invoking nvcc sometimes also leads to issues. Therefore OpenCL sharing codegen with CUDA is not really an OpenCL Pro for Taichi as this stage.
My feeling about OpenCL is roughly the same as @masahi: hardware manufacturers are not very active to support its new versions
I'd say it's sad that ROCm supports only Linux (which in my understanding is mostly used by developers). Hopefully, they can support Windows soon and compete with CUDA on all platforms...

(Sorry about my delayed reply. I was busy fixing an urgent bug for v0.4.1...)

Regarding unified memory and Taichi's memory allocator, that's another story. I'll post more thoughts on those tomorrow.

masahi commented 4 years ago

Linking with bitcode on AMD during LLVM codegen is straightforward. I did that for TVM in https://github.com/apache/incubator-tvm/pull/570

yuanming-hu commented 4 years ago

Cool! One less thing to worry about!

yuanming-hu commented 4 years ago

A figure illustrating the current memory management system in Taichi. More details coming tomorrow. I'm considering to support the backends without hardware unified memory as well, depending on how soon every device will support unified memory...

yuanming-hu commented 4 years ago

Considering removing the dependency on unified memory since it seems CUDA on Windows does not have very good support for it... http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf (search for "windows")

Screenshot from 2020-02-02 01-22-49

KLozes commented 4 years ago

I can see how removing the dependency on unified memory would make Taichi more portable. But, do you think it will slow down Taichi an appreciable amount? I imagine data activation would be slower without the page faulting mechanism. Also, the bitmasked data structure wouldn't be possible anymore, which I think should have less memory fragmentation than using pointers for sparsity.

yuanming-hu commented 4 years ago

I think the tradeoff would be between performance and memory fragmentation. If we allocate smaller pieces of memory trunks to reduce fragmentation, there will be more allocation/locks contention, which will harm performance.

I wouldn't worry about performance too much. A bigger issue would be supporting Python-scope (single) tensor element access. Without unified memory reading/writing a single element means one kernel launch. For writing, we need batching the write requests to reduce #kernel launches. For reading we have to do prefetching/caching, otherwise, there will be a lot of kernel launches and cudaMemcpy across PCI-e... This will need a lot of refactoring to realize.

Or we simply disable Python-scope data access, but instruct the users to use numpy instead.

KLozes commented 4 years ago

I don't think disabling Python-scope access would be a big deal. Would be nice to keep keep it for 0D tensors though.

masahi commented 4 years ago

Can't we abstract over memory management details of different backend so that the rest of taichi doesn't have to care if unified memory is supported or not?

I haven't looked at the code in detail, but if disabling unified memory is an option I wonder why we cannot turn unified mem on/off per backend basis.

Or is low level memory management details strongly tied to the rest of the system? If this constraint comes from the need to handle sparse data structures, I am interested. I believe this is not an issue for "dense" domain system such as halide or tvm.

yuanming-hu commented 4 years ago

I don't think disabling Python-scope access would be a big deal. Would be nice to keep keep it for 0D tensors though.

Yeah I think the worst case is that we will still support it, yet at a cost of one kernel launch per read/write in Python.

Can't we abstract over memory management details of different backend so that the rest of taichi doesn't have to care if unified memory is surported or not? I haven't looked at the code in detail, but if disabling unified memory is an option I wonder why we cannot turn unified mem on/off per backend basis.

Good question. We want unified memory because memory management performance is important, especially when we have sparse data structures. We can, of course, go without unified memory, if we support dense structures only, or simply implement sparse data structures with a higher cost.

masahi commented 4 years ago

ok I took a quick a look at the code base and I see some good refactoring opportunities there:

Currently, LLVM GPU means NV and CUDA and the code is hard coded that way. I want to separate the concept of "GPU" from "CUDA" so that AMDGPU backend can share some code with NVPTX.
There are many places in the code base where if/else is used to dispatch into backend specific logics, like below. This quickly becomes awkward as we add more backends in the future (AMDGPU, ARM, upcoming fancy dGPU from Intel..?). I think using abstract interfaces and virtual method calls are better for code hygiene. They can also be used to hide memory management details I mentioned above.
```
if (arch == Arch::x86_64) {
  // do something for x64
} else if (arch == Arch::cuda) {
 // do something else for cuda 
} else {
...
}
```

If you think this is a good idea, I can open a separate issue to discuss some refactoring plans. I want them to be a prereq for AMD work.

yuanming-hu commented 4 years ago

I think these are great ideas! Please feel free to propose a more hygienic refactoring solution. Starting with just two backends (x64 and CUDA), a lot of legacy designs will no longer suit the current tendency of having more and more backends.

The current list of potentially supported backends is here: https://github.com/taichi-dev/taichi/blob/5866eb5148297941e82e9998d48ea2eed0d9bf01/taichi/inc/archs.inc.h

k-ye commented 4 years ago

I'm quite interested to learn how you plan to remove the dependency on unified memory. Currently the Metal backend (#396 ) can only support dense snodes, and I took a look at the dynamic snodes. The memory allocation part seems to happen in request_allocate_aligned, where the CUDA side busy loops waiting for the host side to pass back the new memory address.

For Metal, we may be able to do the host/kernel sync via MTLSharedEvent. However, as far as I know, the host side must bind all the buffers before launching a Metal kernel (no dynamic buffer binding while the kernel is running). Is this the same in OpenGL/Vulkan as well? If so, what would be a good way to pass back new chunks of memories?

yuanming-hu commented 4 years ago

Yeah, currently the host-device communication via request_allocate_aligned is pretty tricky and might lead to some portability issues.

The worst case is that we simply pre-allocate a, say 2GB, buffer, before the kernel launch, and define the cases where a single kernel activates > 2GB memory as undefined behavior on devices where host/kernel sync is not supported...

Going to sleep now and will think more about this tomorrow.

yuanming-hu commented 4 years ago

Unfortunately, after more investigation I think we should not depend on unified memory anymore. Because

Devices such as OpenGL compute shader/AMD GPUs does not support it
Even for NVIDIA GPUs, support on Windows is poor https://devtalk.nvidia.com/default/topic/1030191/cuda-unified-memory-oversubscription-in-windows-systems/
According to my experiments, the unified memory support on NVIDIA Jetson TX2 is also a little mysterious.
In practice I've also found the NVIDIA driver to be problematic under the current request-based memory allocation during a single kernel launch. It seems to me that for some magical reason after a cudaMallocManaged during another kernel launch, CUDA can no longer JIT compile new PTX (loadModule).

This means the memory allocator needs to pre-allocate a huge piece of memory ahead of time.

For Python-scope accesses, we need to either launch a GPU kernel or maintain a software cache for more batched load/store.

archibate commented 4 years ago

However, as far as I know, the host side must bind all the buffers before launching a Metal kernel (no dynamic buffer binding while the kernel is running). Is this the same in OpenGL/Vulkan as well?

But I think bindings should be able when no kernel is running, does that help (dynamic buffer)?

If so, what would be a good way to pass back new chunks of memories?

The only way to pass back data is glMapBuffer, map into somewhere in host memory. The best I could do if snode_reader/writer not x86_64 is only map: args, extra_args, external_ptr, but not root.

github-actions[bot] commented 4 years ago

Warning: The issue has been out-of-update for 50 days, marking stale.

xinyazhang commented 4 years ago

I think it's a great idea! One question I have:
* Is supporting AMD GPUs via AMDGCN better than via OpenCL?
A Vulkan (Metal equivalent on Linux) backend may be a better fit. There is no OSS OpenCL implementation on Linux yet, and amdgpu-pro is not attractive to many users because its quality is actually worse than OSS driver.

yuanming-hu commented 3 years ago

Hi @masahi, we (Taichi Graphics team in Beijing) are getting increasingly interested in creating an AMDGPU backend for Taichi via LLVM. One question we have: how many types of ADM GPUs do hip (AMD's CUDA correspondence) support? Is it true that recent consumer-level GPUs by AMD do not support GPGPU via a system like CUDA?

@yolo2themoon @ailzhang at the company will be working on/helping with this. Of course, your inputs / PRs are very welcome!

archibate commented 3 years ago

Why not just OpenCL for all?

无法顺畅的大口呼吸，是活着的最好证明

---Original--- From: "Yuanming @.> Date: Wed, Aug 18, 2021 17:46 PM To: @.>; Cc: @.**@.>; Subject: Re: [taichi-dev/taichi] AMDGPU backend (#412)

Hi @masahi, we (Taichi Graphics team at Beijing) are getting increasingly interested in creating an AMDGPU backend for Taichi via LLVM. One question we have: how many types of ADM GPUs do hip (AMD's CUDA correspondence) support? Is it true that recent consumer-level GPUs by AMD do not support GPGPU via a system like CUDA?

@yolo2themoon @ailzhang at the company will be working on/helping with this. Of course, your inputs / PRs are very welcome!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

masahi commented 3 years ago

@yuanming-hu Yes, unfortunately Rocm, their CUDA equivalent, has extremely poor support for consumer GPUs. The only ones that are officially supported are Rx Vega 54 or 64 series, which very few people own and impossible to get right now. They killed support for older consumer GPUs (gfx 803 series), including mine, recently. Right now AMD's focus in terms of GPU compute are on data center market and supercomputer use cases.

That said, apparently AMD is working on enabling Rocm support for the current generation consumer GPUs (Navi 2, RX 6000 series). It is rumored to be out this year. If that becomes real, I think rocm backend for Taichi would become indeed interesting. If there is a positive interest from the team, I'm definitely willing to help!

In the mean time, I think having vulkan support already shipped is a great step toward supporting AMDGPU! I'm really excited for this.

archibate commented 3 years ago

So rocm is so poor and almost no trivial people can use it, why not merge my opencl pr now so that even macos and cpu-only can enjoy!

无法顺畅的大口呼吸，是活着的最好证明

---Original--- From: @.> Date: Wed, Aug 18, 2021 18:02 PM To: @.>; Cc: @.**@.>; Subject: Re: [taichi-dev/taichi] AMDGPU backend (#412)

@yuanming-hu Yes, unfortunately Rocm, their CUDA equivalent, has extremely poor support for consumer GPUs. The only ones that are officially supported are Rx Vega 54 or 64 series, which very few people own and impossible to get right now. They killed support for older consumer GPUs (gfx 803 series), including mine, recently. Right now AMD's focus in terms of GPU compute are on data center market and supercomputer use cases.

That said, apparently AMD is working on enabling Rocm support for the current generation consumer GPUs (Navi 2, RX 6000 series). It is rumored to be out this year. If that becomes real, I think rocm backend for Taichi would become indeed interesting! If there is a positive interest from the team, I'm definitely willing to help!

In the mean time, I think having vulkan support already shipped is a great step toward AMDGPU! I'm really excited for this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

jammm commented 2 years ago

Hey guys, I just want to point out that ~ROCM 4.5 now supports RDNA2 GPU's, and it seems to work on RDNA1 GPU's too.~ official support only on windows though. Not out on Linux yet. There's also support on windows from the 21.40 driver, using which you can run Blender 3.0 with the HIP backend https://gpuopen.com/blender-cycles-amd-gpu/ with Linux support coming in blender 3.1 (the devs are part of my team)

I was wondering, given that the latest consumer GPU's are now supported, what else would be required in order to get this backend working? If there's anything I can help with, let me know :)

Disclaimer: I work at AMD, but I'd be helping in my personal capacity and any comments I make are my personal opinions and not AMD's.

bobcao3 commented 2 years ago

Hey guys, I just want to point out that ROCM 4.5 now supports RDNA2 GPU's, and it seems to work on RDNA1 GPU's too. There's also support on windows from the 21.40 driver, using which you can run Blender 3.0 with the HIP backend https://gpuopen.com/blender-cycles-amd-gpu/ with Linux support coming in blender 3.1 (the devs are part of my team)

I was wondering, given that the latest consumer GPU's are now supported, what else would be required in order to get this backend working? If there's anything I can help with, let me know :)

Disclaimer: I work at AMD, but I'd be helping in my personal capacity and any comments I make are my personal opinions and not AMD's.

IIRC HIP is designed to be compatible with CUDA driver API right? So in theory we can just change minimal amount of code to make the cuda backend work on HIP. I don't personally own an RDNA card, is there a chance that Vega will work?

jammm commented 2 years ago

I haven't seen the existing backend so I'm not 100% sure, but if the existing backend consists of CUDA calls then sure, it should be straightforward to port them to HIP. Having said that, if it relies on PTX generation, then it may be more tricky as HIP doesn't generate PTX on AMD GPU's. Maybe someone more knowledgeable on the NVPTX backend can comment on this?

As for Vega, I think it should be supported for Vega VII and Vega 64 according to this https://github.com/RadeonOpenCompute/ROCm#hardware-and-software-support I don't know about support for the other Vega based GPU's though.

masahi commented 2 years ago

The idea is to (1) refactor the existing LLVM GPU backend supporting NVPTX to also support AMDGPU and (2) develop HIP-specific runtime code which maps mostly 1-to-1 with CUDA driver API.

For (1), most of the codegen-related code can be shared between NVPTX and AMDGPU - See for example how similar two backends are in TVM:

NV specific code: https://github.com/apache/tvm/blob/main/src/target/llvm/codegen_nvptx.cc
AMD: https://github.com/apache/tvm/blob/main/src/target/llvm/codegen_amdgpu.cc
Shared LLVM codegen https://github.com/apache/tvm/blob/main/src/target/llvm/codegen_llvm.cc

(2) should not be a lot of work either: For example, in TVM these two files implement all rocm-specific runtime API and together they are about 500 lines of code

k-ye commented 2 years ago

This is great news & thanks for the discussion!

@masahi Regarding the AMDGPU backend, is there any suggestions & good practices in the TVM community for its test coverage? From our experience with Vulkan, implementing the Vulkan backend wasn't so hard, but making sure it always runs did take us a huge amount of efforts (setting up the docker, making sure it covers all the three OS, etc ...)

Refactoring & splitting the LLVM codegen sounds like a good starting point. I'm listing Taichi's LLVM codegen here, in case anyone is interested in starting this efforts :-)

https://github.com/taichi-dev/taichi/blob/master/taichi/codegen/codegen_llvm.cpp

Following our regular approach for adding a new backend, we can just aim at supporting dense fields on AMDGPU as a start.

masahi commented 2 years ago

@masahi Regarding the AMDGPU backend, is there any suggestions & good practices in the TVM community for its test coverage?

We don't run any CI tests on AMD :disappointed: Not many people are interested in rocm and it is very hard to get rocm-capable AMD GPU. I've heard that AMD runs their rocm QA using TVM and they occasionally send fix PRs to keep our rocm support up-to-date. So the only suggestion I can give is to have AMD involved... (which is actually not a joke since that would help improve the ecosystem around rocm).

In general, the test coverage for non-cuda GPU targets in TVM is not good. I'm impressed to hear that the taichi community is investing heavily in vulkan testing, even on multiple OS!

jammm commented 2 years ago

Oops - I just realized that it's not rocm 4.5 that supports RDNA2 just yet, but it's the latest windows drivers which supports HIP on RDNA2. rocm 4.5 on linux still doesn't officially support HIP with rdna2 yet. But there's hope that it'll be out in rocm 5.x coming out soon.

taichi-dev / taichi

AMDGPU backend #412