lamikr / rocm_sdk_builder

How does rocWMMA/hipBLASlt/etc work on gfx103x? #1

Closed AngryLoki closed 1 week ago

AngryLoki commented 1 month ago

Hi, Gentoo enthusiast here,

Did it make any sense to add gfx1030 and gfx1035 to rocWMMA/hipBLASLt? As far as I understand their code, it has hard dependencies on the WMMA or MFMA instruction sets, and RDNA2 (gfx103x) supports neither of them: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX1030.html . This means that at least the tests will fail (if it even compiles).

Additionally, I noticed that other libraries like rocFFT contain code like #if defined(__gfx803__) || defined(__gfx900__) || ... which restricts code paths to specific models and causes crashes if you blindly add gfx103x to AMDGPU_TARGETS. I mentioned this earlier in https://github.com/gentoo/gentoo/pull/33400#issuecomment-1826426048, and the solution I proposed is to build for an officially supported target and patch the runtime so that it tries to load compatible kernels.
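To make the "load compatible kernels" idea concrete, here is a minimal sketch in plain C++ of how such a lookup could work. It is not the actual ROCm runtime code: the function name pick_code_object_target and the fallback table are hypothetical, and the gfx1031/gfx1035 to gfx1030 mappings are illustrative assumptions rather than an official compatibility list.

```cpp
#include <algorithm>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical compatibility table: device ISA -> ISA whose kernels it may reuse.
// The entries are illustrative assumptions, not an official ROCm list.
static const std::map<std::string, std::string> kCompatFallback = {
    {"gfx1031", "gfx1030"},
    {"gfx1035", "gfx1030"},
};

// Return the target whose code object should be loaded for `device`,
// or std::nullopt when neither an exact nor a compatible build exists.
std::optional<std::string> pick_code_object_target(
    const std::string& device, const std::vector<std::string>& built) {
    auto has = [&](const std::string& target) {
        return std::find(built.begin(), built.end(), target) != built.end();
    };
    if (has(device)) {
        return device;  // an exact match always wins
    }
    auto it = kCompatFallback.find(device);
    if (it != kCompatFallback.end() && has(it->second)) {
        return it->second;  // fall back to a compatible target
    }
    return std::nullopt;  // caller should report a missing code object
}
```

With a library built only for gfx1030 and gfx90a, a gfx1035 device would resolve to the gfx1030 code object in this sketch, while a gfx1010 device would get no match and the caller could report the missing code object. The sketch requires C++17 for std::optional.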

lamikr commented 1 month ago

hipBLASLt, rocWMMA, composable kernel and aotriton have parts that are not fully compatible with older GPUs like gfx101x and gfx103x, so some of them build acceleration only for the MI series or the gfx11 series of GPUs.

I think composable kernel is in the best shape. JIT support was available on another branch, so I merged that into the newer version and updated the CMake files. It needs to be built separately, which is why there are two composable kernel builds. It took a really long time to build and run all the composable kernel tests, but I have done it. A couple of tests were failing, at least on gfx1010, because it does not support all the GPU instructions. I have some half-finished patches for those somewhere...

hipBLASLt's tensilelite would need at least a similar type of fallback support as hipBLAS has to get some acceleration. At the moment it contains these Gridbased logic files only for Navi31 and the MI series of GPUs. The Tensile/tuning_doc folder contains documentation on how to generate the logic files, so maybe one could generate gfx10-compatible kernels and config files for hipBLASLt as well. I am not really sure whether it makes sense, or where hipBLASLt is really needed at the moment. Do you remember seeing documentation about the possibility of configuring some ROCm parts to use hipBLASLt instead of hipBLAS by changing an environment variable?

I have not had time to look at rocWMMA for a while.

I have added a one-line printf warning to the ROCm runtime for when it fails to load a code object (CO) and cannot find one for the GPU in use.

lamikr commented 1 week ago

Just an extra note that most of the #ifdef gfx1030 style code in applications is written so that if specific instructions like WMMA are available, the code is optimized using them, and for GPUs missing those instructions the same functionality is built using other, possibly slower instructions.
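The guard-with-fallback pattern described above might look roughly like the following sketch. It is plain host C++ so it compiles anywhere: the macro DEMO_HAS_WMMA, the Tile type, and both mma functions are hypothetical names for illustration. The "fast" branch is only a placeholder where real code would use WMMA builtins, and the choice of gfx1100 to gfx1102 as the WMMA-capable targets is an assumption for the example.

```cpp
#include <array>
#include <cstddef>

// The compiler defines macros such as __gfx1100__ when compiling device code
// for that target. On a host-only build none of them are defined.
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
#define DEMO_HAS_WMMA 1  // hypothetical helper macro for this sketch
#endif

constexpr std::size_t N = 4;
using Tile = std::array<float, N * N>;  // row-major N x N tile

// Generic path: c += a * b using ordinary multiply-add instructions.
inline void mma_generic(const Tile& a, const Tile& b, Tile& c) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

// Same functionality on every target; only the implementation differs.
inline void mma(const Tile& a, const Tile& b, Tile& c) {
#if defined(DEMO_HAS_WMMA)
    // Placeholder: a real implementation would call WMMA builtins here.
    mma_generic(a, b, c);
#else
    // Targets without WMMA (e.g. gfx103x) take the slower generic path.
    mma_generic(a, b, c);
#endif
}
```

The problem case from the first comment is a guard that has no #else branch, or one that leaves the kernel body empty. Adding a target to AMDGPU_TARGETS then compiles, but does nothing useful at run time.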

I think this needs to be addressed case by case. So if you find that some function is not working on a certain GPU, we could then try to find a way to build that module so that it also works on that GPU.