Hitman4Reason opened 4 months ago
@dkhaldi @YuriPlyakhin, could you please take a look and comment?
@Hitman4Reason, are you interested in optimal performance on an Nvidia GPU or a PVC Intel GPU? If the target is Nvidia, did you migrate the SYCL code from the CUDA wmma performance kernel? How does the CUDA code perform? If the target is Intel GPU, I can provide a performance kernel that achieves ~185 Tflops for a 2Kx2Kx2K GEMM size.
@intel/llvm-reviewers-cuda , could you please also take a look and comment?
@dkhaldi Ideally I would want that for both Nvidia and AMD GPUs. The code I currently run on Nvidia is a simple adaptation of the code found in https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/joint_matrix_gemm_cuda.hpp. From that, I have attempted various changes in the tile processing per thread to get higher reuse (roughly as sketched below), but didn't manage to get higher performance.
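Concretely, the kind of change I tried looks roughly like this (a sketch only, reusing the tile names from the snippet in the issue description below; NC is an illustrative factor, and the group-to-tile mapping must also scale tn by NC): each sub-group accumulates NC adjacent C tiles, so the A fragment loaded per k-step is reused NC times:

```cpp
// Sketch of per-sub-group reuse: NC C-tiles share each A fragment.
constexpr size_t NC = 4; // illustrative tiles-per-sub-group factor
joint_matrix<sub_group, int32_t, use::accumulator, TM, TN> acc[NC];
for (size_t c = 0; c < NC; ++c)
  joint_matrix_fill(sg, acc[c], 0);
for (size_t k = 0; k < K; k += TK) {
  joint_matrix<sub_group, int8_t, use::a, TM, TK, layout::row_major> a;
  joint_matrix_load(sg, a, pA + tm * K + k, K); // loaded once per k-step...
  for (size_t c = 0; c < NC; ++c) {             // ...reused NC times
    joint_matrix<sub_group, int8_t, use::b, TK, TN, layout::row_major> b;
    joint_matrix_load(sg, b, pB + k * N + (tn + c * TN), N);
    joint_matrix_mad(sg, acc[c], a, b, acc[c]);
  }
}
for (size_t c = 0; c < NC; ++c)
  joint_matrix_store(sg, acc[c], pC + tm * N + (tn + c * TN), N,
                     layout::row_major);
```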
AFAIK, https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/joint_matrix_gemm_cuda.hpp is only a functional test. For performance, more blocking needs to happen in the test, and the blocking factors should be tuned for Nvidia hardware. Since you are targeting Nvidia, I would recommend migrating a CUDA performance GEMM kernel that you already know performs acceptably on Nvidia hardware to SYCL. You can use the SYCLomatic tool for that. The result should use SYCL joint matrix for the GEMM part. Adding @JackAKirk in case he has already done such a migration from CUDA for performance. @mehdi-goli, do you happen to have such performant SYCL joint matrix kernels for Nvidia hardware?
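For reference, the CUDA wmma pieces map onto SYCL joint matrix roughly as follows (an informal correspondence, not SYCLomatic's exact output; the exact SYCL signatures depend on the extension version):

```cpp
// CUDA wmma                                  -> SYCL joint matrix (rough map)
// nvcuda::wmma::fragment<matrix_a, M, N, K,  -> joint_matrix<sub_group, T,
//                        T, row_major>            use::a, M, K,
//                                                 layout::row_major>
// wmma::fill_fragment(c_frag, 0)             -> joint_matrix_fill(sg, c, 0)
// wmma::load_matrix_sync(a_frag, p, ld)      -> joint_matrix_load(sg, a, p, ld)
// wmma::mma_sync(c_frag, a, b, c_frag)       -> joint_matrix_mad(sg, c, a, b, c)
// wmma::store_matrix_sync(p, c_frag, ld,     -> joint_matrix_store(sg, c, p, ld,
//                         mem_row_major)          layout::row_major)
```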
@dkhaldi Since it seems no optimal code for Nvidia is available, would it be okay if you provided the implementation you offered for the Intel GPU? I could try to use that as an alternative starting point to develop for Nvidia. That, along with CUDA implementations of GEMM, should help. Thanks in advance.
Let me just make sure it compiles and runs on an Nvidia A100, and then I will send it to you very soon.
@Hitman4Reason, I adapted the Intel GPU performance kernel to work on Nvidia GPUs and got ~70 Tflops. Please use this repo: https://github.com/dkhaldi/sycl_joint_matrix_kernels/blob/main/joint_matrix_bf16_fill_k_cache.cpp. The README file has information on how to compile and run the code. In this kernel, I applied two levels of loop tiling (outlined below) and tuned the tiling factors a little; you can tune them further for your hardware.
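For orientation, the loop structure of that kernel is roughly the following (an outline only, not the actual code; names like KCACHE, TM_GRID, and TN_GRID are placeholders):

```cpp
// Illustrative outline of the two-level tiling (pseudocode in comments):
//
// per work-group:                       // level 1: work-group block of C
//   for kCache = 0 .. K step KCACHE:    //   stage a K-slab of A and B
//     per sub-group:                    // level 2: sub-group grid of tiles
//       for k = kCache .. kCache+KCACHE step TK:
//         load the TM_GRID A fragments for this k once
//         load the TN_GRID B fragments for this k once
//         for m in TM_GRID, n in TN_GRID:
//           joint_matrix_mad(sg, acc[m][n], a[m], b[n], acc[m][n]);
//
// The main tuning knobs are KCACHE, the sub-group tile grid
// (TM_GRID x TN_GRID), and the work-group block size; the best values
// are hardware-specific, which is why Nvidia needs different factors
// than PVC.
```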
Describe the bug
The performance of joint_matrix multiplication is around 10 TOPS with int8 on a device with a theoretical capability of 383 TOPS. This specific code snippet is for the AMD GPU specified, but similar behaviour was observed on an NVIDIA 3080. Is this underperformance expected behaviour, or is it due to the use of sub-optimal logic to compute the tiles? Are there examples available in the documentation that extract higher performance?
To reproduce
CODE:
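A minimal sketch of a naive joint_matrix int8 GEMM in the style of the upstream joint_matrix_hip_gfx90a.cpp test (the 16x16x16 tile shape, the function name, and the in-place joint_matrix_mad signature here are assumptions, not the verbatim attachment):

```cpp
#include <sycl/sycl.hpp>

using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

// Tile shape assumed for int8 on gfx90a; check the joint matrix
// documentation for the shapes your toolchain actually supports.
constexpr size_t TM = 16, TN = 16, TK = 16;

void naive_gemm(queue &q, const int8_t *A, const int8_t *B, int32_t *C,
                size_t M, size_t N, size_t K) {
  // One sub-group (a 64-wide wavefront on CDNA2) computes a single
  // TM x TN tile of C; every A/B fragment is loaded, used once, and
  // discarded, so there is no register-level reuse beyond a single MMA.
  q.parallel_for(
       nd_range<2>({M / TM, (N / TN) * 64}, {1, 64}),
       [=](nd_item<2> it) [[sycl::reqd_sub_group_size(64)]] {
         auto sg = it.get_sub_group();
         size_t tm = it.get_group(0) * TM; // row offset of this C tile
         size_t tn = it.get_group(1) * TN; // column offset of this C tile

         auto pA = address_space_cast<access::address_space::global_space,
                                      access::decorated::no>(A);
         auto pB = address_space_cast<access::address_space::global_space,
                                      access::decorated::no>(B);
         auto pC = address_space_cast<access::address_space::global_space,
                                      access::decorated::no>(C);

         joint_matrix<sub_group, int32_t, use::accumulator, TM, TN> acc;
         joint_matrix_fill(sg, acc, 0);
         for (size_t k = 0; k < K; k += TK) {
           joint_matrix<sub_group, int8_t, use::a, TM, TK,
                        layout::row_major> a;
           joint_matrix<sub_group, int8_t, use::b, TK, TN,
                        layout::row_major> b;
           joint_matrix_load(sg, a, pA + tm * K + k, K);
           joint_matrix_load(sg, b, pB + k * N + tn, N);
           // Older extension versions instead return the result:
           //   acc = joint_matrix_mad(sg, a, b, acc);
           joint_matrix_mad(sg, acc, a, b, acc);
         }
         joint_matrix_store(sg, acc, pC + tm * N + tn, N, layout::row_major);
       })
      .wait();
}
```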
COMPILE: icpx -fsycl -O2 -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -o ver_amd_test.out joint_matrix_hip_gfx90a.cpp
RUN: ONEAPI_DEVICE_SELECTOR="hip:*" SYCL_PI_TRACE=1 ./ver_amd_test.out
The question is whether this significant underperformance is expected behaviour or user error.
Environment
Additional context
No response