intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

Significant underperformance on joint matrix multiplication #14683

Open Hitman4Reason opened 4 months ago

Hitman4Reason commented 4 months ago

Describe the bug

The performance of joint_matrix multiplication is around 10 TOPS with int8 on a device with a theoretical capability of 383 TOPS. The code snippet below is for the AMD GPU specified, but similar behaviour was observed on an NVIDIA 3080. Is this underperformance expected behaviour, or is it due to the use of sub-optimal logic to compute the tiles? Are there examples available in the documentation that extract higher performance?

To reproduce

CODE:

// NOTE: the snippet below assumes the usual SYCL setup that the original
// report omits, i.e. roughly:
//   #include <sycl/sycl.hpp>
//   using namespace sycl;
//   using namespace sycl::ext::oneapi::experimental::matrix;
//   queue Q;

#define SQUARE 32768
constexpr size_t sizeSquare = SQUARE;       // assumed: matrices are SQUARE x SQUARE
constexpr int N_THREADS_PER_MATRIX_OP = 64; // sub-group size per matrix operation

// Device allocations: int8 inputs A/B, int32 accumulator output.
int32_t *out = malloc_device<int32_t>(sizeSquare * sizeSquare, Q);
int8_t *A = malloc_device<int8_t>(sizeSquare * sizeSquare, Q);
int8_t *B = malloc_device<int8_t>(sizeSquare * sizeSquare, Q);

auto Acast = address_space_cast<sycl::access::address_space::global_space,
                                sycl::access::decorated::no>(A);
auto Bcast = address_space_cast<sycl::access::address_space::global_space,
                                sycl::access::decorated::no>(B);
auto outCast = address_space_cast<sycl::access::address_space::global_space,
                                  sycl::access::decorated::no>(out);

// joint_matrix tile shape.
static constexpr size_t M = 16, N = 16, K = 16;

// Number of M x K / K x N / M x N sub-tiles along each dimension.
int Sub_Tiles_M = SQUARE / M;
int Sub_Tiles_N = SQUARE / N;
int Sub_Tiles_K = SQUARE / K;

size_t Big_M = M * Sub_Tiles_M, Big_N = N * Sub_Tiles_N, Big_K = K * Sub_Tiles_K;
{

    Q.submit([&](handler &cgh)
             {

  range<2> LocalRange = {1, N_THREADS_PER_MATRIX_OP}; 
  range<2> GlobalRange = {static_cast<size_t>(Sub_Tiles_M), static_cast<size_t>(Sub_Tiles_N * N_THREADS_PER_MATRIX_OP)};

  cgh.parallel_for<>(
      nd_range<2>(GlobalRange, LocalRange), [=](nd_item<2> item) {
        sycl::sub_group sg = item.get_sub_group();
        // row id of current submatrix of BIG C matrix
        const auto m = item.get_group().get_group_id()[0];
        // column id of current submatrix of BIG C matrix
        const auto n = item.get_group().get_group_id()[1];

        joint_matrix<sycl::sub_group, std::remove_const_t<int8_t>, use::a, M, K, layout::col_major> sub_a;

        joint_matrix<sycl::sub_group, std::remove_const_t<int8_t>, use::b, K, N, layout::row_major> sub_b;

        joint_matrix<sycl::sub_group, std::remove_const_t<int32_t>, use::accumulator, M, N> sub_c;

        joint_matrix_fill(sg, sub_c, 0);

        // k = row/col id of current submatrix of BIG A/B matrices
        for (int k = 0; k < Sub_Tiles_K; k++) {
          joint_matrix_load(
              sg, sub_a,
              Acast +
                  (k * K) + (m * M * Big_K),
              Big_K);

          joint_matrix_load(
              sg, sub_b,
              Bcast +
                  (k * K * Big_N) + (n * N),
              Big_N);

          joint_matrix_mad(sg, sub_c, sub_a, sub_b, sub_c);
        }

        joint_matrix_store(
            sg, sub_c,
            outCast +
                (m * M) * Big_N + n * N,
            Big_N, layout::row_major);
      }); });
    Q.wait();
}
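
For reference, the ~10 TOPS figure above is presumably derived from wall-clock timing around the submission; a minimal sketch of that measurement (the timing code is not part of the original repro, variable names are illustrative):

// (requires <chrono> and <iostream>)
auto t0 = std::chrono::steady_clock::now();
// ... the Q.submit(...) block shown above ...
Q.wait();
auto t1 = std::chrono::steady_clock::now();
double seconds = std::chrono::duration<double>(t1 - t0).count();

// A SQUARE x SQUARE x SQUARE GEMM performs 2 * SQUARE^3 operations
// (each multiply-accumulate counts as two ops).
double ops = 2.0 * double(SQUARE) * double(SQUARE) * double(SQUARE);
std::cout << "achieved: " << ops / seconds / 1e12 << " TOPS\n";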

COMPILE: icpx -fsycl -O2 -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -o ver_amd_test.out joint_matrix_hip_gfx90a.cpp

RUN: ONEAPI_DEVICE_SELECTOR="hip:*" SYCL_PI_TRACE=1 ./ver_amd_test.out

The question is whether this significant underperformance is expected behaviour or user error.


dm-vodopyanov commented 4 months ago

@dkhaldi @YuriPlyakhin, could you please take a look and comment?

dkhaldi commented 4 months ago

@Hitman4Reason, are you interested in optimal performance on an Nvidia GPU or a PVC Intel GPU? If the target is Nvidia, did you migrate the SYCL code from the CUDA wmma performance kernel? How does the CUDA code perform? If the target is an Intel GPU, I can provide a performance kernel that reaches ~185 TFLOPS for a 2k x 2k x 2k GEMM.

YuriPlyakhin commented 4 months ago

@intel/llvm-reviewers-cuda , could you please also take a look and comment?

Hitman4Reason commented 4 months ago

> @Hitman4Reason, are you interested in optimal performance on an Nvidia GPU or a PVC Intel GPU? If the target is Nvidia, did you migrate the SYCL code from the CUDA wmma performance kernel? How does the CUDA code perform? If the target is an Intel GPU, I can provide a performance kernel that reaches ~185 TFLOPS for a 2k x 2k x 2k GEMM.

@dkhaldi Ideally I would want that for both Nvidia and AMD GPUs. The code I currently run on Nvidia is a simple adaptation of https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/joint_matrix_gemm_cuda.hpp. Starting from that, I have attempted various changes to the tile processing per thread to get higher reuse, but did not manage higher performance.
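
For context, one common form of "higher reuse" is to keep several accumulator tiles per sub-group so that each loaded A fragment feeds multiple mads per K step. A rough sketch of that variant of the inner loop, mirroring the repro above (the TN factor and the indexing are illustrative, not the exact code that was tried):

constexpr int TN = 4; // hypothetical: C tiles owned by each sub-group along N

joint_matrix<sycl::sub_group, int8_t, use::a, M, K, layout::col_major> sub_a;
joint_matrix<sycl::sub_group, int8_t, use::b, K, N, layout::row_major> sub_b;
joint_matrix<sycl::sub_group, int32_t, use::accumulator, M, N> sub_c[TN];

for (int t = 0; t < TN; t++)
  joint_matrix_fill(sg, sub_c[t], 0);

for (int k = 0; k < Sub_Tiles_K; k++) {
  // One A load is now reused for TN mads.
  joint_matrix_load(sg, sub_a, Acast + (k * K) + (m * M * Big_K), Big_K);
  for (int t = 0; t < TN; t++) {
    joint_matrix_load(sg, sub_b,
                      Bcast + (k * K * Big_N) + ((n * TN + t) * N), Big_N);
    joint_matrix_mad(sg, sub_c[t], sub_a, sub_b, sub_c[t]);
  }
}
// The global range along dimension 1 would shrink by a factor of TN accordingly.

On its own this only adds register-level reuse; without also staging A/B blocks in shared local memory the kernel can stay bandwidth-bound, which may be why such changes did not help.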

dkhaldi commented 4 months ago

AFAIK, https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/joint_matrix_gemm_cuda.hpp is only a functional test. For performance, more blocking needs to happen in the test, and the blocking factors should be tuned for Nvidia hardware. Since you are targeting Nvidia, I would recommend migrating a CUDA performance GEMM kernel that you know gives acceptable performance on Nvidia hardware to SYCL; you can use the SYCLomatic tool for that. The migrated code should use SYCL joint_matrix for the GEMM part. Adding @JackAKirk in case he has already done such a migration from CUDA code for performance. @mehdi-goli, do you happen to have such performant SYCL joint_matrix kernels for Nvidia hardware?
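
For reference, a SYCLomatic migration plus a CUDA-backend build would typically look something like the lines below; the file names, directories, and the sm_86 arch (RTX 3080) are illustrative, not taken from this thread.

MIGRATE: dpct --in-root=./cuda_src --out-root=./sycl_src gemm_wmma.cu

COMPILE: icpx -fsycl -O2 -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -o ver_cuda_test.out ./sycl_src/gemm_wmma.dp.cpp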

Hitman4Reason commented 4 months ago

@dkhaldi Since it seems no optimal code for Nvidia is available, would it be okay if you provided the implementation you offered for the Intel GPU? I could try to use it as an alternative starting point for developing an Nvidia version. That, along with CUDA implementations of GEMM, should help. Thanks in advance.

dkhaldi commented 4 months ago

Let me just make sure it compiles and runs on an Nvidia A100, and then I will send it to you very soon.

dkhaldi commented 4 months ago

@Hitman4Reason, I adapted the Intel GPU performance kernel to work on Nvidia GPUs and got ~70 TFLOPS. Please use this repo: https://github.com/dkhaldi/sycl_joint_matrix_kernels/blob/main/joint_matrix_bf16_fill_k_cache.cpp. The README file has information on how to compile and run the code. In this kernel, I applied two levels of loop tiling and tuned the tiling factors a little bit. You can further tune the kernel by: