rnwang04 commented 2 months ago

Summary

I found on MTL iGPU, if I call FP16 gemm of onemkl (no matter using OneAPI 2024.0 or 2024.2), the program will crash, and if I call it many times, it will cause my machine to freeze directly. However, on ARC, everything is fine.

Version

oneAPI 2024.0 or oneAPI 2024.2 .

Environment

minimal c++ program
Windows 11
icx 2024.0.2 or 2024.2
Hardware: Intel Core Ultra iGPU

Steps to reproduce

#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <iostream>
using namespace sycl;
int main() {

   queue q{gpu_selector_v};
   std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

   const int M = 1024;
   const int N = 11008;
   const int K = 4096;

   float* A_h = new float[M * K];
   float* B_h = new float[K * N];
   float* C_h = new float[M * N];
   // random
   for (int i = 0; i < M * K; ++i) {
       A_h[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
   }
   for (int i = 0; i < K * N; ++i) {
       B_h[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
   }
   // convert input to half
   sycl::half* A_h_half = new sycl::half[M * K];
   sycl::half* B_h_half = new sycl::half[K * N];
   for (int i = 0; i < M * K; ++i) {
       A_h_half[i] = sycl::half(A_h[i]);
   }
   for (int i = 0; i < K * N; ++i) {
       B_h_half[i] = sycl::half(B_h[i]);
   }

   buffer<sycl::half> A(A_h_half, M * K);
   buffer<sycl::half> B(B_h_half, K * N);
   buffer<float> C(C_h, M * N);
   // Use OneMKL to do GEMM
   {
       q.submit([&](handler &h) {
           sycl::accessor A_acc(A, h, sycl::write_only, sycl::no_init);
           sycl::accessor B_acc(B, h, sycl::write_only, sycl::no_init);
           sycl::accessor C_acc(C, h, sycl::write_only, sycl::no_init);
           oneapi::mkl::blas::row_major::gemm(
                q,
                oneapi::mkl::transpose::nontrans,
                oneapi::mkl::transpose::trans,
                M, N, K,
                1.0f, A_acc.get_pointer(), K,
                B_acc.get_pointer(), K,
                0.0f, C_acc.get_pointer(), N);
       }).wait();
   }

   delete[] A_h;
   delete[] B_h;
   delete[] C_h;
   delete[] A_h_half;
   delete[] B_h_half;

   printf("run success!\n");
   return 0;
}

running above script with below command:

# for linux
source /opt/intel/oneapi/setvars.sh
icpx -std=c++17 -fsycl -fopenmp -lpthread -l mkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lmkl_sycl_blas -lmkl_intel_ilp64 -lmkl_tbb_thread -o gemm_fp16 gemm_fp16.cpp

# for windows
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" --force
icx -std=c++17 -fsycl -fopenmp mkl_sycl_blas_dll.lib mkl_intel_ilp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib -o gemm_fp16 gemm_fp16.cpp

Observed behavior

If I run above command on Linux Arc A770, it works fine:

If I run above command on Windows MTL iGPU, it fails and even cause a black screen:

Expected behavior

I hope above FP16 GEMM can work for MTL iGPU. Thanks!

andrewtbarker commented 2 months ago

Thanks for reporting this, this is a known issue in the oneMKL backend and should be fixed in the next release of the product.

rnwang04 commented 2 months ago

Hi @andrewtbarker , thanks for quick reply ! I wonder about how long I can obtain such fix ? Next release means OneAPI 2024.3 or smaller product iterations?

sknepper commented 4 weeks ago

Hello @rnwang04 - this has been fixed in the Intel oneMKL 2024.2.1 release, which is publicly available. Thanks!

rnwang04 commented 3 weeks ago

@sknepper Thanks for your great support ! I will have a try later 😊

oneapi-src / oneMKL

MKL FP16 GEMM crash on MTL iGPU #524