intel / mlir-extensions

Intel® Extension for MLIR. A staging ground for MLIR dialects and tools for Intel devices using the MLIR toolchain.

[Triton] Triton-generated kernel cannot be loaded correctly through the L0 API. #659

Open chengjunlu opened 1 year ago

chengjunlu commented 1 year ago

One very large Triton kernel cannot be loaded correctly through the L0 API. The L0 API call zeKernelCreate returns the error code 0x78000011:

ZE_RESULT_ERROR_INVALID_KERNEL_NAME = 0x78000011,   ///< [Validation] kernel name is not found in the module
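
The test log prints the result code as bare hex digits (78000011), so it can help to map the logged value back to its symbolic name. The following is a minimal sketch, not part of the reproducer: the helper names `parseLoggedCode` and `describeResult` are hypothetical, and only the two result codes whose values are known here (success as 0 and the value quoted above) are covered.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Result values: 0x78000011 is quoted in this issue; 0 is ZE_RESULT_SUCCESS.
// All other Level Zero codes are deliberately left out of this sketch.
constexpr std::uint32_t kZeResultSuccess = 0x0;
constexpr std::uint32_t kZeResultErrorInvalidKernelName = 0x78000011;

// Hypothetical helper: converts the hex digits printed in the log
// (for example "78000011", without a 0x prefix) to a numeric code.
std::uint32_t parseLoggedCode(const std::string& hexDigits) {
  assert(!hexDigits.empty());
  return static_cast<std::uint32_t>(std::stoul(hexDigits, nullptr, 16));
}

// Hypothetical helper: returns the symbolic name for the codes covered above.
std::string describeResult(std::uint32_t code) {
  switch (code) {
    case kZeResultSuccess:
      return "ZE_RESULT_SUCCESS";
    case kZeResultErrorInvalidKernelName:
      return "ZE_RESULT_ERROR_INVALID_KERNEL_NAME";
    default:
      return "unrecognized result code";
  }
}
```

Calling `describeResult(parseLoggedCode("78000011"))` yields the name of the validation error reported by zeKernelCreate.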

We double-checked that the kernel name we pass is the same as the one in the SPIR-V IR.

A simple C++ unit test that reproduces this issue: https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-gpu/tree/chengjun/test_dpcpp Use the following commands to build and run the test from the root directory of the code:

mkdir build
cd ./build/
cmake ../ -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=dpcpp
make all
./test_void_kernel/triton_void_kernel

Result on the ATSM platform:

root device count: 2
compile kernel on device: Intel(R) Arc(TM) A770 Graphics
create kernel:triton__0d1d2d3d4d5d6d7d8d9d10d11d12d13d14d15d16d17d18d19d20d21d22d23d24d25d26d27d28d29d30d31d32d33d34d35d36d37d38d39d40d41d42d43d44d45d46d47d48d49d50d51d52d53d54d55d56d57d58d59d60d61d62d63d64d65d66d67d68d69d70d71d72d73d74d75d76d77d78d79d80d81d82d83d84d85d86d
L0 API error code:78000011

silee2 commented 1 year ago

The kernel loaded without error on integrated graphics:

root device count: 1
compile kernel on device: Intel(R) Iris(R) Xe Graphics
triton__0d1d2d3d4d5d6d7d8d9d10d11d12d13d14d15d16d17d18d19d20d21d22d23d24d25d26d27d28d29d30d31d32d33d34d35d36d37d38d39d40d41d42d43d44d45d46d47d48d49d50d51d52d53d54d55d56d57d58d59d60d61d62d63d64d65d66d67d68d69d70d71d72d73d74d75d76d77d78d79d80d81d82d83d84d85d86d
create kernel:triton__0d1d2d3d4d5d6d7d8d9d10d11d12d13d14d15d16d17d18d19d20d21d22d23d24d25d26d27d28d29d30d31d32d33d34d35d36d37d38d39d40d41d42d43d44d45d46d47d48d49d50d51d52d53d54d55d56d57d58d59d60d61d62d63d64d65d66d67d68d69d70d71d72d73d74d75d76d77d78d79d80d81d82d83d84d85d86d
compiled kernel ptr: 0x4dc1cd0
total kernels:1
  kernel:triton__0d1d2d3d4d5d6d7d8d9d10d11d12d13d14d15d16d17d18d19d20d21d22d23d24d25d26d27d28d29d30d31d32d33d34d35d36d37d38d39d40d41d42d43d44d45d46d47d48d49d50d51d52d53d54d55d56d57d58d59d60d61d62d63d64d65d66d67d68d69d70d71d72d73d74d75d76d77d78d79d80d81d82d83d84d85d86d @0x4dc1cd0

My configuration

silee2@silee2-mobl:~/Projects/frameworks.ai.pytorch.ipex-gpu/build [chengjun/test_dpcpp|⚑ 3]$ apt list level-zero
Listing... Done
level-zero/now 1.11.0 amd64 [installed,local]
silee2@silee2-mobl:~/Projects/frameworks.ai.pytorch.ipex-gpu/build [chengjun/test_dpcpp|⚑ 3]$ dpcpp --version
icpx: warning: use of 'dpcpp' is deprecated and will be removed in a future release. Use 'icpx -fsycl' [-Wdeprecated]
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/silee2/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Configuration file: /home/silee2/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/../bin/icpx.cfg
silee2@silee2-mobl:~/Projects/frameworks.ai.pytorch.ipex-gpu/build [chengjun/test_dpcpp|⚑ 3]$ apt list intel-igc*
Listing... Done
intel-igc-core/now 1.0.14062.11 amd64 [installed,local]
intel-igc-opencl/now 1.0.14062.11 amd64 [installed,local]

The iGPU is from an i5-11300H (https://www.intel.com/content/www/us/en/products/sku/196656/intel-core-i511300h-processor-8m-cache-up-to-4-40-ghz-with-ipu/specifications.html).

chengjunlu commented 1 year ago

The test case fails on both ATSM and the iGPU on Alder Lake.

root device count: 2
compile kernel on device: Intel(R) UHD Graphics 770
create kernel:triton__0d1d2d3d4d5d6d7d8d9d10d11d12d13d14d15d16d17d18d19d20d21d22d23d24d25d26d27d28d29d30d31d32d33d34d35d36d37d38d39d40d41d42d43d44d45d46d47d48d49d50d51d52d53d54d55d56d57d58d59d60d61d62d63d64d65d66d67d68d69d70d71d72d73d74d75d76d77d78d79d80d81d82d83d84d85d86d
L0 API error code:78000011

Here is my configuration:

ii  intel-fw-gpu                               2023.12.2+207                           all          Firmware package for Intel integrated and discrete GPUs
ii  intel-gpu-tools                            1.26-2                                  amd64        tools for debugging the Intel graphics driver
ii  intel-i915-dkms                            1.23.4.15.230307.15.5.17.0.1030+i28-1   all          Out of tree i915 driver for Ubuntu oem kernel version 5.17.
ii  intel-igc-cm                               1.0.176+i600~22.04                      amd64        Intel(R) C for Metal Compiler -- CM Frontend lib
ii  intel-level-zero-gpu                       1.3.26032.26-627~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-media-va-driver-non-free:amd64       23.1.6-622~22.04                        amd64        VAAPI driver for the Intel GEN8+ Graphics family
ii  intel-microcode                            3.20230214.0ubuntu0.22.04.1             amd64        Processor microcode firmware for Intel CPUs
ii  intel-opencl-icd                           23.13.26032.26-627~22.04                amd64        Intel graphics compute runtime for OpenCL
ii  intel-platform-cse-dkms                    2023.11.1-36                            amd64        CSE driver
ii  intel-platform-vsec-dkms                   2023.20.0-3                             amd64        Intel Extended Capabilities auxiliary bus driver
ii  libdrm-intel1:amd64                        2.4.113-2~ubuntu0.22.04.1               amd64        Userspace interface to intel-specific kernel DRM services -- runtime
ii  xserver-xorg-video-intel                   2:2.99.917+git20210115-1                amd64        X.Org X server -- Intel i8xx, i9xx display driver

chengjunlu commented 1 year ago

@silee2, I noticed that your log contains the line triton__0d1d2d3d4d5d6d7d8d9d10d11d12d13d14d15d16d17d18d19d20d21d22d23d24d25d26d27d28d29d30d31d32d33d34d35d36d37d38d39d40d41d42d43d44d45d46d47d48d49d50d51d52d53d54d55d56d57d58d59d60d61d62d63d64d65d66d67d68d69d70d71d72d73d74d75d76d77d78d79d80d81d82d83d84d85d86d

That means the L0 module was loaded correctly on your machine and the kernels in the module can be enumerated.

On my platform, however, the L0 module is created without the kernel.
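
One way to tell these two situations apart is to list the kernel names the module actually exposes before calling zeKernelCreate. The sketch below is an assumption about how such a check could look, not the code of the reproducer. Level Zero's zeModuleGetKernelNames uses a two-call convention (first call returns the count, second call fills the name pointers); to keep the sketch self-contained without the Level Zero headers, that call is passed in as a callable. The helper names `listKernelNames` and `containsKernel` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// GetNamesFn is expected to behave like zeModuleGetKernelNames with the
// module handle already bound:
//   result fn(std::uint32_t* pCount, const char** pNames)
// where a result of 0 means success (ZE_RESULT_SUCCESS).
template <typename GetNamesFn>
std::vector<std::string> listKernelNames(GetNamesFn getNames) {
  std::uint32_t count = 0;
  // First call: a null name array asks only for the number of kernels.
  if (getNames(&count, nullptr) != 0 || count == 0) {
    return {};  // Query failed or the module contains no kernels.
  }
  std::vector<const char*> raw(count, nullptr);
  // Second call: fills in up to `count` pointers to kernel names.
  if (getNames(&count, raw.data()) != 0) {
    return {};
  }
  assert(count <= raw.size());
  std::vector<std::string> names;
  for (std::uint32_t i = 0; i < count; ++i) {
    if (raw[i] != nullptr) {
      names.emplace_back(raw[i]);
    }
  }
  return names;
}

// Returns true if `wanted` is one of the names reported by the module.
bool containsKernel(const std::vector<std::string>& names,
                    const std::string& wanted) {
  return std::find(names.begin(), names.end(), wanted) != names.end();
}

// Assumed usage with a real module handle (requires <level_zero/ze_api.h>):
//   auto names = listKernelNames([&](std::uint32_t* n, const char** p) {
//     return zeModuleGetKernelNames(hModule, n, p);
//   });
```

An empty result would correspond to a module created without the kernel, while a non-empty list that lacks the expected name would point to a naming mismatch instead.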