intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

Segfault building pytorch sycl kernels with intel/llvm: binary Instruction seen with illegal int type #15082

Closed dvrogozh closed 1 month ago

dvrogozh commented 2 months ago

I am trying to use intel/llvm instead of the dpc++ compiler to build the pytorch XPU backend, which has sycl kernels. I ran into a few issues along the way, for which I have hacks/workarounds. However, I do see segmentation faults in the ocloc compiler at the device link stage for xe-lpg when building the following 2 kernels:

The error being printed is:

Binary Instruction seen with illegal int type. Legalization support missing. Inst opcode:25[0]: /lib/x86_64-linux-gnu/libocloc.so(+0xc1a64) [0x7fcca7221a64]

Note that dpc++ compiler version 2024.1 (officially used for the pytorch build following https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html) can build these kernels successfully. So dpc++ 2024.1 and intel/llvm run against the same ocloc version and gpu stack on my system, and only the latter segfaults. Based on that, I think debugging should start at the intel/llvm level. Note also that I worry dpc++ 2025 will have the same issue; this compiler is not currently verified to work for the pytorch xpu backend.

Call stack with the issue:

icpx -fPIC -fsycl -fpreview-breaking-changes -fsycl-targets=spir64_gen,spir64 -fno-sycl-unnamed-lambda -sycl-std=2020 -fhonor-nans -fhonor-infinities -fno-associative-math -fno-approx-func -Wno-absolute-value -D__INTEL_PREVIEW_BREAKING_CHANGES -D_GLIBCXX_USE_CXX11_ABI=1 -fsycl-fp64-conv-emu -fsycl-max-parallel-link-jobs=208 -fsycl-targets=spir64_gen,spir64 -fsycl-link out.o -Xs -device\ xe-lpg\ -options\ '\ -cl-poison-unsupported-fp64-kernels\ -cl-intel-enable-auto-large-GRF-mode\ -cl-fp32-correctly-rounded-divide-sqrt' -o a.o
llvm-foreach: adjusted number of threads to 160 (max safe available).
llvm-foreach: adjusted number of threads to 160 (max safe available).
Compilation from IR - skipping loading of FCL
Compilation from IR - skipping loading of FCL

warning: kernel _ZTSN2at6native3xpu12ReduceKernelILi1ENS1_8ReduceOpIN3c104HalfENS1_9ArgMaxOpsIfEEjlLi4EEEEE  compiled SIMD8 allocated 128 regs and spilled around 8

warning: kernel _ZTSN2at6native3xpu12ReduceKernelILi4ENS1_8ReduceOpIN3c104HalfENS1_9ArgMaxOpsIfEEjlLi4EEEEE  compiled SIMD8 allocated 128 regs and spilled around 54

Build succeeded for : arl-s.
Compilation from IR - skipping loading of FCL

warning: kernel _ZTSN2at6native3xpu12ReduceKernelILi1ENS1_8ReduceOpIN3c104HalfENS1_9ArgMaxOpsIfEEjlLi4EEEEE  compiled SIMD8 allocated 128 regs and spilled around 8

warning: kernel _ZTSN2at6native3xpu12ReduceKernelILi4ENS1_8ReduceOpIN3c104HalfENS1_9ArgMaxOpsIfEEjlLi4EEEEE  compiled SIMD8 allocated 128 regs and spilled around 54

Build succeeded for : mtl-h.
Binary Instruction seen with illegal int type. Legalization support missing. Inst opcode:25[0]: /lib/x86_64-linux-gnu/libocloc.so(+0xc1a64) [0x7fcca7221a64]
[1]: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fcca6f79520]
[2]: /lib/x86_64-linux-gnu/libigc.so.1(+0x96912f) [0x7fcca21a712f]
[3]: /lib/x86_64-linux-gnu/libigc.so.1(+0xd014c9) [0x7fcca253f4c9]
[4]: /lib/x86_64-linux-gnu/libigc.so.1(+0xd08bad) [0x7fcca2546bad]
[5]: /lib/x86_64-linux-gnu/libigc.so.1(_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE+0x2be) [0x7fcca2fe51ae]
[6]: /lib/x86_64-linux-gnu/libigc.so.1(_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE+0x34) [0x7fcca2fe54d4]
[7]: /lib/x86_64-linux-gnu/libigc.so.1(_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE+0x32c) [0x7fcca2fe626c]
[8]: /lib/x86_64-linux-gnu/libigc.so.1(+0xc861b2) [0x7fcca24c41b2]
[9]: /lib/x86_64-linux-gnu/libigc.so.1(+0x90a55e) [0x7fcca214855e]
[10]: /lib/x86_64-linux-gnu/libigc.so.1(+0xb6b61b) [0x7fcca23a961b]
[11]: /lib/x86_64-linux-gnu/libigc.so.1(+0x90cf27) [0x7fcca214af27]
[12]: /lib/x86_64-linux-gnu/libigc.so.1(+0x984ccd) [0x7fcca21c2ccd]
[13]: /lib/x86_64-linux-gnu/libigc.so.1(+0x9861de) [0x7fcca21c41de]
[14]: /lib/x86_64-linux-gnu/libocloc.so(+0x9a386) [0x7fcca71fa386]
[15]: /lib/x86_64-linux-gnu/libocloc.so(+0xc3acf) [0x7fcca7223acf]
[16]: /lib/x86_64-linux-gnu/libocloc.so(+0xc1cc8) [0x7fcca7221cc8]
[17]: /lib/x86_64-linux-gnu/libocloc.so(+0x89ca9) [0x7fcca71e9ca9]
[18]: /lib/x86_64-linux-gnu/libocloc.so(oclocInvoke+0x8ee) [0x7fcca71eb81e]
[19]: /usr/bin/ocloc(+0x637) [0x5610826ac637]
[20]: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fcca6f60d90]
[21]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fcca6f60e40]
[22]: /usr/bin/ocloc(+0x665) [0x5610826ac665]
llvm-foreach: Segmentation fault (core dumped)
icpx: error: gen compiler command failed with exit code 254 (use -v to see invocation)
clang version 19.0.0git (https://github.com/intel/llvm.git e16b0a434b14089140e4bd76f27adc18c9d782ae)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/dvrogozh/git/llvm/build/install/bin
Build config: +assertions
icpx: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.

Easier reproducer

  1. Build intel/llvm
  2. Get pre-compiled faulty kernel: https://github.com/dvrogozh/pytorch/blob/intel-llvm/ReduceArgMaxKernel_preproc.ii
  3. Try to link; this should reproduce the failure (I reduced the link options to a minimum):
    icpx -fsycl -sycl-std=2020 -fsycl-targets=spir64_gen,spir64 -fsycl-link ReduceArgMaxKernel_preproc.ii -Xs "-device xe-lpg" -o a.o

Full reproducer

Additional effort will be needed to simplify the reproducer. Below are the current reproduction steps, which assume building the pytorch xpu backend.

Once the above steps to build pytorch are done, the following 2 commands reproduce the issue (they require some generated files from the overall pytorch build and can't be run beforehand). Note that the compilation step uses -fsycl-host-compiler=g++; that is how pytorch xpu is built in general.

# compile step
/home/dvrogozh/git/install/bin/icpx -MD -MF deps.o.SYCL-depend -c /home/dvrogozh/git/pytorch/pytorch-clang/third_party/torch-xpu-ops/src/ATen/native/xpu/sycl/ReduceArgMaxKernel.cpp -o out.o -I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/pytorch/pytorch-clang/build/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/build -I/home/dvrogozh/git/pytorch/pytorch-clang -I/home/dvrogozh/git/pytorch/pytorch-clang/build/third_party/gloo -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/gloo -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/tensorpipe/third_party/libuv/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/googletest/googlemock/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/googletest/googletest/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/protobuf/src -I/opt/intel/oneapi/mkl/latest/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/XNNPACK/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/benchmark/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ittapi/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/eigen -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/onnx -I/home/dvrogozh/git/pytorch/pytorch-clang/build/third_party/onnx -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/opt/intel/oneapi/mkl/latest/include -I/home/dvrogozh/git/pytorch/pytorch-clang/nlohmann -I/home/dvrogozh/git/pytorch/pytorch-clang/INTERFACE -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/nlohmann/include -I/home/dvrogozh/git/pytorch/pytorch-clang/torch/csrc/api 
-I/home/dvrogozh/git/pytorch/pytorch-clang/torch/csrc/api/include -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/build/caffe2/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/build/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/.. -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/miniz-2.1.0 -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu/detail -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include -I/home/dvrogozh/git/pytorch/pytorch-clang/build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj-build/include -I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu/detail -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include -I/home/dvrogozh/git/pytorch/pytorch-clang/build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj-build/include -I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/torch-xpu-ops/src -I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/install/include/sycl -fsycl-host-compiler=/usr/bin/c++ "-fsycl-host-compiler-options=-I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/pytorch/pytorch-clang/build/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/build -I/home/dvrogozh/git/pytorch/pytorch-clang 
-I/home/dvrogozh/git/pytorch/pytorch-clang/build/third_party/gloo -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/gloo -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/tensorpipe/third_party/libuv/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/googletest/googlemock/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/googletest/googletest/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/protobuf/src -I/opt/intel/oneapi/mkl/latest/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/XNNPACK/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/benchmark/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ittapi/include -I/home/dvrogozh/git/pytorch/pytorch-clang/cmake/../third_party/eigen -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/onnx -I/home/dvrogozh/git/pytorch/pytorch-clang/build/third_party/onnx -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/include -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/opt/intel/oneapi/mkl/latest/include -I/home/dvrogozh/git/pytorch/pytorch-clang/nlohmann -I/home/dvrogozh/git/pytorch/pytorch-clang/INTERFACE -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/nlohmann/include -I/home/dvrogozh/git/pytorch/pytorch-clang/torch/csrc/api -I/home/dvrogozh/git/pytorch/pytorch-clang/torch/csrc/api/include -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/build/caffe2/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/build/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/.. 
-I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/miniz-2.1.0 -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu/detail -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include -I/home/dvrogozh/git/pytorch/pytorch-clang/build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj-build/include -I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/native/mkldnn/xpu/detail -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/ideep/mkl-dnn/include -I/home/dvrogozh/git/pytorch/pytorch-clang/build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj-build/include -I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/pytorch/pytorch-clang/aten/src/ATen/xpu -I/home/dvrogozh/git/pytorch/pytorch-clang/third_party/torch-xpu-ops/src -I/home/dvrogozh/git/install/include -I/home/dvrogozh/git/install/include/sycl -I/home/dvrogozh/git/install/include/sycl -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -D__INTEL_PREVIEW_BREAKING_CHANGES -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno 
-fno-trapping-math -Werror=format -DUSE_XPU -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -std=c++17 -Wno-deprecated-declarations -Wno-deprecated -Wno-attributes -Wno-sign-compare -DONNX_ML=1 -DONNXIFI_ENABLE_EXT=1 -DONNX_NAMESPACE=onnx_torch -DIDEEP_USE_MKL -DHAVE_MMAP=1 -D_FILE_OFFSET_BITS=64 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DUSE_EXTERNAL_MZCRC -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DFLASHATTENTION_DISABLE_ALIBI " -fsycl -fpreview-breaking-changes -fsycl-targets=spir64_gen,spir64 -fno-sycl-unnamed-lambda -sycl-std=2020 -fhonor-nans -fhonor-infinities -fno-associative-math -fno-approx-func -Wno-absolute-value -D__INTEL_PREVIEW_BREAKING_CHANGES -D_GLIBCXX_USE_CXX11_ABI=1 -fsycl-fp64-conv-emu -DONNX_ML=1 -DONNXIFI_ENABLE_EXT=1 -DONNX_NAMESPACE=onnx_torch -DIDEEP_USE_MKL -DHAVE_MMAP=1 -D_FILE_OFFSET_BITS=64 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DUSE_EXTERNAL_MZCRC -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DFLASHATTENTION_DISABLE_ALIBI

# device link step
icpx -fPIC -fsycl -fpreview-breaking-changes -fsycl-targets=spir64_gen,spir64 -fno-sycl-unnamed-lambda -sycl-std=2020 -fhonor-nans -fhonor-infinities -fno-associative-math -fno-approx-func -Wno-absolute-value -D__INTEL_PREVIEW_BREAKING_CHANGES -D_GLIBCXX_USE_CXX11_ABI=1 -fsycl-fp64-conv-emu -fsycl-max-parallel-link-jobs=208 -fsycl-targets=spir64_gen,spir64 -fsycl-link out.o -Xs -device\ pvc\ -options\ '\ -cl-poison-unsupported-fp64-kernels\ -cl-intel-enable-auto-large-GRF-mode\ -cl-fp32-correctly-rounded-divide-sqrt' -o a.o

Observations:

cc: @mdtoguchi, @paigeale

dvrogozh commented 2 months ago

Driver stack versions on my side:

$ apt-cache show libigc1 | grep Version | head -1
Version: 1.0.17193.16-950~22.04

$ apt-cache show libigdfcl1 | grep Version | head -1
Version: 1.0.17193.16-950~22.04

$ apt-cache show intel-opencl-icd | grep Version | head -1
Version: 24.26.30049.10-950~22.04

$ apt-cache show level-zero | grep Version | head -1
Version: 1.16.15-881~22.04

$ apt-cache show intel-level-zero-gpu | grep Version | head -1
Version: 1.3.30049.10-950~22.04

mdtoguchi commented 2 months ago

A couple of items should be noted here that allow the issue to be reproduced with a recent intel/llvm-based compiler:

dvrogozh commented 2 months ago

Debug observations on my side:

  1. As you can see, the generated kernel is a "switch" kernel where multiple data types are handled. The issue happens only when handling the uint8_t type, i.e. on a call to argmax_kernel_impl<uint8_t>(iter).
  2. I don't see the issue after forcing noinline for one of the functions used in the kernel definition. See the patch below.

Patch:

--- a/src/ATen/native/xpu/sycl/SharedReduceOps.h
+++ b/src/ATen/native/xpu/sycl/SharedReduceOps.h
@@ -349,6 +349,7 @@ struct MinMaxReductionOps {
     return comp_t{}(a.first, b.first, a.second, b.second) ? a : b;
   }

+  __attribute__((noinline))
   static arg_t translate_idx(arg_t a, int64_t base_idx) {
     return {a.first, a.second + base_idx};
   }

This function: https://github.com/intel/torch-xpu-ops/blob/13955ba5c9116ee5085fb0e4840aabe3d8f2fab4/src/ATen/native/xpu/sycl/SharedReduceOps.h#L352

Called from: https://github.com/intel/torch-xpu-ops/blob/13955ba5c9116ee5085fb0e4840aabe3d8f2fab4/src/ATen/native/xpu/sycl/Reduce.h#L999
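
To make the workaround above concrete, here is a minimal host-side C++ sketch. It assumes a simplified stand-in for `MinMaxReductionOps`: the names `ArgReduceSketch`, `combine` and `argmax_with_offset`, and the `std::pair` layout, are hypothetical and are not the torch-xpu-ops code. It only shows where the attribute from the patch goes and that the attribute does not change the computed index. It does not reproduce the ocloc crash, and `__attribute__((noinline))` assumes a GCC/Clang-compatible compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical, simplified stand-in for MinMaxReductionOps (not the real code).
template <typename scalar_t>
struct ArgReduceSketch {
  using arg_t = std::pair<scalar_t, std::int64_t>;  // (value, index)

  // Keep the larger value; on ties keep the smaller index.
  static arg_t combine(arg_t a, arg_t b) {
    if (a.first != b.first) return a.first > b.first ? a : b;
    return a.second < b.second ? a : b;
  }

  // Same placement as in the patch: the helper is kept out of line,
  // so the compiler does not fold it into its caller.
  __attribute__((noinline))
  static arg_t translate_idx(arg_t a, std::int64_t base_idx) {
    return {a.first, a.second + base_idx};
  }
};

// Sequential argmax over n elements, then shift the index by base_idx.
template <typename scalar_t>
std::int64_t argmax_with_offset(const scalar_t* data, std::size_t n,
                                std::int64_t base_idx) {
  assert(n > 0);
  using ops = ArgReduceSketch<scalar_t>;
  typename ops::arg_t best{data[0], 0};
  for (std::size_t i = 1; i < n; ++i) {
    best = ops::combine(best, {data[i], static_cast<std::int64_t>(i)});
  }
  return ops::translate_idx(best, base_idx).second;
}
```

For example, calling `argmax_with_offset` on the `std::uint8_t` values `{3, 9, 9, 1}` with `base_idx` 100 yields 101: the first 9 wins the tie at index 1, and the offset is added afterwards.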

bader commented 2 months ago

According to the call stack, the crash happens in IGC compiler, which is being developed in https://github.com/intel/intel-graphics-compiler/. @dvrogozh, did you report this issue to the IGC team?
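
One quick way to check which library the frames of such a backtrace belong to is to tally them per module. The sketch below is a hypothetical helper (`frame_module` and `count_frames` are made-up names, not part of ocloc or IGC) and assumes the frame format seen in this report, `[N]: /path/to/lib(+offset) [address]`. It searches for `]: /` anywhere in the line because frame `[0]` in the log above is fused onto the end of the error message.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Return the module basename of a backtrace line such as
//   "[2]: /lib/x86_64-linux-gnu/libigc.so.1(+0x96912f) [0x7fcca21a712f]"
// or an empty string when the line does not look like a frame.
std::string frame_module(const std::string& line) {
  const std::size_t marker = line.find("]: /");
  if (marker == std::string::npos) return "";
  const std::size_t begin = marker + 3;             // first character of the path
  const std::size_t paren = line.find('(', begin);  // path ends at "(+off)" or "(sym+off)"
  if (paren == std::string::npos) return "";
  const std::string path = line.substr(begin, paren - begin);
  return path.substr(path.rfind('/') + 1);
}

// Count how many frames of a multi-line backtrace fall into each module.
std::map<std::string, int> count_frames(const std::string& backtrace) {
  std::map<std::string, int> counts;
  std::istringstream in(backtrace);
  std::string line;
  while (std::getline(in, line)) {
    const std::string name = frame_module(line);
    if (!name.empty()) ++counts[name];
  }
  return counts;
}
```

Feeding the pasted log to `count_frames` gives a per-library frame count; lines that are not frames, such as the `llvm-foreach` and `icpx` diagnostics, are skipped.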

dvrogozh commented 2 months ago

@dvrogozh, did you report this issue to the IGC team?

No. That's up to the intel/llvm team to do. However, I am talking to the IGC team right now and will update here if there are any findings.

dvrogozh commented 2 months ago

@paigeale from the IGC team has helped to debug the issue and create an IGC-level reproducer. This seems to be an IGC-side bug, so I have filed https://github.com/intel/intel-graphics-compiler/issues/340.

bader commented 2 months ago

Great. @dvrogozh, I propose that we close this issue and monitor the progress through the IGC issue. Does that plan work for you?

dvrogozh commented 2 months ago

Let's wait a couple of days to see the IGC issue processed. I hope to get a PR with the fix from them.

dvrogozh commented 1 month ago

Fixed by https://github.com/intel/intel-graphics-compiler/commit/66d001e52c8e496f51c2572acc2377ca8f4e9e50