Closed dvrogozh closed 1 month ago
Driver stack versions on my side:
$ apt-cache show libigc1 | grep Version | head -1
Version: 1.0.17193.16-950~22.04
$ apt-cache show libigdfcl1 | grep Version | head -1
Version: 1.0.17193.16-950~22.04
$ apt-cache show intel-opencl-icd | grep Version | head -1
Version: 24.26.30049.10-950~22.04
$ apt-cache show level-zero | grep Version | head -1
Version: 1.16.15-881~22.04
$ apt-cache show intel-level-zero-gpu | grep Version | head -1
Version: 1.3.30049.10-950~22.04
A couple of items should be noted here that allow reproduction of the issue using a recent intel/llvm based compiler:
- icpx here is not the DPC++ compiler, but rather a symlink to clang++
- using -fsycl-host-compiler=g++, as the host code will not compile with clang.

Debug observations on my side:
- uint8_t type. I.e. on a call to argmax_kernel_impl<uint8_t>(iter).
- noinline for one of the functions used in the kernel definition. See below.

Patch:
--- a/src/ATen/native/xpu/sycl/SharedReduceOps.h
+++ b/src/ATen/native/xpu/sycl/SharedReduceOps.h
@@ -349,6 +349,7 @@ struct MinMaxReductionOps {
     return comp_t{}(a.first, b.first, a.second, b.second) ? a : b;
   }
+  __attribute__((noinline))
   static arg_t translate_idx(arg_t a, int64_t base_idx) {
     return {a.first, a.second + base_idx};
   }
This function: https://github.com/intel/torch-xpu-ops/blob/13955ba5c9116ee5085fb0e4840aabe3d8f2fab4/src/ATen/native/xpu/sycl/SharedReduceOps.h#L352
According to the call stack, the crash happens in the IGC compiler, which is developed at https://github.com/intel/intel-graphics-compiler/. @dvrogozh, did you report this issue to the IGC team?
> @dvrogozh, did you report this issue to the IGC team?
No. That's up to the intel/llvm team to do. However, I am talking to the IGC team right now and will update if there are any findings.
@paigeale from the IGC team has helped to debug the issue and create an IGC-level reproducer. This seems to be an IGC-side bug, so I have filed https://github.com/intel/intel-graphics-compiler/issues/340.
Great. @dvrogozh, I propose that we close this issue and monitor the progress through the IGC issue. Does that plan work for you?
Let's wait a couple of days to see the IGC issue processed. I hope to get a PR with the fix from them.
I am trying to use intel/llvm instead of the dpc++ compiler to build the pytorch XPU backend, which has sycl kernels. There are a few issues with this effort which I have hacks/workarounds for. However, I do see segmentation faults in the ocloc compiler at the device linkage stage for xe-lpg when building the following 2 kernels:
The error being printed is:
Note that the dpc++ compiler version 2024.1 (officially used for the pytorch build following https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html) can build these kernels successfully. So, dpc++ 2024.1 and intel/llvm work against the same ocloc version and gpu stack on my system, with only the latter hitting the segfault. Based on that, I think debugging should start at the intel/llvm level. Note also that I worry dpc++ 2025 will have the same issue; this compiler is not currently verified to work for the pytorch xpu backend.
Call stack with the issue:
Easier reproducer
Full reproducer
Additional effort will be needed to simplify the reproducer. Below are the current reproduction steps, which assume building the pytorch xpu backend.
python3 -m venv ~/pytorch.xpu
Once the above steps to build pytorch are done, it's possible to run these 2 commands to reproduce the issue (they require some generated files from the overall pytorch build and can't be run beforehand). Note that the compilation step uses -fsycl-host-compiler=g++ - that's the way pytorch xpu is built in general.

Observations:
-device pvc works fine, but -device xe-lpg fails with segfault

cc: @mdtoguchi, @paigeale