intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

LuxMark 3.1 segfaults on Skylake IGP #218

Closed aufkrawall closed 4 years ago

aufkrawall commented 5 years ago

When I try to run LuxMark 3.1 on the i7 6700k IGP, it crashes:

./luxmark
free(): invalid pointer
./luxmark: line 12: 59158 Aborted (core dumped) ./luxmark.bin "$@"

This is on Arch with:
linux 5.3.4
intel-compute-runtime 19.37.14191
intel-graphics-compiler 1:1.0.11

clinfo.log

The IGP is used only for CL; display output and desktop rendering run on the dGPU (Polaris). On Polaris, LuxMark works with Clover, OCL-Orca, OCL-PAL and OCL-ROCm.

JacekDanecki commented 5 years ago

Can you provide stack trace from gdb?

aufkrawall commented 5 years ago

I'm on Arch, I'd have to recompile with debug symbols. Is it enough when intel-compute-runtime includes them?

jdanecki commented 5 years ago

Let's start with luxmark itself. Can you check dmesg to see whether there was a GPU hang?
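
A minimal sketch of that check, assuming the i915 kernel driver reports a hang with a log line containing "GPU HANG" (its usual wording). The function name `find_gpu_hangs` is made up for this example, and reading the kernel log may require root on some systems.

```shell
# Hypothetical helper: print kernel-log lines that mention a GPU hang.
# Reads the log from stdin so it can be fed from dmesg or from a saved file.
find_gpu_hangs() {
    # grep exits with status 1 when nothing matches; treat that as "no hangs found"
    grep -i 'gpu hang' || true
}

# Usage (may require root):
#   dmesg | find_gpu_hangs
```

Empty output suggests the crash is a user-space problem rather than a hang reported by the kernel.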

jdanecki commented 5 years ago

I've reproduced this issue with Luxmark 3.1 with:

$ pacman -Q intel-compute-runtime intel-graphics-compiler intel-opencl-clang 
intel-compute-runtime 19.39.14278-1
intel-graphics-compiler 1:1.0.2652-1
intel-opencl-clang 9.0.0-1

under Arch

[New Thread 0x7fffacff9700 (LWP 20472)]
free(): invalid pointer

Thread 3 "luxmark.bin" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff51d2700 (LWP 20391)]
0x00007ffff718af25 in raise () from /usr/lib/libc.so.6
(gdb) where
#0  0x00007ffff718af25 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7174897 in abort () from /usr/lib/libc.so.6
#2  0x00007ffff71ce258 in __libc_message () from /usr/lib/libc.so.6
#3  0x00007ffff71d577a in malloc_printerr () from /usr/lib/libc.so.6
#4  0x00007ffff71d714c in _int_free () from /usr/lib/libc.so.6
#5  0x00000000015474af in std::locale::_Impl::~_Impl() ()
#6  0x000000000154768d in std::locale::~locale() ()
#7  0x00007ffff70992c4 in std::basic_streambuf<char, std::char_traits<char> >::~basic_streambuf (this=0x7ffff51cf638, __in_chrg=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/streambuf:204
#8  std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::~basic_stringbuf (this=0x7ffff51cf638, __in_chrg=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/sstream:65
#9  std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::~basic_stringstream (this=0x7ffff51cf620, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/sstream:784
#10 0x00007fffec680071 in ?? () from /usr/lib/libopencl-clang.so.9
#11 0x0000000000000070 in ?? ()
#12 0x00007ffff51cf610 in ?? ()
#13 0x00007fffebb38560 in ?? ()
#14 0x00007fffe96233e8 in ?? ()
#15 0x00007ffff51cf5b0 in ?? ()
#16 0x00007ffff51cf6a0 in ?? ()
#17 0x00007ffff51cf670 in ?? ()
#18 0x0000000000000012 in ?? ()
#19 0x00007fffebe97488 in ?? ()
#20 0x0000000000000003 in ?? ()
#21 0x0000000000000000 in ?? ()

The same luxmark binaries together with Neo

dpkg -l intel-opencl intel-igc-core intel-igc-opencl | grep ^ii
ii  intel-igc-core   1.0.2597     amd64        Intel(R) Graphics Compiler for OpenCL(TM)
ii  intel-igc-opencl 1.0.2597     amd64        Intel(R) Graphics Compiler for OpenCL(TM)
ii  intel-opencl     19.40.14409  amd64        Intel OpenCL GPU driver

works fine under Ubuntu 18.04.3 LTS. Tested with kernel: 5.3.1-x86_64

@dbermond Have you observed issues with intel-opencl-clang under Arch in other apps?

dbermond commented 5 years ago

@jdanecki Nothing that I can observe or be aware of.

What we are experiencing is a build issue with neo 19.40, apparently with gen12 related code.

JacekDanecki commented 5 years ago

@dbermond Can you provide more details about the build problems, or create a new issue here? Is it similar to the issue here? If so, this is a known issue on the IGC side, and a fix is in progress. If you take the older IGC, the same one used in Neo release 19.40, plus the fix for gcc 9 compilation intel/intel-graphics-compiler@028414b376d12d7d6fbb4939bca2a31a02b6a18f, you will be able to compile Neo correctly. With this newer IGC commit Neo compiles correctly.

dbermond commented 5 years ago

@JacekDanecki That's exactly this issue. Glad to see that a fix is under way.

aufkrawall commented 5 years ago

Still crashes with intel-compute-runtime 19.40.14409-1 intel-graphics-compiler 1:1.0.2714-1

Stack trace (no idea how helpful it is without debug symbols): https://drive.google.com/open?id=1ZaPDnmG4_vRN-4rZnI8agYzJmzhrYylp

JacekDanecki commented 5 years ago

@aufkrawall As the issue is observed in the intel-opencl-clang library, I've compiled:

@dbermond Is it possible to downgrade both packages in Arch, so they work correctly with Neo? Actually, only a spirv-llvm-translator downgrade and an opencl-clang rebuild are required.

dbermond commented 5 years ago

@JacekDanecki I heavily appreciate your effort in helping to solve this issue in Arch Linux (and I'm sure you know it), but as a general rule we do not downgrade repository packages in such situation. We only downgrade when a package is utterly broken. If there is a patch that we can use, then I would gladly apply it.

Besides, even if downgrading would be an option, I would like to mention that the OP reported this issue with neo 19.37 and igc 1:1.0.11 at the time of his writing. These were already based on the previous versions of intel-opencl-clang (8.0.1) and spirv-llvm-translator (8.0.1.2).

JacekDanecki commented 5 years ago

Neo release 19.38.14237 is the latest release based on llvm/clang 8 on the IGC side. Since intel/intel-graphics-compiler@7117adbaa6a5ffa055388251dcc2f8ae9e0a0851, IGC has switched to the newer intel/opencl-clang@v9.0.0 and KhronosGroup/SPIRV-LLVM-Translator@v9.0.0-1. Unfortunately these versions don't work with luxmark, and debugging is in progress.

JacekDanecki commented 5 years ago

When I removed directory /tmp/kernel_cache/LUXCORE_1.5, all scenes work correctly with IGC, opencl-clang, spirv-llvm-translator I mentioned earlier. I've checked it with the latest Neo intel/compute-runtime@bfc98631
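
A small sketch of clearing that cache before a retest. The default path is the one named above; the function name `clear_luxcore_cache` is hypothetical, and whether a stale cache is the actual trigger on a given machine is an assumption to verify.

```shell
# Hypothetical helper: remove a stale LuxCore kernel cache directory.
# Defaults to the path mentioned in this thread; pass another path to override.
clear_luxcore_cache() {
    local cache_dir="${1:-/tmp/kernel_cache/LUXCORE_1.5}"
    if [ -d "$cache_dir" ]; then
        rm -rf "$cache_dir"
        echo "removed $cache_dir"
    else
        echo "no cache at $cache_dir"
    fi
}

# Usage:
#   clear_luxcore_cache && ./luxmark
```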

JacekDanecki commented 5 years ago

The new Neo release 19.43.14583 contains intel/intel-graphics-compiler@igc-1.0.2714.1 compiled with intel/opencl-clang@v9.0.0 and KhronosGroup/SPIRV-LLVM-Translator@v9.0.0-1. These binaries work with Luxmark. I've recompiled Neo and the IGC components using these versions, and Luxmark works under Arch.

dbermond commented 5 years ago

@JacekDanecki Thanks for working on this.

aufkrawall commented 5 years ago

@dbermond Is there some action required for the Arch packaging? It still crashes for me with the same build versions in the Arch repo as outlined by JacekDanecki.

JacekDanecki commented 5 years ago

@aufkrawall It crashes when I use the new Arch packages too. @dbermond Here are the steps I used to build the whole Neo stack under Arch. Luxmark works with binaries created this way.

export llvm_commit=llvmorg-9.0.0
export opencl_clang_commit=9.0.0
export spirv_llvm_translator_commit=9.0.0-1
export llvm_patches_commit=1c93162ab33af968c22fe1cbfb12ea87f5a25bfa
export igc_commit=igc-1.0.2714.1
export neo_commit=19.43.14583
export gmmlib_commit=19.3.2

wget --no-check-certificate https://github.com/intel/gmmlib/archive/intel-gmmlib-${gmmlib_commit}.tar.gz
wget --no-check-certificate https://github.com/llvm/llvm-project/archive/${llvm_commit}/llvm-${llvm_commit}.tar.gz
wget --no-check-certificate https://github.com/intel/opencl-clang/archive/v${opencl_clang_commit}/opencl-clang-${opencl_clang_commit}.tar.gz
wget --no-check-certificate https://github.com/KhronosGroup/SPIRV-LLVM-Translator/archive/v${spirv_llvm_translator_commit}/spirv-llvm-translator-${spirv_llvm_translator_commit}.tar.gz
wget --no-check-certificate https://github.com/intel/llvm-patches/archive/${llvm_patches_commit}/llvm-patches-${llvm_patches_commit}.tar.gz
wget --no-check-certificate https://github.com/intel/intel-graphics-compiler/archive/${igc_commit}/igc-${igc_commit}.tar.gz

tar -xzf intel-gmmlib-${gmmlib_commit}.tar.gz
pushd gmmlib-intel-gmmlib-${gmmlib_commit}
mkdir build
pushd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DBUILD_TYPE=release -DRUN_TEST_SUITE:BOOL='OFF' -Wno-dev 
make -j 10
make -j 10 install
popd
popd

tar -xzf llvm-${llvm_commit}.tar.gz
ln -s llvm-project-${llvm_commit} llvm-project
tar -xzf opencl-clang-${opencl_clang_commit}.tar.gz
pushd llvm-project/llvm/projects
ln -s ../../../opencl-clang-${opencl_clang_commit} opencl-clang
popd
tar -xzf spirv-llvm-translator-${spirv_llvm_translator_commit}.tar.gz
pushd llvm-project/llvm/projects
ln -s ../../../SPIRV-LLVM-Translator-${spirv_llvm_translator_commit} llvm-spirv
popd
tar -xzf llvm-patches-${llvm_patches_commit}.tar.gz
ln -s llvm-patches-${llvm_patches_commit} llvm_patches
tar -xzf igc-${igc_commit}.tar.gz
ln -s intel-graphics-compiler-${igc_commit} igc
mv llvm-project/clang llvm-project/llvm/tools/

mkdir build
pushd build
cmake ../igc/IGC -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -Wno-dev
make -j 10
make -j 10 install
popd

mkdir neo
cd neo
wget --no-check-certificate https://github.com/intel/compute-runtime/archive/${neo_commit}/neo_${neo_commit}.tar.gz
tar -xzf neo_${neo_commit}.tar.gz
mkdir -p compute-runtime-${neo_commit}/build
pushd compute-runtime-${neo_commit}/build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -Wno-dev  -DSKIP_UNIT_TESTS=1
make -j 10 
make -j 10  install
popd

I'll rebuild IGC components with llvm/clang binaries provided in Arch to check how they work with Luxmark.

JacekDanecki commented 5 years ago

@dbermond When I rebuilt spirv-llvm-translator, opencl-clang and igc with the llvm/clang binaries from Arch, luxmark aborts:

./luxmark
free(): invalid pointer
./luxmark: line 12:   822 Aborted                 (core dumped) ./luxmark.bin "$@"

JacekDanecki commented 5 years ago

I've found a workaround. If you build spirv-llvm-translator with -DCMAKE_BUILD_TYPE=Debug and recompile intel-opencl-clang, luxmark will work correctly. Both components can be compiled with the system llvm/clang.

aufkrawall commented 5 years ago

I've added -DCMAKE_BUILD_TYPE=Debug \ to the spirv-llvm-translator PKGBUILD section before -Wno-dev, compiled & installed it, then compiled & installed intel-opencl-clang, but luxmark still crashes:

./luxmark
free(): invalid pointer
./luxmark: line 12: 117014 Aborted (core dumped) ./luxmark.bin "$@"

jdanecki commented 5 years ago

Here is the script I used to rebuild spirv-llvm-translator and intel-opencl-clang. Before I ran it, luxmark crashed; after running it, luxmark works. Are you using different versions or cmake parameters?

wget https://github.com/KhronosGroup/SPIRV-LLVM-Translator/archive/v9.0.0-1.tar.gz
tar -xzf v9.0.0-1.tar.gz
pushd SPIRV-LLVM-Translator-9.0.0-1
mkdir build
pushd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Debug -Wno-dev -DCMAKE_POSITION_INDEPENDENT_CODE=ON
make -j 10
make DESTDIR=install install
popd
popd

wget https://github.com/intel/opencl-clang/archive/v9.0.0.tar.gz
tar -xzf v9.0.0.tar.gz
pushd opencl-clang-9.0.0
mkdir build
pushd build

cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -Wno-dev -DLLVMSPIRV_INCLUDED_IN_LLVM=OFF -DSPIRV_TRANSLATOR_DIR=`pwd`/../../SPIRV-LLVM-Translator-9.0.0-1/build/install/usr
make -j 10
make DESTDIR=install install
cp install/usr/lib/libopencl-clang.so.9 /usr/lib/libopencl-clang.so.9

dbermond commented 5 years ago

@aufkrawall Add options=('!strip') to the PKGBUILD when using -DCMAKE_BUILD_TYPE=Debug so the debug symbols are not stripped from the ELF files in the package. You will probably need to use it in intel-opencl-clang too, because spirv-llvm-translator ships a static library.

aufkrawall commented 5 years ago

I did that and, as expected, the packages have massively grown in size. But it still crashes regardless:

free(): invalid pointer
./luxmark: line 12: 227167 Aborted                 (core dumped) ./luxmark.bin "$@"

JacekDanecki commented 5 years ago

The issue is with the default flags set in /etc/makepkg.conf. When I removed the -O2 flag from CXXFLAGS

--- makepkg.conf-orig   2019-11-04 17:27:50.929364959 +0100
+++ makepkg.conf        2019-11-04 17:53:29.381504944 +0100
@@ -38,7 +38,7 @@
 #-- Compiler and Linker Flags
 CPPFLAGS="-D_FORTIFY_SOURCE=2"
 CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt"
-CXXFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt"
+CXXFLAGS="-march=x86-64 -mtune=generic -pipe -fstack-protector-strong -fno-plt"
 LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now"
 #-- Make Flags: change this for DistCC/SMP systems
 MAKEFLAGS="-j`nproc`"

and then built and installed SPIRV-LLVM-Translator as Debug and opencl-clang as Release using pacman, luxmark started to work correctly.
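
The makepkg.conf change can also be tried without editing the system file. The sketch below prints a copy of the config with -O2 dropped from the CXXFLAGS line only; the helper name `strip_o2_from_cxxflags` is made up for illustration, and it assumes CXXFLAGS is defined on a single line as in the diff shown here.

```shell
# Hypothetical helper: print a makepkg.conf with -O2 removed from the CXXFLAGS line only.
# Output goes to stdout; the input file is left untouched.
strip_o2_from_cxxflags() {
    # On the CXXFLAGS line, drop " -O2" when followed by a space or the closing quote
    sed -E '/^CXXFLAGS=/ s/ -O2( |")/\1/' "$1"
}

# Usage:
#   strip_o2_from_cxxflags /etc/makepkg.conf > "$HOME/makepkg-noopt.conf"
#   makepkg --config "$HOME/makepkg-noopt.conf"
```

Using a separate file keeps the experiment limited to the packages under test instead of changing the flags for every later build.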

Here are changes I've made in PKGBUILD files

diff -Nurp spirv-llvm-translator-orig/PKGBUILD spirv-llvm-translator/PKGBUILD
--- spirv-llvm-translator-orig/PKGBUILD 2019-11-04 17:13:33.000000000 +0100
+++ spirv-llvm-translator/PKGBUILD      2019-11-04 17:21:57.937332840 +0100
@@ -22,6 +22,7 @@ build() {
     cmake ../${_srcname}-${pkgver%.*}-${_build} \
         -DCMAKE_INSTALL_PREFIX=/usr \
         -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
+        -DCMAKE_BUILD_TYPE=Debug \
         -Wno-dev
     make
 }

and

diff -Nurp intel-opencl-clang-orig/PKGBUILD intel-opencl-clang/PKGBUILD
--- intel-opencl-clang-orig/PKGBUILD    2019-11-04 17:12:31.000000000 +0100
+++ intel-opencl-clang/PKGBUILD 2019-11-04 17:26:23.834357034 +0100
@@ -24,7 +24,7 @@ build() {
         -DCMAKE_INSTALL_PREFIX=/usr \
         -DLLVMSPIRV_INCLUDED_IN_LLVM=OFF \
         -DSPIRV_TRANSLATOR_DIR=/usr \
-        -DLLVM_NO_DEAD_STRIP=ON \
+        -DCMAKE_BUILD_TYPE=Release \
         -Wno-dev
     make
 }

aufkrawall commented 5 years ago

It finally works: Screenshot_20191104_183415

Thanks a lot for your efforts, really appreciated!

@dbermond Would it be possible to ship the packages like this in Arch repo?

Edit: Well, I guess it would also be nice if it didn't require special treatment at compile time vs. other packages.

aufkrawall commented 5 years ago

@JacekDanecki Is it expected to have two CPU threads being fully utilized while Luxmark runs on the IGP? There is 12.25% total load by the luxmark process itself and another 12.25% not linked to any process.

When I run Luxmark on Polaris via ROCm, there is just ~1% CPU load.

dbermond commented 5 years ago

Glad to see that there is progress :)

@aufkrawall But shipping packages with Debug build type would not be suitable for the Arch repository :-/

patrolez commented 5 years ago

This is not only an issue related to Arch.

I am experiencing the same issue with binaries shipped by LuxMark (3.1 and 4.x) and Ubuntu Bionic with intel-opencl-icd package installed from ppa:intel-opencl/intel-opencl.

I would say that if the Debug build works well, there are bugs related to the optimization steps during compilation, linking, LLVM translation, or somewhere else.

I had been using the same downloaded copy of LuxMark 3.1 for a long time and it worked correctly on my Intel hardware; at some point, after some Intel package update, this free(): invalid pointer error started to appear.

aufkrawall commented 5 years ago

@dbermond I can confirm that it works with the packages rolled out by you in Arch testing repo. :+1:

dbermond commented 5 years ago

@aufkrawall You're really fast! Thank you for testing it and confirming that it works for you. :)

dbermond commented 5 years ago

As Jacek discovered, the issue was tracked down to the compiler optimization flag in spirv-llvm-translator. It works with -O0, and the crash happens with -O3 and -O2. This is something that should be fixed upstream by spirv-llvm-translator (Khronos), because it should work regardless of the optimization level, especially with -O2, which is generally considered safe.

I think this issue should be moved to spirv-llvm-translator for a proper upstream fix.

JacekDanecki commented 5 years ago

@patrolez On my two setups with Ubuntu 18.04 I've installed Luxmark 3.1 and intel-opencl-icd 19.44.14658-1~ppa1~bionic (with all dependencies). On the first setup Luxmark works fine, but on the other there is an abort. I need to find the differences between these setups.
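
One way to hunt for such differences, as a sketch only: dump the installed package versions on each machine and compare the two lists. The helper name `diff_pkg_lists` and the file names are hypothetical; each input file is assumed to hold one "name version" pair per line.

```shell
# Hypothetical helper: show package lines that differ between two setups.
# Each input file holds one "name version" pair per line, e.g. produced with
#   dpkg-query -W -f='${Package} ${Version}\n' > setup1.txt
diff_pkg_lists() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    sort "$1" > "$a"
    sort "$2" > "$b"
    # comm -3 hides lines common to both files:
    # unindented lines exist only in the first list, tab-indented lines only in the second
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Usage:
#   diff_pkg_lists setup1.txt setup2.txt
```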

aufkrawall commented 5 years ago

@dbermond Thanks for rolling out fixed packages in stable repo. Have you noticed that intel-compute-runtime requires a recompile?

dbermond commented 5 years ago

@aufkrawall Thanks for your feedback. I'm just uploading new version of neo rebuilt against latest igc. You're really fast as always ;)

I've also switched spirv-llvm-translator to use the shared library.

aufkrawall commented 5 years ago

Thanks, it simply works now.

JacekDanecki commented 5 years ago

@patrolez I've found the difference between my Ubuntu setups. On the setup where Luxmark did not work, I'd installed gcc-9 with library libstdc++.so.6.0.28 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu. When I downgraded libstdc++6 so that it provided libstdc++.so.6.0.25, Luxmark started to work.

As far as I can see, in Arch the package gcc-libs 9.2.0-2 contains libstdc++.so.6.0.27. It'd be interesting to check whether, under Arch with an older libstdc++ library, Luxmark works with spirv-llvm-translator compiled with optimization enabled.

Luxmark works under Ubuntu 19.10 (containing libstdc++.so.6.0.28) with Neo packages from the ppa. As I can see in the build log, spirv-llvm-translator was compiled with the -O3 parameter. I'm building the IGC components using llvm/clang sources on launchpad.
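
To see which libstdc++ a machine actually provides, the libstdc++.so.6 symlink can be resolved to its versioned file name. This is a minimal sketch: the helper name `libstdcxx_version` is made up, and the library path differs per distribution (typically /usr/lib/libstdc++.so.6 on Arch and /usr/lib/x86_64-linux-gnu/libstdc++.so.6 on 64-bit Ubuntu).

```shell
# Hypothetical helper: print the version suffix of a libstdc++.so.6 symlink,
# e.g. "6.0.28" for a link that resolves to libstdc++.so.6.0.28.
libstdcxx_version() {
    local target
    target=$(readlink -f "$1") || return 1
    # Strip everything up to and including the last ".so."
    printf '%s\n' "${target##*.so.}"
}

# Usage:
#   libstdcxx_version /usr/lib/x86_64-linux-gnu/libstdc++.so.6
```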

patrolez commented 5 years ago

> @patrolez I've found difference on my Ubuntu setups. On setup where Luxmark did not work, I'd installed gcc-9 with library libstdc++.so.6.0.28 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu. When I downgraded libstdc++6, so it provided libstdc++.so.6.0.25, Luxmark started to work.

@JacekDanecki: I have just followed what you described there and I can confirm that downgrading libstdc++6 made LuxMark work on my machine too, without touching any other packages. I am using that toolchain :+1: (Or from now on "will use on demand" :P)

Nice! :1st_place_medal:

JacekDanecki commented 4 years ago

Closing the issue here, as it looks like a problem with spirv-llvm-translator, not Neo itself.