Closed aufkrawall closed 4 years ago
Can you provide stack trace from gdb?
I'm on Arch, I'd have to recompile with debug symbols. Is it enough when intel-compute-runtime includes them?
Let's start with luxmark itself. Can you verify whether there was a GPU hang in dmesg?
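Both requests can be met without rebuilding anything. A minimal sketch, assuming the launcher script starts `./luxmark.bin` from the current directory (as the abort message later in this thread suggests) and that the i915 kernel driver reports hangs with the text "GPU HANG"; the helper name `find_gpu_hangs` is made up for illustration:

```shell
# Hypothetical helper: keep only kernel log lines that mention a GPU hang.
# Reads the log on stdin so it works with dmesg or a saved log file.
find_gpu_hangs() {
    grep -i 'gpu hang' || true   # no match is not an error
}

# Usage (dmesg may need root when kernel.dmesg_restrict=1):
#   dmesg | find_gpu_hangs
#
# Backtrace of all threads at the point of the crash, without an
# interactive gdb session:
#   gdb -batch -ex run -ex 'thread apply all bt' --args ./luxmark.bin
```

Without debug symbols the frames inside the driver libraries will mostly show as `??`, but the libc and libstdc++ frames are usually still readable.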
I've reproduced this issue with Luxmark 3.1 with:
$ pacman -Q intel-compute-runtime intel-graphics-compiler intel-opencl-clang
intel-compute-runtime 19.39.14278-1
intel-graphics-compiler 1:1.0.2652-1
intel-opencl-clang 9.0.0-1
under Arch
[New Thread 0x7fffacff9700 (LWP 20472)]
free(): invalid pointer
Thread 3 "luxmark.bin" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff51d2700 (LWP 20391)]
0x00007ffff718af25 in raise () from /usr/lib/libc.so.6
(gdb) where
#0 0x00007ffff718af25 in raise () from /usr/lib/libc.so.6
#1 0x00007ffff7174897 in abort () from /usr/lib/libc.so.6
#2 0x00007ffff71ce258 in __libc_message () from /usr/lib/libc.so.6
#3 0x00007ffff71d577a in malloc_printerr () from /usr/lib/libc.so.6
#4 0x00007ffff71d714c in _int_free () from /usr/lib/libc.so.6
#5 0x00000000015474af in std::locale::_Impl::~_Impl() ()
#6 0x000000000154768d in std::locale::~locale() ()
#7 0x00007ffff70992c4 in std::basic_streambuf<char, std::char_traits<char> >::~basic_streambuf (this=0x7ffff51cf638, __in_chrg=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/streambuf:204
#8 std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::~basic_stringbuf (this=0x7ffff51cf638, __in_chrg=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/sstream:65
#9 std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::~basic_stringstream (this=0x7ffff51cf620, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/sstream:784
#10 0x00007fffec680071 in ?? () from /usr/lib/libopencl-clang.so.9
#11 0x0000000000000070 in ?? ()
#12 0x00007ffff51cf610 in ?? ()
#13 0x00007fffebb38560 in ?? ()
#14 0x00007fffe96233e8 in ?? ()
#15 0x00007ffff51cf5b0 in ?? ()
#16 0x00007ffff51cf6a0 in ?? ()
#17 0x00007ffff51cf670 in ?? ()
#18 0x0000000000000012 in ?? ()
#19 0x00007fffebe97488 in ?? ()
#20 0x0000000000000003 in ?? ()
#21 0x0000000000000000 in ?? ()
The same luxmark binaries together with Neo
dpkg -l intel-opencl intel-igc-core intel-igc-opencl | grep ^ii
ii intel-igc-core 1.0.2597 amd64 Intel(R) Graphics Compiler for OpenCL(TM)
ii intel-igc-opencl 1.0.2597 amd64 Intel(R) Graphics Compiler for OpenCL(TM)
ii intel-opencl 19.40.14409 amd64 Intel OpenCL GPU driver
work fine under Ubuntu 18.04.3 LTS. Tested with kernel: 5.3.1-x86_64
@dbermond Have you observed issues with intel-opencl-clang under Arch in other apps?
@jdanecki Nothing that I have observed or am aware of.
What we are experiencing is a build issue with neo 19.40, apparently with gen12 related code.
@dbermond Can you provide more details about the build problems, or create a new issue here? Is it similar to the issue here? If so, this is a known issue on the IGC side, and a fix is in progress. If you take the older IGC, the same one as in Neo release 19.40, plus the fix for gcc 9 compilation intel/intel-graphics-compiler@028414b376d12d7d6fbb4939bca2a31a02b6a18f, you will be able to compile Neo correctly. With this newer IGC commit Neo compiles correctly.
@JacekDanecki That's exactly this issue. Glad to see that a fix is under way.
Still crashes with intel-compute-runtime 19.40.14409-1 intel-graphics-compiler 1:1.0.2714-1
Stack trace (no idea how helpful it is without debug symbols): https://drive.google.com/open?id=1ZaPDnmG4_vRN-4rZnI8agYzJmzhrYylp
@aufkrawall As the issue is observed in the intel-opencl-clang library, I've compiled:
@dbermond Is it possible to downgrade both packages in Arch, so they work correctly with Neo? Actually, only a spirv-llvm-translator downgrade and an opencl-clang rebuild are required.
@JacekDanecki I greatly appreciate your effort in helping to solve this issue in Arch Linux (and I'm sure you know it), but as a general rule we do not downgrade repository packages in such situations. We only downgrade when a package is utterly broken. If there is a patch that we can use, then I would gladly apply it.
Besides, even if downgrading were an option, I would like to mention that the OP reported this issue with neo 19.37 and igc 1:1.0.11 at the time of his writing. These were already based on the previous versions of intel-opencl-clang (8.0.1) and spirv-llvm-translator (8.0.1.2).
Neo release 19.38.14237 is the latest release based on llvm/clang 8 on the IGC side. Since intel/intel-graphics-compiler@7117adbaa6a5ffa055388251dcc2f8ae9e0a0851, IGC has switched to the newer intel/opencl-clang@v9.0.0 and KhronosGroup/SPIRV-LLVM-Translator@v9.0.0-1. Unfortunately these versions don't work with luxmark, and debugging is in progress.
When I removed the directory /tmp/kernel_cache/LUXCORE_1.5, all scenes worked correctly with the IGC, opencl-clang and spirv-llvm-translator versions I mentioned earlier. I've checked it with the latest Neo intel/compute-runtime@bfc98631
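Clearing the cached kernels before re-testing rules out stale binaries compiled by an earlier compiler stack. A minimal sketch; the function name `clear_kernel_cache` is hypothetical, and the default path is simply the directory named above:

```shell
# Hypothetical helper: remove LuxCore's on-disk kernel cache so that the
# next run recompiles every OpenCL kernel with the installed compiler stack.
clear_kernel_cache() {
    cache_dir="${1:-/tmp/kernel_cache/LUXCORE_1.5}"
    if [ -d "$cache_dir" ]; then
        rm -rf -- "$cache_dir"
        echo "removed $cache_dir"
    else
        echo "no cache at $cache_dir"
    fi
}

# Usage:
#   clear_kernel_cache            # default LuxCore 1.5 cache location
#   clear_kernel_cache /some/dir  # any other cache directory
```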
The new Neo release 19.43.14583 contains intel/intel-graphics-compiler@igc-1.0.2714.1 compiled with intel/opencl-clang@v9.0.0 and KhronosGroup/SPIRV-LLVM-Translator@v9.0.0-1. These binaries work with Luxmark. I've recompiled Neo and the IGC components using these versions, and Luxmark works under Arch.
@JacekDanecki Thanks for working on this.
@dbermond Is there some action required for the Arch packaging? It still crashes for me with the same build versions in the Arch repo as outlined by JacekDanecki.
@aufkrawall It crashes when I use the new Arch packages too. @dbermond Here are the steps I used to build the whole Neo stack under Arch. Luxmark works with binaries created this way.
export llvm_commit=llvmorg-9.0.0
export opencl_clang_commit=9.0.0
export spirv_llvm_translator_commit=9.0.0-1
export llvm_patches_commit=1c93162ab33af968c22fe1cbfb12ea87f5a25bfa
export igc_commit=igc-1.0.2714.1
export neo_commit=19.43.14583
export gmmlib_commit=19.3.2
wget --no-check-certificate https://github.com/intel/gmmlib/archive/intel-gmmlib-${gmmlib_commit}.tar.gz
wget --no-check-certificate https://github.com/llvm/llvm-project/archive/${llvm_commit}/llvm-${llvm_commit}.tar.gz
wget --no-check-certificate https://github.com/intel/opencl-clang/archive/v${opencl_clang_commit}/opencl-clang-${opencl_clang_commit}.tar.gz
wget --no-check-certificate https://github.com/KhronosGroup/SPIRV-LLVM-Translator/archive/v${spirv_llvm_translator_commit}/spirv-llvm-translator-${spirv_llvm_translator_commit}.tar.gz
wget --no-check-certificate https://github.com/intel/llvm-patches/archive/${llvm_patches_commit}/llvm-patches-${llvm_patches_commit}.tar.gz
wget --no-check-certificate https://github.com/intel/intel-graphics-compiler/archive/${igc_commit}/igc-${igc_commit}.tar.gz
tar -xzf intel-gmmlib-${gmmlib_commit}.tar.gz
pushd gmmlib-intel-gmmlib-${gmmlib_commit}
mkdir build
pushd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DBUILD_TYPE=release -DRUN_TEST_SUITE:BOOL='OFF' -Wno-dev
make -j 10
make -j 10 install
popd
popd
tar -xzf llvm-${llvm_commit}.tar.gz
ln -s llvm-project-${llvm_commit} llvm-project
tar -xzf opencl-clang-${opencl_clang_commit}.tar.gz
pushd llvm-project/llvm/projects
ln -s ../../../opencl-clang-${opencl_clang_commit} opencl-clang
popd
tar -xzf spirv-llvm-translator-${spirv_llvm_translator_commit}.tar.gz
pushd llvm-project/llvm/projects
ln -s ../../../SPIRV-LLVM-Translator-${spirv_llvm_translator_commit} llvm-spirv
popd
tar -xzf llvm-patches-${llvm_patches_commit}.tar.gz
ln -s llvm-patches-${llvm_patches_commit} llvm_patches
tar -xzf igc-${igc_commit}.tar.gz
ln -s intel-graphics-compiler-${igc_commit} igc
mv llvm-project/clang llvm-project/llvm/tools/
mkdir build
pushd build
cmake ../igc/IGC -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -Wno-dev
make -j 10
make -j 10 install
popd
mkdir neo
cd neo
wget --no-check-certificate https://github.com/intel/compute-runtime/archive/${neo_commit}/neo_${neo_commit}.tar.gz
tar -xzf neo_${neo_commit}.tar.gz
mkdir -p compute-runtime-${neo_commit}/build
pushd compute-runtime-${neo_commit}/build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -Wno-dev -DSKIP_UNIT_TESTS=1
make -j 10
make -j 10 install
popd
I'll rebuild IGC components with llvm/clang binaries provided in Arch to check how they work with Luxmark.
@dbermond When I rebuilt spirv-llvm-translator, opencl-clang and igc with the llvm/clang binaries from Arch, luxmark aborts:
./luxmark
free(): invalid pointer
./luxmark: line 12: 822 Aborted (core dumped) ./luxmark.bin "$@"
I've found a workaround. If you build spirv-llvm-translator with -DCMAKE_BUILD_TYPE=Debug and recompile intel-opencl-clang, luxmark will work correctly. Both components can be compiled with the system llvm/clang.
I've added -DCMAKE_BUILD_TYPE=Debug \ to the spirv-llvm-translator PKGBUILD section before -Wno-dev, compiled & installed it, then compiled & installed intel-opencl-clang, but luxmark still crashes:
./luxmark
free(): invalid pointer
./luxmark: line 12: 117014 Aborted (core dumped) ./luxmark.bin "$@"
Here is the script I used to rebuild spirv-llvm-translator and intel-opencl-clang. Before I ran it, luxmark crashed; after running it, luxmark works. Are you using different versions or cmake parameters?
wget https://github.com/KhronosGroup/SPIRV-LLVM-Translator/archive/v9.0.0-1.tar.gz
tar -xzf v9.0.0-1.tar.gz
pushd SPIRV-LLVM-Translator-9.0.0-1
mkdir build
pushd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Debug -Wno-dev -DCMAKE_POSITION_INDEPENDENT_CODE=ON
make -j 10
make DESTDIR=install install
popd
popd
wget https://github.com/intel/opencl-clang/archive/v9.0.0.tar.gz
tar -xzf v9.0.0.tar.gz
pushd opencl-clang-9.0.0
mkdir build
pushd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -Wno-dev -DLLVMSPIRV_INCLUDED_IN_LLVM=OFF -DSPIRV_TRANSLATOR_DIR=`pwd`/../../SPIRV-LLVM-Translator-9.0.0-1/build/install/usr
make -j 10
make DESTDIR=install install
cp install/usr/lib/libopencl-clang.so.9 /usr/lib/libopencl-clang.so.9
@aufkrawall Add options=('!strip') to the PKGBUILD when using -DCMAKE_BUILD_TYPE=Debug, so the debug symbols are not stripped from the ELF files in the package. You will probably need to use it in intel-opencl-clang too, because spirv-llvm-translator ships a static library.
I did that and, as expected, the packages have massively grown in size. But it still crashes regardless:
free(): invalid pointer
./luxmark: line 12: 227167 Aborted (core dumped) ./luxmark.bin "$@"
The issue is with the default flags set in /etc/makepkg.conf. When I removed the -O2 flag from CXXFLAGS
--- makepkg.conf-orig 2019-11-04 17:27:50.929364959 +0100
+++ makepkg.conf 2019-11-04 17:53:29.381504944 +0100
@@ -38,7 +38,7 @@
#-- Compiler and Linker Flags
CPPFLAGS="-D_FORTIFY_SOURCE=2"
CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt"
-CXXFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt"
+CXXFLAGS="-march=x86-64 -mtune=generic -pipe -fstack-protector-strong -fno-plt"
LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now"
#-- Make Flags: change this for DistCC/SMP systems
MAKEFLAGS="-j`nproc`"
and built and installed SPIRV-LLVM-Translator as Debug and opencl-clang as Release using pacman, luxmark started to work correctly.
Here are the changes I've made to the PKGBUILD files:
diff -Nurp spirv-llvm-translator-orig/PKGBUILD spirv-llvm-translator/PKGBUILD
--- spirv-llvm-translator-orig/PKGBUILD 2019-11-04 17:13:33.000000000 +0100
+++ spirv-llvm-translator/PKGBUILD 2019-11-04 17:21:57.937332840 +0100
@@ -22,6 +22,7 @@ build() {
cmake ../${_srcname}-${pkgver%.*}-${_build} \
-DCMAKE_INSTALL_PREFIX=/usr \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
+ -DCMAKE_BUILD_TYPE=Debug \
-Wno-dev
make
}
and
diff -Nurp intel-opencl-clang-orig/PKGBUILD intel-opencl-clang/PKGBUILD
--- intel-opencl-clang-orig/PKGBUILD 2019-11-04 17:12:31.000000000 +0100
+++ intel-opencl-clang/PKGBUILD 2019-11-04 17:26:23.834357034 +0100
@@ -24,7 +24,7 @@ build() {
-DCMAKE_INSTALL_PREFIX=/usr \
-DLLVMSPIRV_INCLUDED_IN_LLVM=OFF \
-DSPIRV_TRANSLATOR_DIR=/usr \
- -DLLVM_NO_DEAD_STRIP=ON \
+ -DCMAKE_BUILD_TYPE=Release \
-Wno-dev
make
}
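The makepkg.conf edit above affects every package built on the machine. CMake initializes CMAKE_CXX_FLAGS from the CXXFLAGS environment variable and only appends the per-build-type flags, so a -O2 inherited from makepkg.conf still reaches the compiler in a Debug build, which would explain why switching the build type alone was not enough. A narrower alternative, offered as an assumption rather than what was done in this thread, is to strip the optimization level inside the affected PKGBUILD only; the helper name `drop_opt_level` is hypothetical:

```shell
# Hypothetical helper: print a space-separated flag string with any
# stand-alone -O<level> token removed. Tokens such as -Wl,-O1,... are kept
# because they do not start with -O.
drop_opt_level() {
    printf '%s\n' "$1" \
        | tr ' ' '\n' \
        | grep -v -E '^-O[0-9sgz]?$|^-Ofast$' \
        | tr '\n' ' ' \
        | sed 's/ $//'
}

# Usage inside the build() function of a PKGBUILD (untested assumption):
#   export CXXFLAGS="$(drop_opt_level "$CXXFLAGS")"
```

Setting the variable in build() keeps the system-wide defaults intact for all other packages.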
It finally works:
Thanks a lot for your efforts, really appreciated!
@dbermond Would it be possible to ship the packages like this in Arch repo?
Edit: Well, I guess it would also be nice if it didn't require special treatment at compile time vs. other packages.
@JacekDanecki Is it expected to have two CPU threads being fully utilized while Luxmark runs on the IGP? There is 12.25% total load by the luxmark process itself and another 12.25% not linked to any process.
When I run Luxmark on Polaris via ROCm, there is just ~1% CPU load.
Glad to see that there is progress :)
@aufkrawall But shipping packages with Debug build type would not be suitable for the Arch repository :-/
This is not only an issue related to Arch.
I am experiencing the same issue with binaries shipped by LuxMark (3.1 and 4.x) on Ubuntu Bionic with the intel-opencl-icd package installed from ppa:intel-opencl/intel-opencl.
I would say that if the Debug build works, then there are bugs related to the optimization steps during compilation/linking/LLVM translation, or somewhere else.
I had been using the same downloaded copy of LuxMark 3.1 for a long time, and it worked correctly on my Intel hardware. At some point, after some Intel package update, this free(): invalid pointer error started to appear.
@dbermond I can confirm that it works with the packages rolled out by you in Arch testing repo. :+1:
@aufkrawall You're really fast! Thank you for testing it and confirming that it works for you. :)
As Jacek discovered, the issue was tracked down to the compiler optimization flag used for spirv-llvm-translator. It works with -O0, and the crash happens with -O3 and -O2. This is something that should be fixed upstream by spirv-llvm-translator (Khronos), because it should work regardless of the optimization level, especially with -O2, which is generally considered safe.
I think this issue should be moved to spirv-llvm-translator for a proper upstream fix.
@patrolez On my two setups with Ubuntu 18.04 I've installed Luxmark 3.1 and intel-opencl-icd 19.44.14658-1~ppa1~bionic (with all dependencies). On the first setup Luxmark works fine, but on the other it aborts. I need to find the differences between these setups.
@dbermond Thanks for rolling out fixed packages in stable repo. Have you noticed that intel-compute-runtime requires a recompile?
@aufkrawall Thanks for your feedback. I'm just uploading a new version of neo rebuilt against the latest igc. You're really fast as always ;)
I've also switched spirv-llvm-translator to use the shared library.
Thanks, it simply works now.
@patrolez I've found the difference between my Ubuntu setups. On the setup where Luxmark did not work, I'd installed gcc-9 with library libstdc++.so.6.0.28 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu. When I downgraded libstdc++6 so that it provided libstdc++.so.6.0.25, Luxmark started to work.
As far as I can see, the Arch package gcc-libs 9.2.0-2 contains libstdc++.so.6.0.27. It'd be interesting to check whether, under Arch with an older libstdc++ library, Luxmark works with spirv-llvm-translator compiled with optimization enabled.
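To compare setups, it helps to see which libstdc++ build the loader will actually pick up. A minimal sketch; the helper name `libstdcxx_version` is hypothetical, and the library paths in the usage comments are the usual defaults on Arch and Ubuntu, which may differ on other systems:

```shell
# Hypothetical helper: follow the libstdc++.so.6 symlink and print the
# version suffix of the real file, e.g. 6.0.25 or 6.0.28.
libstdcxx_version() {
    basename "$(readlink -f "$1")" | sed 's/^libstdc++\.so\.//'
}

# Usage:
#   libstdcxx_version /usr/lib/libstdc++.so.6                    # Arch
#   libstdcxx_version /usr/lib/x86_64-linux-gnu/libstdc++.so.6   # Ubuntu
#
# To see which copy a binary resolves at load time:
#   ldd ./luxmark.bin | grep 'libstdc++'
```

If `ldd` prints no libstdc++ line for the binary itself, that copy is probably linked statically, and the system library enters the process only through the OpenCL driver libraries.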
Luxmark works under Ubuntu 19.10 (containing libstdc++.so.6.0.28) with the Neo packages from the ppa. As I can see in the build log, spirv-llvm-translator was compiled with the -O3 parameter. I'm building the IGC components using the llvm/clang sources on launchpad.
@JacekDanecki: I have just followed what you mentioned over there and I can confirm that downgrading libstdc++6 made LuxMark work on my machine too, without touching any other packages.
I am using that toolchain :+1: (Or since now "will use on demand" :P)
Nice! :1st_place_medal:
Closing the issue here, as it looks like a problem with spirv-llvm-translator, not Neo itself.
When I try to run LuxMark 3.1 on the i7 6700k IGP, it crashes:
This is on Arch with linux 5.3.4, intel-compute-runtime 19.37.14191, intel-graphics-compiler 1:1.0.11
clinfo.log
The IGP is used only for CL; display output and desktop rendering run via the dGPU (Polaris). On Polaris, LuxMark works with Clover, OCL-Orca, OCL-PAL and OCL-ROCm.