Closed Ray-future closed 7 years ago
So, if you change that, for instance, to
#elif defined(DVBCSA_UNALIGNED_ACCESS_32xxx)
it works? What test, testbitslice? Would you mind running the binaries I built https://drive.google.com/open?id=0BxEZpdTX1bPvVl96MkVFSmJ2Wm8
And you still don't have /proc/cpu/alignment options with a 64-bit kernel, only the SIGBUS signal?
EDIT. Also try compiling with -O0, -O1, -O2, -fno-strict-aliasing. Not together, but in turn. Examine your gcc command line to make sure your option is at the end. You can also change configure.ac for that (don't forget to run ./bootstrap after that).
EDIT. Could you upload your build log? I need to see your compiler options. Maybe I can reproduce it.
http://sprunge.us/QZMV Your binary works.
Makes me think there is something wrong with the toolchain. But again uatest works
Could you run
./testenc && ./testdec && ./testbsops && ./testbitslice && seq 10 | xargs -n1 ./benchbitslice
with my binaries.
Unfortunately I left my Harddrive with my dev files at work. I'll have to post it tomorrow (12h from now).
Here are my sources. https://github.com/Raybuntu/LibreELEC.tv/blob/repair-009/packages/addons/addon-depends/libdvbcsa/package.mk
I just changed the commit to your recent PR.
This might be of interest:
pre_configure_target() {
# libdvbcsa is a bit faster without LTO, and tests will fail with gcc-5.x
strip_lto
export CFLAGS="$CFLAGS -fPIC"
}
OK. I'll describe the problem now, because 12 hours from now I'll be asleep :)
It looks like gcc thinks it can do unaligned 64-bit accesses (which are not allowed on ARM) and combines two 32-bit XORs into one 64-bit XOR (probably using NEON). To prove or refute this I need:
1. Assembly sources of the suspect libdvbcsa (from github). Add -save-temps to CFLAGS:
export CFLAGS="$CFLAGS -fPIC -save-temps"
build it, then tar the whole libdvbcsa-XXXX directory, and upload the tar.gz file. Don't remove -save-temps in the further steps.
2. Build without optimizations. Change configure.ac here
GCC_CFLAGS="$CFLAGS -O3 -funroll-loops -fomit-frame-pointer -D_XOPEN_SOURCE=600"
Use -O0, -O1, -O2 in turn. Run
./testenc && ./testdec && ./testbsops && ./testbitslice && ./benchbitslice
after each step and look for bus errors. If ./testbitslice fails, run ./benchbitslice to see if it fails too. Attach assembly sources.
3. Build with -O3 -fno-strict-aliasing. The same as 2.
So, for each step I need 1) assembly sources (tar.gz file) 2) build log (gcc command lines) 3) result
Also
yourgcc -dM -E - </dev/null
yourgcc -v
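One thing a macro dump like this can reveal is whether the compiler assumes unaligned accesses are permitted on the target. ACLE-conforming ARM compilers define __ARM_FEATURE_UNALIGNED in that case. A minimal sketch of checking it from C follows; the function name is hypothetical and not part of libdvbcsa, and on non-ARM compilers the macro is simply absent.

```c
/* Hypothetical helper, not part of libdvbcsa.
 * Returns 1 if the compiler advertises that it may emit unaligned
 * accesses (ACLE macro __ARM_FEATURE_UNALIGNED), otherwise 0.
 * On non-ARM targets the macro is normally undefined, so this returns 0. */
static int compiler_assumes_unaligned_ok(void)
{
#if defined(__ARM_FEATURE_UNALIGNED)
    return 1;
#else
    return 0;
#endif
}
```

Note that this only reports what the compiler believes; it says nothing about whether the kernel fixes up or faults on a given access.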
BTW. Do all executables I uploaded work without bus errors? It's important.
strip_lto was done on purpose:
https://github.com/OpenELEC/OpenELEC.tv/pull/4815#issuecomment-195662577
I haven't seen any bus errors with your binaries. With the LE build only testbitslice failed with a bus error, but not if I remove this: https://github.com/glenvt18/libdvbcsa/blob/97898b085d7e78c42e6fd03b1c1622a418ebd735/src/dvbcsa_pv.h#L108-L110
I'm getting approx. 160 Mbit/s, but your binaries are faster.
I'll do the tests and send you all the files.
Thanks. I've just tested with your CFLAGS (from git) and my ubuntu cross compiler (pls compare):
arm-linux-gnueabihf-gcc -DHAVE_CONFIG_H -I. -I.. -I../src -Wall -march=armv8-a+crc -mabi=aapcs-linux -Wno-psabi -Wa,-mno-warn-deprecated -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -fomit-frame-pointer -Wall -pipe -Os -O3 -funroll-loops -fomit-frame-pointer -D_XOPEN_SOURCE=600 -mfpu=neon -MT benchbitslice-benchbitslice.o -MD -MP -MF .deps/benchbitslice-benchbitslice.Tpo -c -o benchbitslice-benchbitslice.o `test -f 'benchbitslice.c' || echo './'`benchbitslice.c
on RPi2. No alignment violations. PROJECT=Odroid_C2 ARCH=arm, right?
My binaries use unaligned access with xor and neon. Yes, they must be faster.
It looks like your toolchain is doing some nasty things... Unfortunately, I can't build your LibreELEC toolchain right now - don't have enough RAM for it.
So I'll wait for your files and then we'll see.
EDIT. Just to fix it. My gcc:
> arm-linux-gnueabihf-gcc -v
Using built-in specs.
COLLECT_GCC=arm-linux-gnueabihf-gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/5/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4)
Yes, that's correct. I think one problem is that arm64 does not have an alignment trap. If I understand it correctly, the kernel can correct alignment on ARM. The toolchain is built like an RPi2 toolchain in LE, just with different values: mtune=cortex-a53, mcpu=armv8-a+crc and mfpu=neon-fp-armv8.
Added -mtune=cortex-a53. No bus errors either.
We don't need a kernel trap. It will kill performance. By design, libdvbcsa simply must not perform any unaligned memory accesses that are not allowed on a particular platform. In our case that means 64-bit accesses (32-bit ones are allowed). Fixing it inside the kernel just masks the real problem.
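To illustrate the kind of access pattern meant here, below is a minimal sketch of XORing an 8-byte block without ever dereferencing a 64-bit (or even 32-bit) pointer. The helper name xor_block64 is hypothetical and this is not the code used in libdvbcsa; it only shows one common way to avoid telling the compiler that the data is aligned.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of libdvbcsa.
 * XORs 8 bytes from src into dst. Both pointers may have any alignment:
 * the data is moved through memcpy into local words, so the compiler is
 * never told that dst or src is 4- or 8-byte aligned. */
static inline void xor_block64(uint8_t *dst, const uint8_t *src)
{
    uint32_t d[2], s[2];

    memcpy(d, dst, sizeof d);   /* byte-wise semantics, no alignment assumed */
    memcpy(s, src, sizeof s);
    d[0] ^= s[0];
    d[1] ^= s[1];
    memcpy(dst, d, sizeof d);
}
```

Compilers usually lower these fixed-size memcpy calls to plain loads and stores where the target permits unaligned access, so the cost should be small, but that is worth confirming in the generated assembly (for instance with -save-temps).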
My toolchain was simply installed from the Ubuntu repository. I run OpenELEC on my RPi2, but have no OpenELEC/LibreELEC toolchain. Tests and the library itself are built statically. They should generally work in environments close to each other (in terms of versions or time). I've run them in Linaro 12.10 (12 is the year) with custom kernel 3.4, just copied the binaries, and they worked.
Thanks. I have to admit I'm pretty clueless about this. I'll try your suggestions about gcc and I hope to resolve this. But since you can produce a compilation that works on my Odroid C2 with 64/32 there is no reason it can't in LE.
@glenvt18 Do your tests pass on RPi if you set echo 5 > /proc/cpu/alignment?
@Raybuntu We're working on it.
@kszaq Yes, they pass. Could you get these binaries
https://drive.google.com/open?id=0BxEZpdTX1bPvVl96MkVFSmJ2Wm8
and try them on your S805 with echo 5 > /proc/cpu/alignment?
I've built another batch of binaries, this time with linaro gcc-6.2.1. Pls try them https://drive.google.com/open?id=0BxEZpdTX1bPvX2FPeHNJS2F6Zk0
@Raybuntu After doing 1. (assembly sources) don't proceed with 2. and 3., but try glenvt18/libdvbcsa@73605787c67e4271f2b752b7779d682038796aa9.
@glenvt18 Here you go: http://raybuntu.libreelec.tv/tmp/libdvbcsa-97898b0.tar.gz Running your gcc-6.2.1 binaries: http://sprunge.us/TidW Testing 7360578: http://raybuntu.libreelec.tv/tmp/libdvbcsa-7360578.tar.gz http://sprunge.us/BFMM It seems to work now.
@glenvt18 Here are my results:
S805 (Cortex-A5), 32-bit kernel, 32-bit userspace
Illegal instruction (core dumped)
S905X (Cortex-A53), 64-bit kernel, 32-bit userspace
Thank you folks.
@Raybuntu This instruction from dvbcsa_algo.s triggered your bus errors:
ldmdb r3, {r0, ip}
My toolchain doesn't generate it. Could you upload the output of yourgcc -v?
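For context: on 32-bit ARM, load-multiple instructions such as ldmdb generally require a word-aligned base address, even where single-word loads tolerate misalignment. A minimal sketch of an alignment check that could be dropped into a debug build is shown below; is_aligned is a hypothetical helper, not something libdvbcsa provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical debug helper, not part of libdvbcsa.
 * Returns nonzero if p is aligned to `align` bytes.
 * `align` must be a power of two (4 for a word, 8 for a doubleword). */
static inline int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (uintptr_t)(align - 1)) == 0;
}
```

Asserting is_aligned(ptr, 4) before a suspect block access is a cheap way to catch the offending caller in a test run instead of waiting for a SIGBUS.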
About the 180/160 Mbit/s difference in performance. Benchmarks with the toolchains I used (old code), from slowest to fastest:
gcc 4.7 70
gcc 6.2 81 * similar to the one used by LibreELEC
gcc 4.9 84
gcc 5.4 87
But my difference is smaller. All of them are from Linaro.
@kszaq Could you build and test glenvt18/libdvbcsa@7360578 on your targets with /proc/cpu/alignment=5.
@Raybuntu @kszaq It's off the topic, but could you run the batch of tests from this post https://github.com/LibreELEC/LibreELEC.tv/pull/1053#issuecomment-266932933 on your S905s in AArch64 userspace mode? That will help me to tune 64 bit performance. Close all CPU-consuming processes (Kodi etc.) before starting.
@glenvt18
./armv8a-libreelec-linux-gnueabi-gcc-6.2.0 -v
Using built-in specs.
COLLECT_GCC=./armv8a-libreelec-linux-gnueabi-gcc-6.2.0
COLLECT_LTO_WRAPPER=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/lib/gcc/armv8a-libreelec-linux-gnueabi/6.2.0/lto-wrapper
Target: armv8a-libreelec-linux-gnueabi
Configured with: /home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/gcc-6.2.0/configure --host=x86_64-linux-gnu --build=x86_64-linux-gnu --prefix=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --bindir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/bin --sbindir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/sbin --sysconfdir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/etc --libexecdir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/lib --localstatedir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/var --disable-static --enable-shared --target=armv8a-libreelec-linux-gnueabi --with-sysroot=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/armv8a-libreelec-linux-gnueabi/sysroot --with-gmp=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --with-mpfr=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --with-mpc=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --with-gnu-as --with-gnu-ld --enable-plugin --enable-lto --enable-gold --enable-ld=default --disable-multilib --disable-nls --enable-checking=release --with-default-libstdcxx-abi=gcc4-compatible --without-ppl --without-cloog --disable-libada --disable-libmudflap --disable-libatomic --disable-libitm --disable-libquadmath --disable-libgomp --disable-libmpx --disable-libssp --enable-languages=c,c++ --enable-__cxa_atexit --enable-decimal-float --enable-tls --enable-shared --disable-static --enable-c99 --enable-long-long --enable-threads=posix --disable-libstdcxx-pch --enable-libstdcxx-time --enable-clocale=gnu --with-abi=aapcs-linux --with-arch=armv8-a --with-float=hard --with-fpu=neon-fp-armv8
Thread model: posix
gcc version 6.2.0 (GCC)
./armv8a-libreelec-linux-gnueabi-gcc-6.2.0 -dM -E - </dev/null |pastebinit
@glenvt18: Here is the bench for aarch64 userspace on an Odroid C2: http://sprunge.us/OYLb
@Raybuntu Thank you very much. It can give some clues.
BTW. Could you build and run glenvt18/libdvbcsa@7360578 on aarch64.
Here you go @glenvt18 http://sprunge.us/JAIM
Thanks. Looks good - no noticeable performance drop. So we can consider that 7360578 fixes this issue.
Thanks again for looking into this.
I think it's fixed, so I'm closing the issue. @glenvt18 if you need more testing in the future just ping me. Thanks for all the work and effort you put into this.
Following discussion just for interest. https://github.com/LibreELEC/LibreELEC.tv/pull/1053
I narrowed down the problem to this: https://github.com/glenvt18/libdvbcsa/blob/97898b085d7e78c42e6fd03b1c1622a418ebd735/src/dvbcsa_pv.h#L108-L110
I don't have a clue about most of it, but under an arm kernel we have the alignment trap to correct it. Under arm64 we don't have a trap.