glenvt18 / libdvbcsa

GNU General Public License v2.0

Unaligned access on Amlogic S905 with 32bit userland #1

Closed: Ray-future closed this issue 7 years ago

Ray-future commented 7 years ago

The following discussion is just for interest: https://github.com/LibreELEC/LibreELEC.tv/pull/1053

I narrowed down the problem to this: https://github.com/glenvt18/libdvbcsa/blob/97898b085d7e78c42e6fd03b1c1622a418ebd735/src/dvbcsa_pv.h#L108-L110

I don't have a clue about most of it, but under a 32-bit ARM kernel we have the alignment trap to correct it. Under an arm64 kernel we don't have a trap.

glenvt18 commented 7 years ago

So, if you change that, for instance, to

#elif defined(DVBCSA_UNALIGNED_ACCESS_32xxx)

does it work? Which test fails, testbitslice? Would you mind running the binaries I built? https://drive.google.com/open?id=0BxEZpdTX1bPvVl96MkVFSmJ2Wm8

And you still don't have /proc/cpu/alignment options with 64bit kernel, only SIGBUS signal?

EDIT. Also try compiling with -O0, -O1, -O2, -fno-strict-aliasing. Not together, but in turn. Examine your gcc command line to make sure your option is at the end. You can also change configure.ac for that (don't forget to run ./bootstrap afterwards).

EDIT. Could you upload your build log? I need to see your compiler options. Maybe I can reproduce it.

Ray-future commented 7 years ago

http://sprunge.us/QZMV Your binary works.

Makes me think there is something wrong with the toolchain. But again uatest works

glenvt18 commented 7 years ago

Could you run

./testenc && ./testdec && ./testbsops && ./testbitslice && seq 10 | xargs -n1 ./benchbitslice

with my binaries.

Ray-future commented 7 years ago

http://sprunge.us/PQXA

Unfortunately I left the hard drive with my dev files at work. I'll have to post it tomorrow (12h from now).

Here are my sources. https://github.com/Raybuntu/LibreELEC.tv/blob/repair-009/packages/addons/addon-depends/libdvbcsa/package.mk

I just changed the commit to your recent PR.

This might be of interest:

pre_configure_target() {
# libdvbcsa is a bit faster without LTO, and tests will fail with gcc-5.x
  strip_lto

  export CFLAGS="$CFLAGS -fPIC"
}

glenvt18 commented 7 years ago

OK. I'll describe the problem to you now, because 12 hours from now I'll be asleep :)

It looks like gcc thinks it can do unaligned 64bit access (which is not allowed on ARM) and combines 2x32bit xors into one 64bit xor (probably using NEON). To prove/refute it I need:

  1. Assembly sources of the suspect libdvbcsa (from github). Add -save-temps to CFLAGS:

    export CFLAGS="$CFLAGS -fPIC -save-temps"

    Build it, then tar the whole libdvbcsa-XXXX directory and upload the tar.gz file. Keep -save-temps in the later steps.

  2. Build without optimizations. Change configure.ac here

    GCC_CFLAGS="$CFLAGS -O3 -funroll-loops -fomit-frame-pointer -D_XOPEN_SOURCE=600"

    Use -O0, -O1, -O2 in turn. Run

    ./testenc && ./testdec && ./testbsops && ./testbitslice && ./benchbitslice

    after each step and look for bus errors. If ./testbitslice fails run ./benchbitslice to see if it fails too. Attach assembly sources.

  3. Build with -O3 -fno-strict-aliasing. The same as 2.

So, for each step I need 1) assembly sources (tar.gz file) 2) build log (gcc command lines) 3) result

Also

yourgcc -dM -E - </dev/null
yourgcc -v
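
For context on the macro dump: one thing worth looking for in the -dM -E output is __ARM_FEATURE_UNALIGNED, which the ARM C Language Extensions define when the compiler is allowed to assume that unaligned word accesses work on the target. A minimal sketch that reports the same information from inside a build follows. The function names are made up for illustration and are not part of libdvbcsa.

```c
#include <stdio.h>

/* Returns 1 if the compiler predefines __ARM_FEATURE_UNALIGNED (ACLE macro:
 * the compiler may emit unaligned word accesses for this target), else 0. */
int compiler_assumes_unaligned(void)
{
#if defined(__ARM_FEATURE_UNALIGNED)
    return 1;
#else
    return 0;
#endif
}

/* Names the target as seen through the compiler's predefined macros. */
const char *target_name(void)
{
#if defined(__aarch64__)
    return "AArch64";
#elif defined(__arm__)
    return "32-bit ARM";
#else
    return "non-ARM";
#endif
}

/* Hypothetical helper: prints a one-line summary of the two values above. */
void print_target_summary(void)
{
    printf("target=%s unaligned=%d\n",
           target_name(), compiler_assumes_unaligned());
}
```

Calling print_target_summary() from a test program built with the suspect toolchain and the same CFLAGS should show what that compiler believes about the target, which can then be compared against the other toolchain.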

BTW. Do all executables I uploaded work without bus errors? It's important.

strip_lto was done on purpose: https://github.com/OpenELEC/OpenELEC.tv/pull/4815#issuecomment-195662577

Ray-future commented 7 years ago

I haven't seen any bus errors with your binaries. With the LE build only testbitslice failed with a bus error, but not if I remove this: https://github.com/glenvt18/libdvbcsa/blob/97898b085d7e78c42e6fd03b1c1622a418ebd735/src/dvbcsa_pv.h#L108-L110

I'm getting approx. 160 Mbit/s, but your binaries are faster.

I'll do the tests and send you all the files.

glenvt18 commented 7 years ago

Thanks. I've just tested with your CFLAGS (from git) and my ubuntu cross compiler (pls compare):

arm-linux-gnueabihf-gcc -DHAVE_CONFIG_H -I. -I..    -I../src -Wall -march=armv8-a+crc -mabi=aapcs-linux -Wno-psabi -Wa,-mno-warn-deprecated -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -fomit-frame-pointer -Wall -pipe -Os -O3 -funroll-loops -fomit-frame-pointer -D_XOPEN_SOURCE=600 -mfpu=neon -MT benchbitslice-benchbitslice.o -MD -MP -MF .deps/benchbitslice-benchbitslice.Tpo -c -o benchbitslice-benchbitslice.o `test -f 'benchbitslice.c' || echo './'`benchbitslice.c

on RPi2. No alignment violations. PROJECT=Odroid_C2 ARCH=arm right?

My binaries use unaligned access with xor and neon. Yes, they must be faster.

It looks like your toolchain is doing some nasty things... Unfortunately, I can't build your LibreELEC toolchain right now - don't have enough RAM for it.

So I'll wait for your files and then we'll see.

EDIT. Just for the record, my gcc:

> arm-linux-gnueabihf-gcc -v
Using built-in specs.
COLLECT_GCC=arm-linux-gnueabihf-gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/5/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4) 

Ray-future commented 7 years ago

Yes, that's correct. I think one problem is that arm64 does not have an alignment trap. If I understand it correctly, the kernel can fix up unaligned accesses on ARM. The toolchain is built like the RPi2 toolchain in LE, only the values are different: mtune=cortex-a53, mcpu=armv8-a+crc and mfpu=neon-fp-armv8.

glenvt18 commented 7 years ago

Added -mtune=cortex-a53. No bus errors either.

We don't need a kernel trap. It will kill performance. By design, libdvbcsa simply must not make any unaligned memory accesses that are not allowed on a particular platform. In our case that means 64-bit accesses (32-bit ones are allowed). Fixing it inside the kernel just masks the real problem.
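
As an illustration of that rule, here is a minimal sketch of a 64-bit XOR that never dereferences a pointer wider than a byte, so the compiler has to pick instructions that are legal at any alignment. The function name is hypothetical, and this is not necessarily how libdvbcsa handles it.

```c
#include <stdint.h>
#include <string.h>

/* XOR the 8 bytes at a into the 8 bytes at b.
 * Neither pointer needs any particular alignment: the data is moved
 * through local variables with memcpy, which compilers typically lower
 * to the widest load/store the target permits for unknown alignment. */
void xor_64_any_alignment(uint8_t *b, const uint8_t *a)
{
    uint64_t x, y;

    memcpy(&x, a, sizeof x);   /* load source bytes */
    memcpy(&y, b, sizeof y);   /* load destination bytes */
    y ^= x;
    memcpy(b, &y, sizeof y);   /* store the result back */
}
```

On targets where unaligned 64-bit access is cheap, an optimizing compiler will usually reduce this to a single load, xor and store, so the portable form does not have to cost performance there.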

My toolchain was simply installed from the Ubuntu repository. I run OpenELEC on my RPi2, but have no OpenELEC/LibreELEC toolchain. Tests and the library itself are built statically. They should generally work in environments close to each other (in terms of versions or time). I've run them in Linaro 12.10 (12 is the year) with custom kernel 3.4, just copied the binaries, and they worked.

Ray-future commented 7 years ago

Thanks. I have to admit I'm pretty clueless about this. I'll try your suggestions about gcc and I hope to resolve this. But since you can produce a compilation that works on my Odroid C2 with 64/32 there is no reason it can't in LE.

kszaq commented 7 years ago

@glenvt18 Do your tests pass on RPi if you set echo 5 > /proc/cpu/alignment?
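
For readers unfamiliar with this knob: on 32-bit ARM kernels, /proc/cpu/alignment takes a bitmask in which bit 0 means warn, bit 1 means fix up, and bit 2 means send SIGBUS. So 5 is warn plus SIGBUS with no fixup, which makes any trapped unaligned access visible instead of silently repaired. A small sketch that decodes a mode value (the names are illustrative, not a kernel API):

```c
#include <stdio.h>

/* Bit meanings of the user-mode value written to /proc/cpu/alignment. */
enum {
    ALIGN_WARN   = 1 << 0,  /* log a warning */
    ALIGN_FIXUP  = 1 << 1,  /* emulate the access in the kernel */
    ALIGN_SIGNAL = 1 << 2   /* deliver SIGBUS to the process */
};

/* Writes a space-separated description of mode into out and returns out. */
const char *describe_alignment_mode(unsigned mode, char *out, size_t n)
{
    snprintf(out, n, "%s%s%s",
             (mode & ALIGN_WARN)   ? "warn "   : "",
             (mode & ALIGN_FIXUP)  ? "fixup "  : "",
             (mode & ALIGN_SIGNAL) ? "signal " : "");
    return out;
}
```

For example, describe_alignment_mode(5, buf, sizeof buf) yields "warn signal ", while 3 would yield "warn fixup ".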

glenvt18 commented 7 years ago

@Raybuntu We're working on it.

@kszaq Yes, they pass. Could you get these binaries https://drive.google.com/open?id=0BxEZpdTX1bPvVl96MkVFSmJ2Wm8 and try them on your S805 with echo 5 > /proc/cpu/alignment?

glenvt18 commented 7 years ago

I've built another batch of binaries, this time with linaro gcc-6.2.1. Pls try them https://drive.google.com/open?id=0BxEZpdTX1bPvX2FPeHNJS2F6Zk0

glenvt18 commented 7 years ago

@Raybuntu After doing 1. (assembly sources) don't proceed with 2. and 3., but try glenvt18/libdvbcsa@73605787c67e4271f2b752b7779d682038796aa9.

Ray-future commented 7 years ago

@glenvt18 Here you go: http://raybuntu.libreelec.tv/tmp/libdvbcsa-97898b0.tar.gz

Running your gcc-6.2.1 binaries: http://sprunge.us/TidW

Testing 7360578: http://raybuntu.libreelec.tv/tmp/libdvbcsa-7360578.tar.gz http://sprunge.us/BFMM

It seems to work now.

kszaq commented 7 years ago

@glenvt18 Here are my results:

S805 (Cortex-A5), 32-bit kernel, 32-bit userspace

S905X (Cortex-A53), 64-bit kernel, 32-bit userspace

glenvt18 commented 7 years ago

Thank you folks.

@Raybuntu This instruction from dvbcsa_algo.s triggered your bus errors:

        ldmdb   r3, {r0, ip}

My toolchain doesn't generate it. Could you upload the output of yourgcc -v? About the 180/160 Mbit/s difference in performance: here are benchmarks with the toolchains I used (old code), from slowest to fastest:

  gcc 4.7  70
  gcc 6.2  81 * similar to the one used by LibreELEC
  gcc 4.9  84
  gcc 5.4  87

But my difference is smaller. All of them are from Linaro.
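
Some background on why that instruction faults: on ARMv7 and later, a plain LDR/STR can be permitted to touch an unaligned address, but load/store-multiple instructions such as LDM require a word-aligned address regardless. Two adjacent 32-bit loads through a plain uint32_t pointer tell the compiler the address is word-aligned, so it is free to merge them into one multi-register load. Below is a sketch of one common way to withhold that promise, using a packed struct (a GCC/Clang extension). The names are hypothetical, and this is not claimed to be what 7360578 does.

```c
#include <stdint.h>

/* A packed struct has alignment 1, so the compiler must not assume that
 * the member sits on a word boundary. may_alias lets it overlay a byte
 * buffer without running into strict-aliasing assumptions. */
struct unaligned_u32 {
    uint32_t v;
} __attribute__((packed, may_alias));

/* XOR two consecutive 32-bit words at a into b; any alignment is fine. */
void xor_2x32_unaligned(uint8_t *b, const uint8_t *a)
{
    struct unaligned_u32 *pb = (struct unaligned_u32 *)b;
    const struct unaligned_u32 *pa = (const struct unaligned_u32 *)a;

    pb[0].v ^= pa[0].v;
    pb[1].v ^= pa[1].v;
}
```

Inspecting the -save-temps assembly for a function like this is a quick way to check whether a given toolchain still emits a multi-register load for it.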

@kszaq Could you build and test glenvt18/libdvbcsa@7360578 on your targets with /proc/cpu/alignment=5?

@Raybuntu @kszaq It's off the topic, but could you run the batch of tests from this post https://github.com/LibreELEC/LibreELEC.tv/pull/1053#issuecomment-266932933 on your S905s in AArch64 userspace mode? That will help me to tune 64 bit performance. Close all CPU-consuming processes (Kodi etc.) before starting.

Ray-future commented 7 years ago

@glenvt18

./armv8a-libreelec-linux-gnueabi-gcc-6.2.0 -v
Using built-in specs.
COLLECT_GCC=./armv8a-libreelec-linux-gnueabi-gcc-6.2.0
COLLECT_LTO_WRAPPER=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/lib/gcc/armv8a-libreelec-linux-gnueabi/6.2.0/lto-wrapper
Target: armv8a-libreelec-linux-gnueabi
Configured with: /home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/gcc-6.2.0/configure --host=x86_64-linux-gnu --build=x86_64-linux-gnu --prefix=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --bindir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/bin --sbindir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/sbin --sysconfdir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/etc --libexecdir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/lib --localstatedir=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/var --disable-static --enable-shared --target=armv8a-libreelec-linux-gnueabi --with-sysroot=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain/armv8a-libreelec-linux-gnueabi/sysroot --with-gmp=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --with-mpfr=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --with-mpc=/home/ray/Entwicklung/LibreELEC.tv/build.LibreELEC-Odroid_C2.arm-8.0-devel/toolchain --with-gnu-as --with-gnu-ld --enable-plugin --enable-lto --enable-gold --enable-ld=default --disable-multilib --disable-nls --enable-checking=release --with-default-libstdcxx-abi=gcc4-compatible --without-ppl --without-cloog --disable-libada --disable-libmudflap --disable-libatomic --disable-libitm --disable-libquadmath --disable-libgomp --disable-libmpx --disable-libssp --enable-languages=c,c++ --enable-__cxa_atexit --enable-decimal-float --enable-tls --enable-shared --disable-static --enable-c99 --enable-long-long --enable-threads=posix --disable-libstdcxx-pch --enable-libstdcxx-time --enable-clocale=gnu --with-abi=aapcs-linux --with-arch=armv8-a --with-float=hard --with-fpu=neon-fp-armv8
Thread model: posix
gcc version 6.2.0 (GCC) 
./armv8a-libreelec-linux-gnueabi-gcc-6.2.0 -dM -E - </dev/null |pastebinit

http://paste.debian.net/902516/

Ray-future commented 7 years ago

@glenvt18: Here is the bench for aarch64 userspace on a Odroid C2: http://sprunge.us/OYLb

glenvt18 commented 7 years ago

@Raybuntu Thank you very much. It can give some clues.

BTW. Could you build and run glenvt18/libdvbcsa@7360578 on aarch64?

Ray-future commented 7 years ago

Here you go @glenvt18 http://sprunge.us/JAIM

glenvt18 commented 7 years ago

Thanks. Looks good - no noticeable performance drop. So we can consider that 7360578 fixes this issue.

Ray-future commented 7 years ago

Thanks again for looking into this.

Ray-future commented 7 years ago

I think it's fixed, so I'm closing the issue. @glenvt18 if you need more testing in the future just ping me. Thanks for all the work and effort you put into this.