lighttransport / embree-aarch64

AARCH64 port of Embree ray tracing library
Apache License 2.0

Performance issue in LOW_QUALITY BVH build on specific ARM processors (especially Snapdragon 82x) #12

Closed syoyo closed 4 years ago

syoyo commented 5 years ago

From PR #11


I've noticed a performance issue that is only present on ARM devices: LOW_QUALITY BVH construction is slower than expected. For example, 1,000,000 triangles take:

LOW_QUALITY: 0.74 seconds, 1.34 Mprims/s, 266 SAH build quality
MEDIUM_QUALITY: 0.57 seconds, 1.75 Mprims/s, 249 SAH build quality
HIGH_QUALITY: 1.47 seconds, 0.68 Mprims/s, 249 SAH build quality
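
For reference, the Mprims/s figures follow from the primitive count and the build time. The sketch below is a minimal illustration, assuming the 1,000,000-triangle count quoted above applies to all three runs; the helper name `mprims_per_second` is made up for this example and is not part of Embree. Because the reported times are rounded to two decimals, the recomputed rates can differ from the reported ones in the last digit.

```python
def mprims_per_second(num_prims: int, seconds: float) -> float:
    """Build throughput in millions of primitives per second."""
    return num_prims / seconds / 1e6

NUM_PRIMS = 1_000_000  # triangle count quoted in the report (assumed for all runs)

# Build times in seconds, as reported above.
build_times = {"LOW_QUALITY": 0.74, "MEDIUM_QUALITY": 0.57, "HIGH_QUALITY": 1.47}

rates = {q: mprims_per_second(NUM_PRIMS, t) for q, t in build_times.items()}
for quality, rate in rates.items():
    print(f"{quality}: {rate:.2f} Mprims/s")

# The anomaly: the low-quality build is expected to be the fastest, but is not.
low_is_fastest = rates["LOW_QUALITY"] == max(rates.values())
print("LOW_QUALITY is fastest:", low_is_fastest)
```

On a healthy device the last line would be expected to report `True`; with the numbers above it does not, which is the defect this issue tracks.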


syoyo commented 4 years ago

I think PR https://github.com/lighttransport/embree-aarch64/pull/14 would solve this issue.

maikschulze commented 4 years ago

Hi,

I've briefly tested the current master state https://github.com/lighttransport/embree-aarch64/commit/36ad8171faed603d9dd302ab27383e3818971f7c with my replication of the buildbench tutorial on my Android aarch64 smartphone (OnePlus 3T).

Unfortunately, the situation has not improved for this particular performance defect:

LOW_QUALITY: 0.74 seconds, 1.34 Mprims/s, 266 SAH build quality
MEDIUM_QUALITY: 0.55 seconds, 1.81 Mprims/s, 249 SAH build quality
HIGH_QUALITY: 1.86 seconds, 0.54 Mprims/s, 249 SAH build quality

The low-quality builder is still too slow. Please note that BUILD_IOS was not enabled. I will try to repeat this test either on iOS or by enabling this flag on Android wherever applicable.

syoyo commented 4 years ago

Can you please test with the neon-fix branch? It solves a NEON issue (https://github.com/lighttransport/embree-aarch64/issues/17) by backporting the BUILD_IOS code path (some NEON fixes/improvements by @pchang0414).

I will also try to run buildbench on our Jetson AGX Xavier. @maikschulze Which scene data did you use for benchmarking?

maikschulze commented 4 years ago

I made a mistake in my comment: I meant to refer to my replication of the bvh_builder tutorial, which synthesizes geometry. Sorry for the confusion.

I will have a look at the other branch. It seems you have already done the work I intended to do. Thanks :)

maikschulze commented 4 years ago

I've tested the neon-fix branch https://github.com/lighttransport/embree-aarch64/commit/7863a1ce95af0da85a3554811842821c81925708 with the internal tasking system and obtained:

LOW_QUALITY: 0.82 seconds, 1.22 Mprims/s, 266 SAH build quality
MEDIUM_QUALITY: 0.50 seconds, 2.02 Mprims/s, 249 SAH build quality
HIGH_QUALITY: 1.39 seconds, 0.72 Mprims/s, 249 SAH build quality

syoyo commented 4 years ago

Here is the result of bvh_builder on Jetson AGX Xavier (ARMv8 Processor rev 0 (v8l), 8 cores) using the neon-fix branch.

gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

$CMAKE_BIN \
  -DCMAKE_BUILD_TYPE=Release \
  -DEMBREE_ARM=On \
  -DEMBREE_ADDRESS_SANITIZER=Off \
  -DCMAKE_INSTALL_PREFIX=$HOME/local/embree3 \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 467.245ms, 4.92247 Mprims/s, sah = 363.265 [DONE]
iteration 1: building BVH over 2300000 primitives, 189.316ms, 12.149 Mprims/s, sah = 363.265 [DONE]
iteration 2: building BVH over 2300000 primitives, 199.916ms, 11.5048 Mprims/s, sah = 363.265 [DONE]
iteration 3: building BVH over 2300000 primitives, 174.691ms, 13.1661 Mprims/s, sah = 363.265 [DONE]
iteration 4: building BVH over 2300000 primitives, 178.087ms, 12.915 Mprims/s, sah = 363.265 [DONE]
iteration 5: building BVH over 2300000 primitives, 196.053ms, 11.7315 Mprims/s, sah = 363.265 [DONE]
iteration 6: building BVH over 2300000 primitives, 216.538ms, 10.6217 Mprims/s, sah = 363.265 [DONE]
iteration 7: building BVH over 2300000 primitives, 163.449ms, 14.0717 Mprims/s, sah = 363.265 [DONE]
iteration 8: building BVH over 2300000 primitives, 203.061ms, 11.3267 Mprims/s, sah = 363.265 [DONE]
iteration 9: building BVH over 2300000 primitives, 199.604ms, 11.5228 Mprims/s, sah = 363.265 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 793.211ms, 2.89961 Mprims/s, sah = 340.853 [DONE]
iteration 1: building BVH over 2300000 primitives, 436.282ms, 5.27182 Mprims/s, sah = 340.853 [DONE]
iteration 2: building BVH over 2300000 primitives, 440.721ms, 5.21872 Mprims/s, sah = 340.853 [DONE]
iteration 3: building BVH over 2300000 primitives, 430.462ms, 5.3431 Mprims/s, sah = 340.853 [DONE]
iteration 4: building BVH over 2300000 primitives, 447.181ms, 5.14333 Mprims/s, sah = 340.853 [DONE]
iteration 5: building BVH over 2300000 primitives, 429.659ms, 5.35308 Mprims/s, sah = 340.853 [DONE]
iteration 6: building BVH over 2300000 primitives, 368.533ms, 6.24096 Mprims/s, sah = 340.853 [DONE]
iteration 7: building BVH over 2300000 primitives, 380.974ms, 6.03716 Mprims/s, sah = 340.853 [DONE]
iteration 8: building BVH over 2300000 primitives, 387.172ms, 5.94051 Mprims/s, sah = 340.853 [DONE]
iteration 9: building BVH over 2300000 primitives, 412.654ms, 5.57368 Mprims/s, sah = 340.853 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1400.94ms, 1.64176 Mprims/s, sah = 339.742 [DONE]
iteration 1: building BVH over 2300000 primitives, 1122.35ms, 2.04927 Mprims/s, sah = 339.742 [DONE]
iteration 2: building BVH over 2300000 primitives, 999.972ms, 2.30006 Mprims/s, sah = 339.742 [DONE]
iteration 3: building BVH over 2300000 primitives, 862.534ms, 2.66656 Mprims/s, sah = 339.742 [DONE]
iteration 4: building BVH over 2300000 primitives, 811.949ms, 2.83269 Mprims/s, sah = 339.742 [DONE]
iteration 5: building BVH over 2300000 primitives, 834.805ms, 2.75513 Mprims/s, sah = 339.742 [DONE]
iteration 6: building BVH over 2300000 primitives, 821.769ms, 2.79884 Mprims/s, sah = 339.742 [DONE]
iteration 7: building BVH over 2300000 primitives, 796.84ms, 2.8864 Mprims/s, sah = 339.742 [DONE]
iteration 8: building BVH over 2300000 primitives, 754.551ms, 3.04817 Mprims/s, sah = 339.742 [DONE]
iteration 9: building BVH over 2300000 primitives, 736.78ms, 3.12169 Mprims/s, sah = 339.742 [DONE]

clang 9.0.0

$CMAKE_BIN \
  -DCMAKE_BUILD_TYPE=Release \
  -DEMBREE_ARM=On \
  -DEMBREE_ADDRESS_SANITIZER=Off \
  -DCMAKE_INSTALL_PREFIX=$HOME/local/embree3 \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 330.425ms, 6.96073 Mprims/s, sah = 363.265 [DONE]
iteration 1: building BVH over 2300000 primitives, 217.422ms, 10.5785 Mprims/s, sah = 363.265 [DONE]
iteration 2: building BVH over 2300000 primitives, 145.361ms, 15.8227 Mprims/s, sah = 363.265 [DONE]
iteration 3: building BVH over 2300000 primitives, 264.551ms, 8.69398 Mprims/s, sah = 363.265 [DONE]
iteration 4: building BVH over 2300000 primitives, 214.144ms, 10.7404 Mprims/s, sah = 363.265 [DONE]
iteration 5: building BVH over 2300000 primitives, 215.484ms, 10.6737 Mprims/s, sah = 363.265 [DONE]
iteration 6: building BVH over 2300000 primitives, 207.744ms, 11.0713 Mprims/s, sah = 363.265 [DONE]
iteration 7: building BVH over 2300000 primitives, 217.249ms, 10.5869 Mprims/s, sah = 363.265 [DONE]
iteration 8: building BVH over 2300000 primitives, 205.477ms, 11.1935 Mprims/s, sah = 363.265 [DONE]
iteration 9: building BVH over 2300000 primitives, 205.345ms, 11.2007 Mprims/s, sah = 363.265 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 771.947ms, 2.97948 Mprims/s, sah = 340.853 [DONE]
iteration 1: building BVH over 2300000 primitives, 482.536ms, 4.76648 Mprims/s, sah = 340.853 [DONE]
iteration 2: building BVH over 2300000 primitives, 388.218ms, 5.9245 Mprims/s, sah = 340.853 [DONE]
iteration 3: building BVH over 2300000 primitives, 387.673ms, 5.93283 Mprims/s, sah = 340.853 [DONE]
iteration 4: building BVH over 2300000 primitives, 376.233ms, 6.11323 Mprims/s, sah = 340.853 [DONE]
iteration 5: building BVH over 2300000 primitives, 373.749ms, 6.15386 Mprims/s, sah = 340.853 [DONE]
iteration 6: building BVH over 2300000 primitives, 371.707ms, 6.18767 Mprims/s, sah = 340.853 [DONE]
iteration 7: building BVH over 2300000 primitives, 372.158ms, 6.18017 Mprims/s, sah = 340.853 [DONE]
iteration 8: building BVH over 2300000 primitives, 351.09ms, 6.55103 Mprims/s, sah = 340.853 [DONE]
iteration 9: building BVH over 2300000 primitives, 359.239ms, 6.40242 Mprims/s, sah = 340.853 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1334.42ms, 1.7236 Mprims/s, sah = 339.742 [DONE]
iteration 1: building BVH over 2300000 primitives, 789.567ms, 2.91299 Mprims/s, sah = 339.742 [DONE]
iteration 2: building BVH over 2300000 primitives, 758.453ms, 3.03249 Mprims/s, sah = 339.742 [DONE]
iteration 3: building BVH over 2300000 primitives, 776.326ms, 2.96267 Mprims/s, sah = 339.742 [DONE]
iteration 4: building BVH over 2300000 primitives, 813.967ms, 2.82567 Mprims/s, sah = 339.742 [DONE]
iteration 5: building BVH over 2300000 primitives, 807.859ms, 2.84703 Mprims/s, sah = 339.742 [DONE]
iteration 6: building BVH over 2300000 primitives, 713.595ms, 3.22312 Mprims/s, sah = 339.742 [DONE]
iteration 7: building BVH over 2300000 primitives, 714.242ms, 3.2202 Mprims/s, sah = 339.742 [DONE]
iteration 8: building BVH over 2300000 primitives, 753.975ms, 3.0505 Mprims/s, sah = 339.742 [DONE]
iteration 9: building BVH over 2300000 primitives, 665.994ms, 3.45348 Mprims/s, sah = 339.742 [DONE]

At least there is no performance degradation with either gcc or clang on the Jetson AGX Xavier (ARM A72(?) cores).
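
To compare logs like the ones above without eyeballing them, the bvh_builder output can be parsed and summarized. The sketch below is one possible approach, not part of the repository: the regular expression simply mirrors the log lines shown in this thread, the function names are invented for illustration, and treating iteration 0 as a warm-up to be skipped is an assumption based on it being noticeably slower in these runs.

```python
import re
from statistics import median

# Matches lines such as:
# iteration 1: building BVH over 2300000 primitives, 189.316ms, 12.149 Mprims/s, sah = 363.265 [DONE]
LINE_RE = re.compile(
    r"iteration (\d+): building BVH over (\d+) primitives, "
    r"([\d.]+)ms, ([\d.]+) Mprims/s, sah = ([\d.]+)"
)

def parse_build_log(text):
    """Return {section header: [build time in ms for each iteration]}."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("BVH build:"):
            current = line.rstrip(":")
            sections[current] = []
        else:
            m = LINE_RE.match(line)
            if m and current is not None:
                sections[current].append(float(m.group(3)))
    return sections

def steady_state_median(times_ms):
    """Median build time, skipping iteration 0 (assumed to be a warm-up)."""
    return median(times_ms[1:] if len(times_ms) > 1 else times_ms)

# A few lines copied from the gcc run above.
SAMPLE = """\
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 467.245ms, 4.92247 Mprims/s, sah = 363.265 [DONE]
iteration 1: building BVH over 2300000 primitives, 189.316ms, 12.149 Mprims/s, sah = 363.265 [DONE]
iteration 2: building BVH over 2300000 primitives, 199.916ms, 11.5048 Mprims/s, sah = 363.265 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 793.211ms, 2.89961 Mprims/s, sah = 340.853 [DONE]
iteration 1: building BVH over 2300000 primitives, 436.282ms, 5.27182 Mprims/s, sah = 340.853 [DONE]
iteration 2: building BVH over 2300000 primitives, 440.721ms, 5.21872 Mprims/s, sah = 340.853 [DONE]
"""

sections = parse_build_log(SAMPLE)
for name, times in sections.items():
    print(f"{name}: median {steady_state_median(times):.1f} ms")
```

Using the median rather than the mean keeps a single slow iteration (thermal throttling, scheduler noise) from dominating the comparison.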

maikschulze commented 4 years ago

Thank you very much for posting the benchmark results. I will take a closer look at my "version" of the test and check other HW as well. So far, I don't have numbers for my iOS devices.

syoyo commented 4 years ago

And here is the result from Pixel4 + Termux. I have created another branch, non-glfw (https://github.com/lighttransport/embree-aarch64/tree/non-glfw), which builds bvh_builder without the glfw dependency.

clang 9.0.1

Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 372.282ms, 6.17811 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 248.929ms, 9.23958 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 267.801ms, 8.58847 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 262.416ms, 8.76471 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 265.919ms, 8.64924 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 261.571ms, 8.79303 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 272.91ms, 8.42769 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 260.978ms, 8.813 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 267.578ms, 8.59562 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 264.649ms, 8.69075 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 711.938ms, 3.23062 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 595.141ms, 3.86463 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 663.287ms, 3.46758 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 639.961ms, 3.59397 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 608.185ms, 3.78175 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 588.025ms, 3.9114 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 867.439ms, 2.65148 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 665.757ms, 3.45471 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 751.099ms, 3.06218 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 698.145ms, 3.29444 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1282.62ms, 1.7932 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 1331.65ms, 1.72718 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 1307.26ms, 1.7594 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 1374.44ms, 1.67341 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 1512.5ms, 1.52066 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 1470.11ms, 1.56451 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 1468.75ms, 1.56596 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 1307.38ms, 1.75924 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 1333.34ms, 1.725 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 1321.41ms, 1.74057 Mprims/s, sah = 339.806 [DONE]

So, the performance issue may come from the Android NDK build configuration.

The NDK toolchain adds some extra compiler flags:

https://android.googlesource.com/platform/ndk/+/master/build/cmake/android.toolchain.cmake#449

which may affect the performance.
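
One way to test this hypothesis is to compare the actual compiler command lines of the two builds. CMake can write them to `compile_commands.json` when configured with `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON`. The sketch below compares two entries of that file; the entries shown are hypothetical placeholders for illustration only (they are not taken from a real Termux or NDK build), and the helper names are invented.

```python
import shlex

def flags_of(entry):
    """Set of command-line tokens (without the compiler) for one compile_commands.json entry."""
    if "arguments" in entry:
        return set(entry["arguments"][1:])
    return set(shlex.split(entry["command"])[1:])

def flag_diff(entry_a, entry_b):
    """Return (flags only in A, flags only in B), each sorted."""
    a, b = flags_of(entry_a), flags_of(entry_b)
    return sorted(a - b), sorted(b - a)

# Hypothetical entries; real ones would come from json.load() on each
# build directory's compile_commands.json, matched by source file.
termux = {"file": "bvh.cpp", "command": "clang++ -O3 -DNDEBUG -c bvh.cpp"}
ndk = {"file": "bvh.cpp", "command": "clang++ -O3 -DNDEBUG -fstack-protector-strong -c bvh.cpp"}

only_termux, only_ndk = flag_diff(termux, ndk)
print("only in Termux build:", only_termux)
print("only in NDK build:", only_ndk)
```

Flags that appear on only one side are candidates to toggle one at a time while re-running the benchmark.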

syoyo commented 4 years ago

non-glfw branch built with the Android NDK (r21), run on Pixel4 (push the built binary to /data/local/tmp, add +x, then execute it):

# Uses the ANDROID_SDK_ROOT environment variable
ANDROID_NDK_ROOT=$ANDROID_SDK_ROOT/ndk-bundle

# CMake 3.6 or later required.
CMAKE_BIN=cmake

rm -rf build-android
mkdir build-android
cd build-android

$CMAKE_BIN -G Ninja -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK_ROOT/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_NATIVE_API_LEVEL=24 \
  -DANDROID_ARM_MODE=arm \
  -DANDROID_ARM_NEON=TRUE \
  -DANDROID_STL=c++_static \
  -DEMBREE_ARM=On \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..

cd ..
$ LD_LIBRARY_PATH=. ./bvh_builder
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 320.482ms, 7.17669 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 253.758ms, 9.06376 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 271.481ms, 8.47205 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 280.434ms, 8.20158 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 261.063ms, 8.81014 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 279.063ms, 8.24187 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 260.827ms, 8.8181 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 252.432ms, 9.11136 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 262.714ms, 8.75477 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 251.372ms, 9.14978 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 621.555ms, 3.7004 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 648.118ms, 3.54874 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 642.976ms, 3.57712 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 562.866ms, 4.08623 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 632.965ms, 3.63369 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 587.982ms, 3.91168 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 637.619ms, 3.60717 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 701.085ms, 3.28063 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 547.128ms, 4.20377 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 651.321ms, 3.53129 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1366.26ms, 1.68342 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 1221.68ms, 1.88265 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 1293.02ms, 1.77878 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 1244.98ms, 1.84742 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 1229.1ms, 1.87129 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 1267.59ms, 1.81447 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 1304.91ms, 1.76258 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 1373.47ms, 1.67459 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 1497.47ms, 1.53592 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 1228.78ms, 1.87178 Mprims/s, sah = 339.806 [DONE]
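
The deployment step described above (push to /data/local/tmp, add +x, run with `LD_LIBRARY_PATH=.`) can be scripted around `adb`. The sketch below only assembles the command lines; the local artifact paths are hypothetical and depend on where the build places its outputs, and the helper name is invented for this example.

```python
import shlex

DEVICE_DIR = "/data/local/tmp"

def deploy_commands(local_files, executable):
    """adb command lines to push files, mark the binary executable, and run it."""
    cmds = [["adb", "push", f, DEVICE_DIR + "/"] for f in local_files]
    cmds.append(["adb", "shell", f"chmod +x {DEVICE_DIR}/{executable}"])
    cmds.append(["adb", "shell", f"cd {DEVICE_DIR} && LD_LIBRARY_PATH=. ./{executable}"])
    return cmds

# Hypothetical artifact paths; adjust them to the actual build output.
commands = deploy_commands(
    ["build-android/bvh_builder", "build-android/libembree3.so"],
    "bvh_builder",
)
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)`.
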
syoyo commented 4 years ago

The same binary as used on the Pixel4, run on a ZenFone Max (M2) (Snapdragon 632, A53 cores):

Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1473.86ms, 1.56053 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 1187.6ms, 1.93669 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 1193.86ms, 1.92652 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 1168.3ms, 1.96867 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 1300.34ms, 1.76877 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 1525.81ms, 1.50739 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 1484.5ms, 1.54935 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 1217.63ms, 1.88892 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 1680.47ms, 1.36867 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 1207.49ms, 1.90477 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 2875.91ms, 0.799748 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 4150.64ms, 0.554132 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 2784.03ms, 0.82614 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 2826.1ms, 0.813842 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 2791.85ms, 0.823826 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 3473.95ms, 0.66207 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 2739.15ms, 0.839677 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 2735.06ms, 0.840931 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 3475.67ms, 0.661744 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 2811.84ms, 0.81797 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 5607.71ms, 0.410149 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 5948.25ms, 0.386668 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 5399.26ms, 0.425984 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 5374.93ms, 0.427912 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 5705.12ms, 0.403147 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 5565.67ms, 0.413247 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 5938.19ms, 0.387324 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 5233.09ms, 0.439511 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 5612.24ms, 0.409819 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 5466.35ms, 0.420756 Mprims/s, sah = 339.806 [DONE]

So, apparently the build times scale as expected with the quality level (low is fastest, high is slowest), even on A53 cores.

maikschulze commented 4 years ago

Hi,

after dissecting various commits and compiler settings which did not help at all, I've deployed my test application to my colleague's Android device. The issue disappeared!

An identical aarch64 binary results in the following measurements:

Xiaomi Mi 9T Pro (Snapdragon 855 CPU)

Time iteration (seconds): 0.025513
Time iteration (seconds): 0.015728
Time iteration (seconds): 0.014720
Time iteration (seconds): 0.014457
Time iteration (seconds): 0.016226
Time iteration (seconds): 0.014536
Time iteration (seconds): 0.014560
Time iteration (seconds): 0.014105
Time iteration (seconds): 0.015324
Time iteration (seconds): 0.014603
Low Quality Build Time (seconds): 0.161452
Low Quality Build Rate (Mprims/s): 6.193801
Low Quality Build Quality (SAH): 265.793854
Time iteration (seconds): 0.029731
Time iteration (seconds): 0.030952
Time iteration (seconds): 0.031231
Time iteration (seconds): 0.027482
Time iteration (seconds): 0.034549
Time iteration (seconds): 0.034750
Time iteration (seconds): 0.039156
Time iteration (seconds): 0.032344
Time iteration (seconds): 0.034419
Time iteration (seconds): 0.032151
Medium Quality Build Time (seconds): 0.326960
Medium Quality Build Rate (Mprims/s): 3.058480
Medium Quality Build Quality (SAH): 249.085358
Time iteration (seconds): 0.081488
Time iteration (seconds): 0.070334
Time iteration (seconds): 0.050496
Time iteration (seconds): 0.059259
Time iteration (seconds): 0.050183
Time iteration (seconds): 0.055322
Time iteration (seconds): 0.058191
Time iteration (seconds): 0.052493
Time iteration (seconds): 0.053196
Time iteration (seconds): 0.058377
High Quality Build Time (seconds): 0.589467
High Quality Build Rate (Mprims/s): 1.696447
High Quality Build Quality (SAH): 248.987671

OnePlus 3T (Snapdragon 821 CPU, from my previous tests)

Time iteration (seconds): 0.139856
Time iteration (seconds): 0.106543
Time iteration (seconds): 0.092181
Time iteration (seconds): 0.100685
Time iteration (seconds): 0.071749
Time iteration (seconds): 0.072782
Time iteration (seconds): 0.072107
Time iteration (seconds): 0.072892
Time iteration (seconds): 0.077812
Time iteration (seconds): 0.069211
Low Quality Build Time (seconds): 0.877871
Low Quality Build Rate (Mprims/s): 1.139120
Low Quality Build Quality (SAH): 265.860352
Time iteration (seconds): 0.047721
Time iteration (seconds): 0.059651
Time iteration (seconds): 0.057134
Time iteration (seconds): 0.047721
Time iteration (seconds): 0.049204
Time iteration (seconds): 0.048780
Time iteration (seconds): 0.054862
Time iteration (seconds): 0.053493
Time iteration (seconds): 0.052970
Time iteration (seconds): 0.061554
Medium Quality Build Time (seconds): 0.533335
Medium Quality Build Rate (Mprims/s): 1.874994
Medium Quality Build Quality (SAH): 249.218781
Time iteration (seconds): 0.130922
Time iteration (seconds): 0.126150
Time iteration (seconds): 0.127971
Time iteration (seconds): 0.146135
Time iteration (seconds): 0.154689
Time iteration (seconds): 0.140980
Time iteration (seconds): 0.188261
Time iteration (seconds): 0.157469
Time iteration (seconds): 0.211994
Time iteration (seconds): 0.205527
High Quality Build Time (seconds): 1.590357
High Quality Build Rate (Mprims/s): 0.628790
High Quality Build Quality (SAH): 248.880981

Personally, I'm surprised by this result because the relative floating/integer performance ratios almost match in GeekBench 5. Another factor may be the different threading mechanisms due to OS and hardware. I would argue that given the positive results by you, @syoyo , and from my colleague's device, the priority of this issue has decreased a lot.
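
To make the comparison between the two devices concrete, the summary values above can be checked and compared per quality level. The sketch below copies the reported totals from both listings. Multiplying build time by build rate recovers the primitive count behind each summary line, and dividing the rates gives the per-level speedup of the Snapdragon 855 over the 821. How the test application derives its totals is not shown in the thread, so this is only a consistency check on the reported numbers.

```python
# Summary values copied from the two listings above:
# (total build time in seconds, build rate in Mprims/s).
REPORTS = {
    "Snapdragon 855": {
        "low": (0.161452, 6.193801),
        "medium": (0.326960, 3.058480),
        "high": (0.589467, 1.696447),
    },
    "Snapdragon 821": {
        "low": (0.877871, 1.139120),
        "medium": (0.533335, 1.874994),
        "high": (1.590357, 0.628790),
    },
}

# Consistency check: time x rate gives the primitive count (in millions).
for device, levels in REPORTS.items():
    for level, (seconds, rate) in levels.items():
        print(f"{device} {level}: {seconds * rate:.3f} Mprims")

# Per-level speedup of the 855 over the 821.
speedup = {
    level: REPORTS["Snapdragon 855"][level][1] / REPORTS["Snapdragon 821"][level][1]
    for level in ("low", "medium", "high")
}
for level, factor in speedup.items():
    print(f"{level}: {factor:.2f}x")
```

If the older chip were uniformly slower, the three factors would be roughly equal. The low-quality factor is by far the largest, which is another way of seeing that only the low-quality builder is affected on the Snapdragon 821.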

I will invest into porting my test app to iOS and measure the mileage on A13 chips.

syoyo commented 4 years ago

@maikschulze Thanks for the performance report.

It looks like the issue is related to older-generation CPUs with a big.LITTLE architecture (Snapdragon 821). I have an Xperia X Performance (Snapdragon 820), so I will try to run a benchmark soon.

> I will invest into porting my test app to iOS and measure the mileage on A13 chips.

Good! A13 should run faster than Snapdragon 855!

syoyo commented 4 years ago

Here is the result from Xperia X Performance (Snapdragon 820):

134|SOV33:/data/local/tmp $ LD_LIBRARY_PATH=. ./bvh_builder
Low quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1844.27ms, 1.24711 Mprims/s, sah = 363.227 [DONE]
iteration 1: building BVH over 2300000 primitives, 1516.41ms, 1.51674 Mprims/s, sah = 363.227 [DONE]
iteration 2: building BVH over 2300000 primitives, 1507.61ms, 1.52559 Mprims/s, sah = 363.227 [DONE]
iteration 3: building BVH over 2300000 primitives, 1566.18ms, 1.46854 Mprims/s, sah = 363.227 [DONE]
iteration 4: building BVH over 2300000 primitives, 1511.98ms, 1.52118 Mprims/s, sah = 363.227 [DONE]
iteration 5: building BVH over 2300000 primitives, 1609.48ms, 1.42903 Mprims/s, sah = 363.227 [DONE]
iteration 6: building BVH over 2300000 primitives, 1515.24ms, 1.51791 Mprims/s, sah = 363.227 [DONE]
iteration 7: building BVH over 2300000 primitives, 1532.6ms, 1.50072 Mprims/s, sah = 363.227 [DONE]
iteration 8: building BVH over 2300000 primitives, 1503.24ms, 1.53003 Mprims/s, sah = 363.227 [DONE]
iteration 9: building BVH over 2300000 primitives, 1504.22ms, 1.52903 Mprims/s, sah = 363.227 [DONE]
Normal quality BVH build:
iteration 0: building BVH over 2300000 primitives, 1151.75ms, 1.99697 Mprims/s, sah = 340.895 [DONE]
iteration 1: building BVH over 2300000 primitives, 1114.04ms, 2.06457 Mprims/s, sah = 340.895 [DONE]
iteration 2: building BVH over 2300000 primitives, 1116.13ms, 2.0607 Mprims/s, sah = 340.895 [DONE]
iteration 3: building BVH over 2300000 primitives, 1100ms, 2.09092 Mprims/s, sah = 340.895 [DONE]
iteration 4: building BVH over 2300000 primitives, 1105.78ms, 2.07998 Mprims/s, sah = 340.895 [DONE]
iteration 5: building BVH over 2300000 primitives, 1118.25ms, 2.05679 Mprims/s, sah = 340.895 [DONE]
iteration 6: building BVH over 2300000 primitives, 1146.07ms, 2.00686 Mprims/s, sah = 340.895 [DONE]
iteration 7: building BVH over 2300000 primitives, 1135.19ms, 2.0261 Mprims/s, sah = 340.895 [DONE]
iteration 8: building BVH over 2300000 primitives, 1129.74ms, 2.03587 Mprims/s, sah = 340.895 [DONE]
iteration 9: building BVH over 2300000 primitives, 1106.5ms, 2.07862 Mprims/s, sah = 340.895 [DONE]
High quality BVH build:
iteration 0: building BVH over 2300000 primitives, 4017.36ms, 0.572516 Mprims/s, sah = 339.806 [DONE]
iteration 1: building BVH over 2300000 primitives, 4046.4ms, 0.568407 Mprims/s, sah = 339.806 [DONE]
iteration 2: building BVH over 2300000 primitives, 4088.83ms, 0.562509 Mprims/s, sah = 339.806 [DONE]
iteration 3: building BVH over 2300000 primitives, 4474.04ms, 0.514077 Mprims/s, sah = 339.806 [DONE]
iteration 4: building BVH over 2300000 primitives, 3881.43ms, 0.592565 Mprims/s, sah = 339.806 [DONE]
iteration 5: building BVH over 2300000 primitives, 3927.56ms, 0.585605 Mprims/s, sah = 339.806 [DONE]
iteration 6: building BVH over 2300000 primitives, 3890.86ms, 0.591128 Mprims/s, sah = 339.806 [DONE]
iteration 7: building BVH over 2300000 primitives, 3894.5ms, 0.590577 Mprims/s, sah = 339.806 [DONE]
iteration 8: building BVH over 2300000 primitives, 4020.9ms, 0.572012 Mprims/s, sah = 339.806 [DONE]
iteration 9: building BVH over 2300000 primitives, 3996.46ms, 0.575509 Mprims/s, sah = 339.806 [DONE]

LOW_QUALITY is slower than MEDIUM_QUALITY, as observed on the OnePlus 3T (Snapdragon 821). So the issue appears to be processor-specific (especially the Snapdragon 82x series).

maikschulze commented 4 years ago

Hi,

I've found time to port my benchmarks to other platforms. As a baseline to measure the performance and quality of the code changes from the last months, I've compiled the state https://github.com/lighttransport/embree-aarch64/commit/a9ab7e6392fb92af0e4c2d4db802561893c6cb51 . As a next step, I will test the newer contributions and report the results of my comparison.

Here are the measured BVH building rates for the state from June 2019:

OnePlus 3T (Snapdragon 821 CPU) @ Android arm64 NEON
Low Quality Build Rate (Mprims/s): 1.170872
Medium Quality Build Rate (Mprims/s): 1.373946
High Quality Build Rate (Mprims/s): 0.514601

Apple iPhone XS (A12 CPU) @ iOS arm64 NEON
Low Quality Build Rate (Mprims/s): 5.299200
Medium Quality Build Rate (Mprims/s): 3.624712
High Quality Build Rate (Mprims/s): 2.121203

Apple iPad Air (A12 CPU) @ iOS arm64 NEON
Low Quality Build Rate (Mprims/s): 5.232715
Medium Quality Build Rate (Mprims/s): 3.664790
High Quality Build Rate (Mprims/s): 2.060475

Apple iPad Pro (A12X CPU) @ iOS arm64 NEON
Low Quality Build Rate (Mprims/s): 6.667206
Medium Quality Build Rate (Mprims/s): 4.490212
High Quality Build Rate (Mprims/s): 2.712091

Google Pixelbook (i5-7Y57 CPU) @ Android x64 SSE2
Low Quality Build Rate (Mprims/s): 3.106385
Medium Quality Build Rate (Mprims/s): 1.424219
High Quality Build Rate (Mprims/s): 0.836109

Apple MacBook Pro (i9-9880H CPU) @ Windows 10 x64 SSE2
Low Quality Build Rate (Mprims/s): 14.665057
Medium Quality Build Rate (Mprims/s): 7.881895
High Quality Build Rate (Mprims/s): 5.461523

Apparently, the performance problem of the low-quality builder remains specific to the older gen chips such as Snapdragon 820 and Snapdragon 821. A12 chips do not show any performance problem either. I would therefore argue to close this issue and consider the originally raised issue a hardware problem, not a problem of the code base.
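
The conclusion can be checked mechanically from the rates listed above: on an unaffected device the low-quality builder should be faster than the medium-quality one, so the low/medium rate ratio should exceed 1. The sketch below copies the six reported results; the `BuildRates` type and the threshold of 1.0 are choices made for this illustration only.

```python
from typing import NamedTuple

class BuildRates(NamedTuple):
    """BVH build rates in Mprims/s for the three quality levels."""
    low: float
    medium: float
    high: float

# Rates copied from the measurements above (state from June 2019).
RATES = {
    "OnePlus 3T (Snapdragon 821)": BuildRates(1.170872, 1.373946, 0.514601),
    "Apple iPhone XS (A12)": BuildRates(5.299200, 3.624712, 2.121203),
    "Apple iPad Air (A12)": BuildRates(5.232715, 3.664790, 2.060475),
    "Apple iPad Pro (A12X)": BuildRates(6.667206, 4.490212, 2.712091),
    "Google Pixelbook (i5-7Y57)": BuildRates(3.106385, 1.424219, 0.836109),
    "Apple MacBook Pro (i9-9880H)": BuildRates(14.665057, 7.881895, 5.461523),
}

def low_to_medium_ratio(rates: BuildRates) -> float:
    """Ratio above 1 means the low-quality build is faster, as expected."""
    return rates.low / rates.medium

for device, rates in RATES.items():
    print(f"{device}: low/medium = {low_to_medium_ratio(rates):.2f}")

# Devices on which the low-quality builder is slower than the medium one.
anomalous = [d for d, r in RATES.items() if low_to_medium_ratio(r) < 1.0]
print("anomalous:", anomalous)
```

Only the Snapdragon 821 device falls below 1, which matches the hardware-specific explanation given above.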

syoyo commented 4 years ago

@maikschulze Thanks for the benchmark!

> I would therefore argue to close this issue and consider the originally raised issue a hardware problem, not a problem of the code base.

So can I close this issue?

maikschulze commented 4 years ago

> So can I close this issue?

Yes, please.

syoyo commented 4 years ago

> Yes, please.

Thanks. I'm closing the issue, since it looks like it's a hardware architecture issue.