Closed cj1128 closed 3 years ago
That's weird, not at all what I see. What compiler did you use? Can you please share the build log (make VERBOSE=1 or ninja -v)?
I just went into the test directory and ran this command: clang++ -DLIBDIVIDE_AVX2 -mavx2 -I.. benchmark.cpp
$ clang++ --version
Apple clang version 11.0.0 (clang-1100.0.33.12)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
You'll want to turn on optimizations. Try passing -O3 or -Os.
My bad. I compiled with -Os and everything worked as expected except for s32.
$ ./optimized.out s32
=== libdivide s32 benchmark ===
# system scalar scl_bf vector vec_bf gener algo
1 2.654 2.679 -1.000 0.553 0.000 3.536 0
-1 2.652 2.652 1.541 0.553 0.733 3.536 0
2 2.913 2.652 1.541 0.553 0.733 3.536 0
-2 2.913 2.652 1.541 0.553 0.733 3.536 0
3 2.913 2.656 1.539 0.557 0.735 7.960 2
-3 2.913 2.654 1.539 0.555 0.733 7.960 2
Failure on line 342
Failure on line 346
Failure on line 342
Failure on line 346
Failure on line 342
Failure on line 346
Failure on line 342
Failure on line 346
Failure on line 342
What does "Failure on line xxx" mean?
$ ./optimized.out u64
=== libdivide u64 benchmark ===
# system scalar scl_bf vector vec_bf gener algo
1 8.305 0.889 -1.000 0.247 0.000 3.240 0
2 7.959 0.885 1.167 0.239 0.701 3.240 0
3 7.961 1.208 1.170 0.610 0.705 26.441 1
4 7.961 0.887 1.170 0.249 0.703 3.240 0
5 7.961 1.210 1.172 0.608 0.707 26.441 1
6 7.961 1.191 1.170 0.608 0.707 26.441 1
7 7.961 1.334 1.170 0.699 0.701 28.084 2
8 7.959 0.885 1.170 0.241 0.701 3.240 0
9 7.961 1.203 1.167 0.602 0.701 26.441 1
10 7.963 1.212 1.168 0.608 0.707 26.441 1
11 7.985 1.243 1.182 0.612 0.711 26.441 1
12 7.961 1.197 1.170 0.608 0.707 26.441 1
$ ./optimized.out u32
=== libdivide u32 benchmark ===
# system scalar scl_bf vector vec_bf gener algo
1 3.539 0.906 -1.000 0.118 0.000 3.311 0
2 3.541 0.883 1.163 0.114 0.253 3.311 0
3 3.537 1.178 1.161 0.204 0.251 11.581 1
4 3.537 0.883 1.161 0.114 0.251 3.296 0
5 3.537 1.178 1.146 0.204 0.251 11.581 1
6 3.537 1.178 1.146 0.206 0.251 11.581 1
7 3.537 1.270 1.146 0.253 0.253 13.596 2
8 3.537 0.883 1.146 0.116 0.251 3.296 0
9 3.539 1.178 1.146 0.204 0.251 11.581 1
$ ./optimized.out s64
=== libdivide s64 benchmark ===
# system scalar scl_bf vector vec_bf gener algo
1 8.637 2.658 -1.000 1.119 0.000 4.128 0
-1 8.601 2.656 1.528 1.125 1.970 4.128 0
2 8.755 2.654 1.528 1.129 1.972 4.113 0
-2 8.690 2.656 1.528 1.127 1.970 4.128 0
3 8.641 2.656 1.526 1.702 1.972 30.066 2
-3 8.709 2.658 1.528 1.702 1.970 30.066 2
The "Failure" indicates that libdivide and hardware division disagree on the result. In this case, the problem is signed integer overflow in the benchmark tool (happily not in libdivide itself, only its test). Signed integer overflow is undefined behavior. Fixed in 858349bfc73d7669456edde96428df01ff6c615c. Nice catch!
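To illustrate the kind of bug described above, here is a minimal sketch of one common way to avoid signed overflow when accumulating results in a benchmark: sum into an unsigned type, where wraparound is well defined. This is not the code from the fix commit; the helper name sum_quotients and its shape are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not taken from the libdivide benchmark).
// Sums the quotients n / denom into an unsigned accumulator, so that
// wraparound is well defined (arithmetic modulo 2^32) rather than
// undefined behavior as it would be with a signed accumulator.
inline uint32_t sum_quotients(const std::vector<int32_t>& numers, int32_t denom) {
    assert(denom != 0);
    uint32_t sum = 0;
    for (int32_t n : numers) {
        // INT32_MIN / -1 overflows on its own, so skip that single pair.
        if (n == INT32_MIN && denom == -1) continue;
        sum += static_cast<uint32_t>(n / denom);
    }
    return sum;
}
```

Two implementations can then be compared by checking that their unsigned sums match. Building with -fsanitize=undefined on clang or GCC is another way to catch this class of bug at run time.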
Thank you for all the explanation.
I had the impression that division is very expensive on a computer, but it turns out that hardware s32 division is pretty fast, at the same level as scalar.
Do you have any idea why that is?
It looks like with -Os the compiler is emitting a separate function call instead of inlining it. I should mark those functions as force-inlined.
Try with -O3, you should see the expected results. Here's what I see:
clang++ -I. -O3 test/benchmark.cpp && ./a.out s32
=== libdivide s32 benchmark ===
# system scalar scl_bf vector vec_bf gener algo
1 1.709 0.174 -1.000 0.000 0.000 5.460 0
-1 1.709 0.175 0.587 0.000 0.000 5.460 0
2 1.709 0.175 0.587 0.000 0.000 5.460 0
-2 1.709 0.175 0.595 0.000 0.000 5.460 0
3 1.711 0.547 0.597 0.000 0.000 8.005 2
-3 1.711 0.546 0.603 0.000 0.000 8.005 2
4 1.709 0.175 0.589 0.000 0.000 5.460 0
-4 1.709 0.175 0.595 0.000 0.000 5.460 0
5 1.709 0.486 0.586 0.000 0.000 7.295 1
-5 1.709 0.500 0.595 0.000 0.000 7.295 1
6 1.709 0.546 0.589 0.000 0.000 7.990 2
-6 1.709 0.544 0.595 0.000 0.000 7.990 2
7 1.709 0.546 0.582 0.000 0.000 7.990 2
-7 1.709 0.544 0.587 0.000 0.000 7.990 2
8 1.709 0.175 0.578 0.000 0.000 5.460 0
so 3-4x faster.
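For context on why the scalar column can beat the hardware divider at all, here is a rough sketch of the general multiply-and-shift technique for dividing by a constant. It is not libdivide's implementation; the function name div3_mulshift is made up for this example, and it covers only unsigned 32-bit division by 3.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example, not part of libdivide.
// Unsigned 32-bit division by the constant 3 without a divide instruction.
// 0xAAAAAAAB equals (2^33 + 1) / 3, so (x * 0xAAAAAAAB) >> 33 equals x / 3
// for every 32-bit x. The product is computed in 64 bits to keep the
// high part that the shift extracts.
inline uint32_t div3_mulshift(uint32_t x) {
    const uint64_t magic = 0xAAAAAAABull;
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * magic) >> 33);
}
```

A multiply and a shift are typically cheaper than a hardware divide, but only if the helper is inlined into the loop, which is why the optimization level matters so much in the numbers above.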
61ae9cb48e6f17526e78202c8dc319f6f1dd2302 marks the functions as forced-inline which should improve performance on -Os.
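As a sketch of what forced inlining usually looks like in portable C++, the snippet below defines a macro over the compiler-specific attributes. The macro name FORCE_INLINE and the helper mulhi_u32 are assumptions for this example and are not copied from the commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macro name for this sketch; the commit may use a different one.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline  // fallback: a plain inline hint
#endif

// High 32 bits of a 32x32-bit product, the kind of tiny helper that
// should never become an out-of-line call in a hot loop.
static FORCE_INLINE uint32_t mulhi_u32(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}
```

With -Os the compiler weighs code size against speed and may decline to inline even small functions; an always-inline attribute overrides that heuristic.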
Incidentally the recommended way to run the benchmark is through the CMake file:
mkdir build && cd build
cmake .. && make
./benchmark s32
The CMake build takes some care to get the flags right.
I tend to check for optimized in my benchmark, just in case :)
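One way such a check could look is sketched below. It relies on the __OPTIMIZE__ macro, which GCC and Clang predefine when optimization is enabled; MSVC does not define it, so there the check would always report an unoptimized build. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: reports whether the translation unit was compiled
// with optimizations, based on the GCC/Clang __OPTIMIZE__ macro.
constexpr bool built_with_optimizations() {
#if defined(__OPTIMIZE__)
    return true;
#else
    return false;
#endif
}

// Prints a warning to stderr so that unoptimized numbers are not
// mistaken for real benchmark results.
inline void warn_if_unoptimized() {
    if (!built_with_optimizations()) {
        std::fprintf(stderr,
                     "warning: built without optimizations; "
                     "benchmark results are not meaningful\n");
    }
}
```

Calling warn_if_unoptimized() at the start of main would have flagged the original clang++ invocation without -O3 or -Os.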
I did what you said and it gave me the expected results. Thank you!
I compiled and ran the benchmark with AVX2 on my Mac and got a strange result.
It turns out that the hardware divide is the fastest one, and libdivide is much slower than the hardware.
I don't quite get it, maybe I did something wrong?
Here is my computer info: