On my mac, system divide gives the fastest result

cj1128 commented 3 years ago

I compiled and ran the benchmark with AVX2 on my mac and got a strange result.

                 === libdivide s32 benchmark ===

     #   system  scalar  scl_bf  vector  vec_bf   gener   algo
     1   3.241  14.439  -1.000   4.723   0.000   30.836   0
    -1   3.241  14.438  18.811   4.721   7.964   31.028   0
     2   3.241  14.449  18.800   4.723   7.966   30.836   0
    -2   3.243  14.523  18.819   4.721   7.968   31.028   0
     3   3.243  17.937  18.830   6.823   7.970   46.032   2
    -3   3.243  17.797  18.851   6.823   7.968   48.355   2
     4   3.241  14.438  18.733   4.721   7.966   30.821   0
    -4   3.243  14.445  18.733   4.721   7.966   31.028   0
     5   3.243  16.423  18.817   5.557   7.966   43.945   1
    -5   3.243  16.341  18.771   5.557   7.966   46.194   1
     6   3.243  17.783  18.735   6.823   7.968   46.017   2
    -6   3.243  17.867  18.735   6.823   7.966   48.355   2
     7   3.243  17.802  18.817   6.823   7.972   46.032   2
    -7   3.243  17.882  18.733   6.821   7.966   48.369   2
     8   3.243  14.483  18.815   4.721   7.966   30.836   0
    -8   3.243  14.445  18.828   4.723   7.968   31.043   0
     9   3.243  16.347  18.794   5.555   7.966   44.004   1
    -9   3.243  16.427  18.737   5.557   7.970   46.194   1
    10   3.243  16.345  18.798   5.555   7.966   43.931   1
   -10   3.243  16.349  18.733   5.557   7.966   46.180   1

It turns out that the hardware divide is the fastest one, and libdivide is much slower than the hardware.

I don't quite get it, maybe I did something wrong?

Here is my computer info:

$ uname -a
Darwin CJs-MacBook-Pro.local 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64 i386 MacBookPro11,4 Darwin
$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.14.6
BuildVersion:   18G95
$ sysctl -a | rg machdep.cpu
machdep.cpu.max_basic: 13
machdep.cpu.max_ext: 2147483656
machdep.cpu.vendor: GenuineIntel
machdep.cpu.brand_string: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
machdep.cpu.family: 6
machdep.cpu.model: 70
machdep.cpu.extmodel: 4
machdep.cpu.extfamily: 0
machdep.cpu.stepping: 1
machdep.cpu.feature_bits: 9221959987971750911
machdep.cpu.leaf7_feature_bits: 10155 0
machdep.cpu.leaf7_feature_bits_edx: 2617246720
machdep.cpu.extfeature_bits: 142473169152
machdep.cpu.signature: 263777
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 SMEP BMI2 ERMS INVPCID FPU_CSDS MDCLEAR IBRS STIBP L1DF SSBD
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT RDTSCP TSCI
machdep.cpu.logical_per_package: 16
machdep.cpu.cores_per_package: 8
machdep.cpu.microcode_version: 27
machdep.cpu.processor_flag: 5
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.sub_Cstates: 270624
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.xsave.extended_state: 7 832 832 0
machdep.cpu.xsave.extended_state1: 1 0 0 0
machdep.cpu.arch_perf.version: 3
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.L2_associativity: 8
machdep.cpu.cache.size: 256
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.tlb.shared: 1024
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.core_count: 4
machdep.cpu.thread_count: 8
machdep.cpu.tsc_ccc.numerator: 0
machdep.cpu.tsc_ccc.denominator: 0

ridiculousfish commented 3 years ago

That's weird, not at all what I see. What compiler did you use? Can you please share the build log? make VERBOSE=1 or ninja -v ?

cj1128 commented 3 years ago

I just went into the test dir and ran this command clang++ -DLIBDIVIDE_AVX2 -mavx2 -I.. benchmark.cpp.

$ clang++ --version
Apple clang version 11.0.0 (clang-1100.0.33.12)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

ridiculousfish commented 3 years ago

You'll want to turn on optimizations. Try passing -O3 or -Os.

cj1128 commented 3 years ago

My bad. I compiled with -Os and everything worked as expected except for s32.

$ ./optimized.out s32

                 === libdivide s32 benchmark ===

     #   system  scalar  scl_bf  vector  vec_bf   gener   algo
     1   2.654   2.679  -1.000   0.553   0.000    3.536   0
    -1   2.652   2.652   1.541   0.553   0.733    3.536   0
     2   2.913   2.652   1.541   0.553   0.733    3.536   0
    -2   2.913   2.652   1.541   0.553   0.733    3.536   0
     3   2.913   2.656   1.539   0.557   0.735    7.960   2
    -3   2.913   2.654   1.539   0.555   0.733    7.960   2
Failure on line 342
Failure on line 346
Failure on line 342
Failure on line 346
Failure on line 342
Failure on line 346
Failure on line 342
Failure on line 346
Failure on line 342

What does it mean by Failure on line xxx?

$ ./optimized.out u64

                 === libdivide u64 benchmark ===

     #   system  scalar  scl_bf  vector  vec_bf   gener   algo
     1   8.305   0.889  -1.000   0.247   0.000    3.240   0
     2   7.959   0.885   1.167   0.239   0.701    3.240   0
     3   7.961   1.208   1.170   0.610   0.705   26.441   1
     4   7.961   0.887   1.170   0.249   0.703    3.240   0
     5   7.961   1.210   1.172   0.608   0.707   26.441   1
     6   7.961   1.191   1.170   0.608   0.707   26.441   1
     7   7.961   1.334   1.170   0.699   0.701   28.084   2
     8   7.959   0.885   1.170   0.241   0.701    3.240   0
     9   7.961   1.203   1.167   0.602   0.701   26.441   1
    10   7.963   1.212   1.168   0.608   0.707   26.441   1
    11   7.985   1.243   1.182   0.612   0.711   26.441   1
    12   7.961   1.197   1.170   0.608   0.707   26.441   1
$ ./optimized.out u32

                 === libdivide u32 benchmark ===

     #   system  scalar  scl_bf  vector  vec_bf   gener   algo
     1   3.539   0.906  -1.000   0.118   0.000    3.311   0
     2   3.541   0.883   1.163   0.114   0.253    3.311   0
     3   3.537   1.178   1.161   0.204   0.251   11.581   1
     4   3.537   0.883   1.161   0.114   0.251    3.296   0
     5   3.537   1.178   1.146   0.204   0.251   11.581   1
     6   3.537   1.178   1.146   0.206   0.251   11.581   1
     7   3.537   1.270   1.146   0.253   0.253   13.596   2
     8   3.537   0.883   1.146   0.116   0.251    3.296   0
     9   3.539   1.178   1.146   0.204   0.251   11.581   1
$ ./optimized.out s64

                 === libdivide s64 benchmark ===

     #   system  scalar  scl_bf  vector  vec_bf   gener   algo
     1   8.637   2.658  -1.000   1.119   0.000    4.128   0
    -1   8.601   2.656   1.528   1.125   1.970    4.128   0
     2   8.755   2.654   1.528   1.129   1.972    4.113   0
    -2   8.690   2.656   1.528   1.127   1.970    4.128   0
     3   8.641   2.656   1.526   1.702   1.972   30.066   2
    -3   8.709   2.658   1.528   1.702   1.970   30.066   2

ridiculousfish commented 3 years ago

The "Failure" indicates that libdivide and hardware division are disagreeing on the result. In this case, the problem is signed integer overflow in the benchmark tool (happily not in libdivide itself, only its test). Signed integer overflow is undefined behavior. Fixed as 858349bfc73d7669456edde96428df01ff6c615c. Nice catch!

cj1128 commented 3 years ago

Thank you for all the explanation.

I had the impression that computer division is very expensive but it turns out that the hardware s32 division is pretty fast and at the same level as scalar.

Do you have any idea why is that?

ridiculousfish commented 3 years ago

It looks like with -Os the compiler is emitting a separate function call instead of inlining it. I should mark those functions as force-inlined.

Try with -O3, you should see predicted results. Here's what I see:

clang++ -I. -O3 test/benchmark.cpp && ./a.out s32

                 === libdivide s32 benchmark ===

     #   system  scalar  scl_bf  vector  vec_bf   gener   algo
     1   1.709   0.174  -1.000   0.000   0.000    5.460   0
    -1   1.709   0.175   0.587   0.000   0.000    5.460   0
     2   1.709   0.175   0.587   0.000   0.000    5.460   0
    -2   1.709   0.175   0.595   0.000   0.000    5.460   0
     3   1.711   0.547   0.597   0.000   0.000    8.005   2
    -3   1.711   0.546   0.603   0.000   0.000    8.005   2
     4   1.709   0.175   0.589   0.000   0.000    5.460   0
    -4   1.709   0.175   0.595   0.000   0.000    5.460   0
     5   1.709   0.486   0.586   0.000   0.000    7.295   1
    -5   1.709   0.500   0.595   0.000   0.000    7.295   1
     6   1.709   0.546   0.589   0.000   0.000    7.990   2
    -6   1.709   0.544   0.595   0.000   0.000    7.990   2
     7   1.709   0.546   0.582   0.000   0.000    7.990   2
    -7   1.709   0.544   0.587   0.000   0.000    7.990   2
     8   1.709   0.175   0.578   0.000   0.000    5.460   0

so 3-4x faster.

ridiculousfish commented 3 years ago

61ae9cb48e6f17526e78202c8dc319f6f1dd2302 marks the functions as forced-inline which should improve performance on -Os.

Incidentally the recommended way to run the benchmark is through the CMake file:

mkdir build && cd build
cmake .. && make
./benchmark s32

it takes some care to try to get the flags right.

yb303 commented 3 years ago

I tend to check for optimized in my benchmark, just in case :)

On Fri, 2 Apr 2021 at 19:54, ridiculousfish @.***> wrote:

61ae9cb https://github.com/ridiculousfish/libdivide/commit/61ae9cb48e6f17526e78202c8dc319f6f1dd2302 marks the functions as forced-inline which should improve performance on -Os.

Incidentally the recommended way to run the benchmark is through the CMake file:

mkdir build && cd build cmake .. && make ./benchmark s32

it takes some care to try to get the flags right.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ridiculousfish/libdivide/issues/68#issuecomment-812665180, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE2OXU76BZKEQJRWQGNJ3KTTGYHIDANCNFSM42CPOYQA .

cj1128 commented 3 years ago

61ae9cb marks the functions as forced-inline which should improve performance on -Os.

Incidentally the recommended way to run the benchmark is through the CMake file:
mkdir build && cd build
cmake .. && make
./benchmark s32
it takes some care to try to get the flags right.

I did what you said and it gave me the expected results. Thank you!

ridiculousfish / libdivide

On my mac, system divide gives the fastest result #68