jhjourdan / SIMD-math-prims

Vectorizable implementations of some mathematical functions
MIT License

How can I change it to double precision? #2

Closed ghost closed 5 years ago

ghost commented 5 years ago

Could you give some hints?

I installed Sollya. Its output has 156 bits, so it should be accurate enough for double precision. But I am not very familiar with the rest of the functions in simd_math_prims.h.

jhjourdan commented 5 years ago

If you want double precision, then I assume you want more than 24 bits of relative precision. In that case, you need to use a polynomial of higher degree, and the computation will slow down accordingly. Unless you are using vectorization, this might not be worth it.

Anyway, if you want more detail about the use of Sollya, you may want to read the following blog post:

http://gallium.inria.fr/blog/fast-vectorizable-math-approx/

ghost commented 5 years ago

As far as I can see, the code does not use the floor() function but some magic numbers instead:

float exp_cst1 = 2139095040.f; //0x1.fe00000000000p+30
float exp_cst2 = 0.f; 
val2 = 12102203.1615614f*val+1065353216.f;
// 12102203.1615614f == 0x1.fffffffffffc0p+22/ log(2)
// 1065353216.f == 0x1.fc00000000000p+29

I am wondering how I can get those numbers for double precision, as a substitute for the floor() function.

I want to use the vectorizable exp and log in chemical kinetics, so double precision is needed, along with slightly higher accuracy than 1e-5, such as 1e-7. I am not sure I can get enough speedup on CPUs with the AVX512 extension, but it is worth trying.

I installed Sollya and I can understand the floating-point tricks in the blog post you mentioned. The polynomial must be built from basis functions that vanish at both ends, so you used the basis (x-1)(x-2)x**i to keep the function continuous.

Besides, for evaluating higher-degree polynomials, Estrin's scheme or another polynomial evaluation method might be faster than Horner's method. But I am not sure whether that is the case for SIMD-vectorized functions.
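To make the comparison concrete, here is a minimal sketch of the two evaluation orders for a degree-7 polynomial. The function names and the coefficient layout (c[0] is the constant term) are assumptions for illustration and are not taken from simd_math_prims.h or from SLEEF.

```c
/* Horner's method: one chain of dependent multiply-adds. */
double horner7(const double c[8], double x) {
    double r = c[7];
    for (int i = 6; i >= 0; --i) {
        r = r * x + c[i];
    }
    return r;
}

/* Estrin's scheme: evaluate pairs of terms independently, then combine
   them with x^2 and x^4, which shortens the dependency chain. */
double estrin7(const double c[8], double x) {
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = c[0] + c[1] * x;
    double p23 = c[2] + c[3] * x;
    double p45 = c[4] + c[5] * x;
    double p67 = c[6] + c[7] * x;
    double q0 = p01 + p23 * x2;
    double q1 = p45 + p67 * x2;
    return q0 + q1 * x4;
}
```

Both functions compute the same polynomial. Estrin's scheme mainly helps when a single evaluation is latency-bound; in a vectorized loop many independent evaluations already overlap, so the gain may be small or absent.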

jhjourdan commented 5 years ago

I've just pushed 8067be0 for you. This commit implements double-precision, more accurate versions of the mathematical functions, in the same fashion as the old ones.

The main pitfall is that the new functions for log and exp, logapprox_d and expapprox_d, can only be vectorized on very recent Intel processors featuring the AVX512DQ extension. The reason is that double->int64 casts and 64-bit shifts are only supported in vectorized form by processors with this extension. I don't know which processor you are targeting, but I wouldn't bet that it supports it.

ghost commented 5 years ago

I still have a little confusion about the magic numbers.

First, the value of a (normal) double-precision number r is r = (-1)^S * 2^(E-1023) * (1 + M/2^52)

for a bit pattern of a number like:

|S|E_10, ... , E_0| M_51, M_50, ... , M_0|
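As a small illustration of this layout, the three fields can be read back from a double with shifts and masks. This is a sketch; the struct and function names are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE 754 binary64 value.
   For normal numbers: r = (-1)^sign * 2^(exponent - 1023) * (1 + mantissa / 2^52). */
typedef struct {
    uint64_t sign;      /* 1 bit   */
    uint64_t exponent;  /* 11 bits, biased by 1023 */
    uint64_t mantissa;  /* 52 bits */
} double_fields;

double_fields split_double(double r) {
    uint64_t bits;
    double_fields f;
    memcpy(&bits, &r, sizeof bits);   /* reinterpret the bytes, no conversion */
    f.sign = bits >> 63;
    f.exponent = (bits >> 52) & 0x7FF;
    f.mantissa = bits & 0x000FFFFFFFFFFFFFULL;
    return f;
}
```

For example, -2.5 is -1.25 * 2^1, so its sign is 1, its biased exponent is 1024, and its mantissa field is 0.25 * 2^52.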

Sorry for my poor C. I thought you used some shift technique to get the integer part and the fractional part of the number. Actually, you used val4i = (int64_t) val4; to convert the double to int64_t. It behaves like the floor() function (at least for positive numbers). It is a value-converting cast that truncates, rather than a reinterpret_cast. And the fractional part is obtained by some shift and mask operations.

jhjourdan commented 5 years ago

So the idea behind the implementation of exp is the following. First, you reduce the problem to computing an exponential in base 2 by dividing the input by log(2). Then, you compute the floor and use the resulting value as the floating-point exponent of the result. Finally, you approximate the exponential of the fractional part with a well-chosen polynomial. The product of the two parts gives the final result.

Now, how should we implement that in practice? We can of course not use the floor function from the standard library, because it is a library function, which would prevent vectorization. Casting to int directly would round toward 0, which is not what we want for negative input values, and, moreover, we would need a subsequent costly int->float conversion to get the fractional part.
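The rounding issue can be seen in a few lines. The sketch below uses hypothetical helper names: the first function is the plain cast, and the second adds a positive offset before the cast so that truncation acts as a floor. The offset 127 is chosen here because it is the float exponent bias; the result is only valid for inputs greater than -127.

```c
#include <stdint.h>

/* Plain cast: rounds toward zero, so -0.5f becomes 0 rather than -1. */
int32_t trunc_cast(float a) {
    return (int32_t)a;
}

/* Shift the value into the non-negative range first. Truncating a
   non-negative number is the same as taking its floor, so removing the
   offset afterwards yields floor(a) for a > -127. */
int32_t floor_via_bias(float a) {
    return (int32_t)(a + 127.0f) - 127;
}
```

In the exp implementation the offset does not even have to be removed again, since the biased value is exactly what the exponent field of the result needs.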

Instead, you convert the input value to a well-chosen fixed-point format (encoded in an integer), and use the high-order bits as the integral part of the input value and the low-order bits as the fractional part. Note that (in floats) exponents are biased by 127, so we will need to add 127 to the exponent somehow.

When using the right fixed-point format (i.e., 23 bits for the fractional part), we only need bitwise masking (no shifting) to synthesize the two floats of interest:
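A minimal sketch of this masking step for single precision follows. It reuses the constants quoted earlier in the thread; the function names are hypothetical, and the final polynomial is a plain Taylor series standing in for the Sollya-generated polynomial of the repository, so the accuracy is not the same as that of expapprox.

```c
#include <stdint.h>
#include <string.h>

static float float_from_bits(int32_t i) {
    float f;
    memcpy(&f, &i, sizeof f);   /* reinterpret the bits, no conversion */
    return f;
}

float exp_masking_sketch(float x) {
    /* (x / ln 2 + 127) * 2^23, clamped to the range of valid bit patterns. */
    float v = 12102203.1615614f * x + 1065353216.f;
    v = v < 2139095040.f ? v : 2139095040.f;
    v = v > 0.f ? v : 0.f;
    int32_t vi = (int32_t)v;

    /* High-order bits: the biased exponent, i.e. the float 2^floor(x / ln 2). */
    float pow2_int = float_from_bits(vi & 0x7F800000);
    /* Low-order 23 bits: the fractional part, tagged with exponent 127,
       which gives a float b in [1, 2). */
    float b = float_from_bits((vi & 0x7FFFFF) | 0x3F800000);

    /* Stand-in for the polynomial: Taylor series of 2^(b-1) = exp(t). */
    float t = (b - 1.0f) * 0.69314718f;
    float p = 1.0f + t * (1.0f + t * (1.0f / 2 + t * (1.0f / 6 + t * (1.0f / 24
            + t * (1.0f / 120 + t * (1.0f / 720 + t * (1.0f / 5040)))))));
    return pow2_int * p;
}
```

Neither of the two synthesized floats needs a shift: the exponent bits are already in place, and the mantissa bits only need the exponent field of 1.0 OR-ed in.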

So, I hope you now understand the values of the magic constants:
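The arithmetic behind the constants can be written down directly. In the sketch below, the float values match the ones quoted earlier in the thread; the double values are the analogous quantities obtained by the same reasoning (52 mantissa bits, bias 1023) and are an assumption about how the constants of the double version would be derived. All function names are hypothetical.

```c
#include <stdint.h>

/* Float: 23 mantissa bits, exponent bias 127, all-ones exponent 255. */
int32_t float_exp_bias_bits(void)  { return (int32_t)127 << 23; }  /* 1065353216 */
int32_t float_exp_upper_bits(void) { return (int32_t)255 << 23; }  /* 2139095040 */

/* Scale factor 2^23 / ln 2: multiplying x by it gives x / ln 2 in fixed
   point with 23 fractional bits. */
double float_exp_scale(void) { return 8388608.0 / 0.6931471805599453; }

/* Double, by the same reasoning: 52 mantissa bits, exponent bias 1023. */
int64_t double_exp_bias_bits(void) { return (int64_t)1023 << 52; }
double  double_exp_scale(void)     { return 4503599627370496.0 / 0.6931471805599453; }
```

So 1065353216.f is the exponent bias moved into the exponent field, 2139095040.f is the largest exponent field, and 12102203.1615614f is 2^23 / log(2).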

Hope this helps...

BTW, will you be able to use my double version of the functions, in the end?

ghost commented 5 years ago

Thank you very much for the implementation details.

I see the fixed point algorithm. You are constructing a binary fixed point number as a bit pattern like:

0bxxxxxx.yyyyyyyyyyyy

using floating-point arithmetic, and then converting it to the int32_t type.

The steps:

  1. a = x/log(2);
  2. b = a*(1<<23); // b now carries a in fixed point with 23 fractional bits, because the M part of a float is 23 bits wide.
  3. c = b + 127*(1<<23); // compensate for the exponent bias. c as a float is now integer-valued; it equals (a+127)*(1<<23).
  4. c = min(max(c, lower_limit), upper_limit);
  5. d = (int32_t)c; // convert the integer-valued float to int32_t (the double version does the analogous double -> int64_t conversion with the vcvttpd2qq instruction).

Now, d as an int32_t is an integer, and d as a bit pattern is a binary fixed-point number. The integer part of d as a bit pattern can be extracted with the & operation.

I still have a problem: how are the upper and lower limits derived?

double exp_cst1_d = 9218868437227405312.;
double exp_cst2_d = 0.;

Anyway, exp_d works, and its accuracy of rtol = 1e-9 is satisfactory even over the whole range [-300, +200] of my parameter. But log_d does not speed up as much as the single-precision version.

Here is the output:

(base) [root@JD SIMD-math-prims]# make
g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native
(base) [root@JD SIMD-math-prims]# ./test_fun
Sin functions:
--------------
Comparing the behavior of sinapprox against sinf, in the interval [-3.14159, 3.14159]:
Bias:                   -2.560597e-09
Mean absolute error:    3.676327e-06
RMS error:              4.109845e-06
Min difference:         -5.960464e-06
Max difference:         5.938113e-06

Comparing the behavior of sinapprox_d against sin, in the interval [-3.14159, 3.14159]:
Bias:                   1.151537e-12
Mean absolute error:    7.186951e-10
RMS error:              8.026634e-10
Min difference:         -1.145714e-09
Max difference:         1.145729e-09

Benchmarking sinf...    243.8M/s
Benchmarking sinapprox...    7876.9M/s
Benchmarking sin...    341.3M/s
Benchmarking sinapprox_d...    2694.7M/s

Cos functions:
--------------
Comparing the behavior of cosapprox against cosf, in the interval [-3.14159, 3.14159]:
Bias:                   5.895721e-06
Mean absolute error:    2.777401e-05
RMS error:              3.156854e-05
Min difference:         -4.571676e-05
Max difference:         4.577637e-05

Comparing the behavior of cosapprox_d against cos, in the interval [-3.14159, 3.14159]:
Bias:                   1.367417e-11
Mean absolute error:    7.584514e-11
RMS error:              8.525821e-11
Min difference:         -1.227697e-10
Max difference:         1.227682e-10

Benchmarking cosf...    262.6M/s
Benchmarking cosapprox...    9309.1M/s
Benchmarking cos...    379.3M/s
Benchmarking cosapprox_d...    2694.7M/s

Log functions:
--------------
Comparing the behavior of logapprox against logf, in the interval [1e-10, 10]:
Bias:                   7.545441e-08
Mean absolute error:    5.554401e-06
RMS error:              6.175767e-06
Min difference:         -8.821487e-06
Max difference:         8.940697e-06

Comparing the behavior of icsi_log against logf, in the interval [1e-10, 10]:
Bias:                   1.043380e-05
Mean absolute error:    1.756708e-04
RMS error:              2.058682e-04
Min difference:         -4.689693e-04
Max difference:         4.639626e-04

Comparing the behavior of logapprox_d against log, in the interval [1e-10, 10]:
Bias:                   -1.875015e-10
Mean absolute error:    2.895113e-09
RMS error:              3.218112e-09
Min difference:         -4.531327e-09
Max difference:         4.531155e-09

Benchmarking logf...    124.9M/s
Benchmarking icsi_log...    731.4M/s
Benchmarking logapprox...    4452.2M/s
Benchmarking log...    123.4M/s
Benchmarking logapprox_d...    307.5M/s

Exp functions:
--------------
Comparing the behavior of expapprox against expf, in the interval [-10, 10]:
Relative bias:                  -8.916760e-09
Mean relative error:            2.821630e-06
RMS relative error:             3.418452e-06
Min relative difference:        -8.690718e-06
Max relative difference:        8.117646e-06

Comparing the behavior of expapprox_d against exp, in the interval [-10, 10]:
Relative bias:                  -1.935681e-11
Mean relative error:            1.417697e-09
RMS relative error:             1.573945e-09
Min relative difference:        -2.218639e-09
Max relative difference:        2.218602e-09

Benchmarking expf...    131.3M/s
Benchmarking expapprox...    3413.3M/s
Benchmarking exp...    204.8M/s
Benchmarking expapprox_d...    1383.8M/s

(base) [root@JD SIMD-math-prims]# uname -a
Linux JD 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
(base) [root@JD SIMD-math-prims]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    2
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
Stepping:              4
CPU MHz:               3192.500
BogoMIPS:              6385.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ibrs ibpb fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat spec_ctrl
(base) [root@JD SIMD-math-prims]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

(base) [root@JD SIMD-math-prims]# gcc -march=native -Q --help=target
The following options are target specific:
  -m128bit-long-double                  [enabled]
  -m16                                  [disabled]
  -m32                                  [disabled]
  -m3dnow                               [disabled]
  -m3dnowa                              [disabled]
  -m64                                  [enabled]
  -m80387                               [enabled]
  -m8bit-idiv                           [disabled]
  -m96bit-long-double                   [disabled]
  -mabi=                                sysv
  -mabm                                 [enabled]
  -maccumulate-outgoing-args            [disabled]
  -maddress-mode=                       long
  -madx                                 [enabled]
  -maes                                 [enabled]
  -malign-data=                         compat
  -malign-double                        [disabled]
  -malign-functions=                    0
  -malign-jumps=                        0
  -malign-loops=                        0
  -malign-stringops                     [enabled]
  -mandroid                             [disabled]
  -march=                               skylake-avx512
  -masm=                                att
  -mavx                                 [enabled]
  -mavx2                                [enabled]
  -mavx256-split-unaligned-load         [disabled]
  -mavx256-split-unaligned-store        [disabled]
  -mavx5124fmaps                        [disabled]
  -mavx5124vnniw                        [disabled]
  -mavx512bitalg                        [disabled]
  -mavx512bw                            [enabled]
  -mavx512cd                            [enabled]
  -mavx512dq                            [enabled]
  -mavx512er                            [disabled]
  -mavx512f                             [enabled]
  -mavx512ifma                          [disabled]
  -mavx512pf                            [disabled]
  -mavx512vbmi                          [disabled]
  -mavx512vbmi2                         [disabled]
  -mavx512vl                            [enabled]
  -mavx512vnni                          [disabled]
  -mavx512vpopcntdq                     [disabled]
  -mbionic                              [disabled]
  -mbmi                                 [enabled]
  -mbmi2                                [enabled]
  -mbranch-cost=<0,5>                   3
  -mcall-ms2sysv-xlogues                [disabled]
  -mcet
  -mcet-switch                          [disabled]
  -mcld                                 [disabled]
  -mclflushopt                          [enabled]
  -mclwb                                [enabled]
  -mclzero                              [disabled]
  -mcmodel=                             [default]
  -mcpu=
  -mcrc32                               [disabled]
  -mcx16                                [enabled]
  -mdispatch-scheduler                  [disabled]
  -mdump-tune-features                  [disabled]
  -mf16c                                [enabled]
  -mfancy-math-387                      [enabled]
  -mfentry                              [disabled]
  -mfma                                 [enabled]
  -mfma4                                [disabled]
  -mforce-drap                          [disabled]
  -mforce-indirect-call                 [disabled]
  -mfp-ret-in-387                       [enabled]
  -mfpmath=                             sse
  -mfsgsbase                            [enabled]
  -mfunction-return=                    keep
  -mfused-madd
  -mfxsr                                [enabled]
  -mgeneral-regs-only                   [disabled]
  -mgfni                                [disabled]
  -mglibc                               [enabled]
  -mhard-float                          [enabled]
  -mhle                                 [enabled]
  -miamcu                               [disabled]
  -mieee-fp                             [enabled]
  -mincoming-stack-boundary=            0
  -mindirect-branch-register            [disabled]
  -mindirect-branch=                    keep
  -minline-all-stringops                [disabled]
  -minline-stringops-dynamically        [disabled]
  -mintel-syntax
  -mlarge-data-threshold=<number>       65536
  -mlong-double-128                     [disabled]
  -mlong-double-64                      [disabled]
  -mlong-double-80                      [enabled]
  -mlwp                                 [disabled]
  -mlzcnt                               [enabled]
  -mmemcpy-strategy=
  -mmemset-strategy=
  -mmitigate-rop                        [disabled]
  -mmmx                                 [enabled]
  -mmovbe                               [enabled]
  -mmovdir64b                           [disabled]
  -mmovdiri                             [disabled]
  -mmpx                                 [disabled]
  -mms-bitfields                        [disabled]
  -mmusl                                [disabled]
  -mmwaitx                              [disabled]
  -mno-align-stringops                  [disabled]
  -mno-default                          [disabled]
  -mno-fancy-math-387                   [disabled]
  -mno-push-args                        [disabled]
  -mno-red-zone                         [disabled]
  -mno-sse4                             [disabled]
  -mnop-mcount                          [disabled]
  -momit-leaf-frame-pointer             [disabled]
  -mpc32                                [disabled]
  -mpc64                                [disabled]
  -mpc80                                [disabled]
  -mpclmul                              [enabled]
  -mpcommit                             [disabled]
  -mpconfig                             [disabled]
  -mpku                                 [disabled]
  -mpopcnt                              [enabled]
  -mprefer-avx128
  -mprefer-vector-width=                256
  -mpreferred-stack-boundary=           0
  -mprefetchwt1                         [disabled]
  -mprfchw                              [enabled]
  -mpush-args                           [enabled]
  -mrdpid                               [disabled]
  -mrdrnd                               [enabled]
  -mrdseed                              [enabled]
  -mrecip                               [disabled]
  -mrecip=
  -mrecord-mcount                       [disabled]
  -mred-zone                            [enabled]
  -mregparm=                            6
  -mrtd                                 [disabled]
  -mrtm                                 [enabled]
  -msahf                                [enabled]
  -msgx                                 [disabled]
  -msha                                 [disabled]
  -mshstk                               [disabled]
  -mskip-rax-setup                      [disabled]
  -msoft-float                          [disabled]
  -msse                                 [enabled]
  -msse2                                [enabled]
  -msse2avx                             [disabled]
  -msse3                                [enabled]
  -msse4                                [enabled]
  -msse4.1                              [enabled]
  -msse4.2                              [enabled]
  -msse4a                               [disabled]
  -msse5
  -msseregparm                          [disabled]
  -mssse3                               [enabled]
  -mstack-arg-probe                     [disabled]
  -mstack-protector-guard-offset=
  -mstack-protector-guard-reg=
  -mstack-protector-guard-symbol=
  -mstack-protector-guard=              tls
  -mstackrealign                        [disabled]
  -mstringop-strategy=                  [default]
  -mstv                                 [enabled]
  -mtbm                                 [disabled]
  -mtls-dialect=                        gnu
  -mtls-direct-seg-refs                 [enabled]
  -mtune-ctrl=
  -mtune=                               skylake-avx512
  -muclibc                              [disabled]
  -mvaes                                [disabled]
  -mveclibabi=                          [default]
  -mvect8-ret-in-mem                    [disabled]
  -mvpclmulqdq                          [disabled]
  -mvzeroupper                          [enabled]
  -mwbnoinvd                            [disabled]
  -mx32                                 [disabled]
  -mxop                                 [disabled]
  -mxsave                               [enabled]
  -mxsavec                              [enabled]
  -mxsaveopt                            [enabled]
  -mxsaves                              [disabled]

  Known assembler dialects (for use with the -masm= option):
    att intel

  Known ABIs (for use with the -mabi= option):
    ms sysv

  Known code models (for use with the -mcmodel= option):
    32 kernel large medium small

  Valid arguments to -mfpmath=:
    387 387+sse 387,sse both sse sse+387 sse,387

  Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
    keep thunk thunk-extern thunk-inline

  Known data alignment choices (for use with the -malign-data= option):
    abi cacheline compat

  Known vectorization library ABIs (for use with the -mveclibabi= option):
    acml svml

  Known address mode (for use with the -maddress-mode= option):
    long short

  Known preferred register vector length (to use with the -mprefer-vector-width= option)
    128 256 512 none

  Known stack protector guard (for use with the -mstack-protector-guard= option):
    global tls

  Valid arguments to -mstringop-strategy=:
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop

  Known TLS dialects (for use with the -mtls-dialect= option):
    gnu gnu2

I also tried Estrin's scheme from SLEEF; it does not speed things up.

ghost commented 5 years ago

I just found that, in order to obtain better performance, I need to pass the -mavx512f -mavx512dq options to GCC.

With godbolt.org, I confirmed that zmm registers are used with these options.

(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11  -mavx512f -mavx512dq
(base) [root@JD SIMD-math-prims]# test_fun
-bash: test_fun: command not found
(base) [root@JD SIMD-math-prims]# ./test_fun
Sin functions:
--------------
Comparing the behavior of sinapprox against sinf, in the interval [-3.14159, 3.14159]:
Bias:                   -2.560597e-09
Mean absolute error:    3.676327e-06
RMS error:              4.109845e-06
Min difference:         -5.960464e-06
Max difference:         5.938113e-06

Comparing the behavior of sinapprox_d against sin, in the interval [-3.14159, 3.14159]:
Bias:                   1.151537e-12
Mean absolute error:    7.186951e-10
RMS error:              8.026634e-10
Min difference:         -1.145714e-09
Max difference:         1.145729e-09

Benchmarking sinf...    243.8M/s
Benchmarking sinapprox...    14628.6M/s
Benchmarking sin...    341.3M/s
Benchmarking sinapprox_d...    4876.2M/s

Cos functions:
--------------
Comparing the behavior of cosapprox against cosf, in the interval [-3.14159, 3.14159]:
Bias:                   5.895721e-06
Mean absolute error:    2.777401e-05
RMS error:              3.156854e-05
Min difference:         -4.571676e-05
Max difference:         4.577637e-05

Comparing the behavior of cosapprox_d against cos, in the interval [-3.14159, 3.14159]:
Bias:                   1.367417e-11
Mean absolute error:    7.584514e-11
RMS error:              8.525821e-11
Min difference:         -1.227697e-10
Max difference:         1.227682e-10

Benchmarking cosf...    262.6M/s
Benchmarking cosapprox...    17066.7M/s
Benchmarking cos...    379.3M/s
Benchmarking cosapprox_d...    4876.2M/s

Log functions:
--------------
Comparing the behavior of logapprox against logf, in the interval [1e-10, 10]:
Bias:                   7.545441e-08
Mean absolute error:    5.554401e-06
RMS error:              6.175767e-06
Min difference:         -8.821487e-06
Max difference:         8.940697e-06

Comparing the behavior of icsi_log against logf, in the interval [1e-10, 10]:
Bias:                   1.043380e-05
Mean absolute error:    1.756708e-04
RMS error:              2.058682e-04
Min difference:         -4.689693e-04
Max difference:         4.639626e-04

Comparing the behavior of logapprox_d against log, in the interval [1e-10, 10]:
Bias:                   -1.875015e-10
Mean absolute error:    2.895113e-09
RMS error:              3.218112e-09
Min difference:         -4.531327e-09
Max difference:         4.531155e-09

Benchmarking logf...    124.9M/s
Benchmarking icsi_log...    787.7M/s
Benchmarking logapprox...    7876.9M/s
Benchmarking log...    124.9M/s
Benchmarking logapprox_d...    1932.1M/s

Exp functions:
--------------
Comparing the behavior of expapprox against expf, in the interval [-10, 10]:
Relative bias:                  -8.916760e-09
Mean relative error:            2.821630e-06
RMS relative error:             3.418452e-06
Min relative difference:        -8.690718e-06
Max relative difference:        8.117646e-06

Comparing the behavior of expapprox_d against exp, in the interval [-10, 10]:
Relative bias:                  2.779902e-09
Mean relative error:            5.930965e-08
RMS relative error:             6.559258e-08
Min relative difference:        -9.237607e-08
Max relative difference:        9.237933e-08

Benchmarking expf...    129.6M/s
Benchmarking expapprox...    6023.5M/s
Benchmarking exp...    209.0M/s
Benchmarking expapprox_d...    2560.0M/s

jhjourdan commented 5 years ago

I still have a problem: how are the upper and lower limits derived?

double exp_cst1_d = 9218868437227405312.;
double exp_cst2_d = 0.;

These values are chosen to prevent overflow when synthesizing the exponent floating point number.
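To see what the bounds protect against, here is a minimal sketch of the clamp followed by the cast, using the two double constants quoted above. The function names are hypothetical. The upper bound equals 2047 * 2^52, which is the bit pattern of +infinity, and it still fits in an int64_t; without the clamp, a very large input would be cast to an integer outside the int64_t range, which is undefined behavior in C.

```c
#include <stdint.h>
#include <string.h>

/* Clamp the scaled value to [0, 2047 * 2^52] before converting it. */
int64_t clamp_then_cast(double v) {
    const double upper = 9218868437227405312.;  /* 2047 * 2^52 */
    const double lower = 0.;
    v = v < upper ? v : upper;
    v = v > lower ? v : lower;
    return (int64_t)v;
}

/* Reinterpret an integer bit pattern as a double. */
double double_from_bits(int64_t i) {
    double d;
    memcpy(&d, &i, sizeof d);
    return d;
}
```

With the clamp, inputs that are too large produce an exponent field of all ones (infinity) and inputs that are too small produce an exponent field of zero.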

Anyway, the exp_d works and its accuracy of rtol = 1e-9 is satisfactory even in the whole range [-300, +200] of my parameter.

Good. BTW, if you want an extra burst of performance, you can remove these bounds checks, which actually take quite a bit of time.

jhjourdan commented 5 years ago

I just found that in order to obtain better performance, I need to use -mavx512f -mavx512dq option in GCC.

That's interesting, because your gcc -march=native -Q --help=target seems to indicate that these flags are already activated...

Anyway, is there anything I can do now, or can I close the issue?

jhjourdan commented 5 years ago

Comparing the behavior of expapprox_d against exp, in the interval [-10, 10]:
Relative bias:                  2.779902e-09
Mean relative error:            5.930965e-08
RMS relative error:             6.559258e-08
Min relative difference:        -9.237607e-08
Max relative difference:        9.237933e-08

Have you changed the implementation of expapprox_d? On my computer, the error is much smaller.

ghost commented 5 years ago

I still have a problem: logapprox_d cannot be vectorized with AVX2's 256-bit instructions.

g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=256 -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:126:3: note: loop vectorized 
###   bench_fun_f(logapprox_d, 1000000L); is at line no. 128. Not vectorized.
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)

# test result
Comparing the behavior of logapprox_d against log, in the interval [1e-10, 10]:
Bias:                   -1.875015e-10
Mean absolute error:    2.895113e-09
RMS error:              3.218112e-09
Min difference:         -4.531327e-09
Max difference:         4.531155e-09

Benchmarking logf...    126.4M/s
Benchmarking icsi_log...    731.4M/s
Benchmarking logapprox...    4452.2M/s
Benchmarking log...    123.4M/s
Benchmarking logapprox_d...    308.4M/s

But I see a lot of ymm registers being used on godbolt.

ghost commented 5 years ago

With the -mprefer-vector-width=512 option, GCC 8.3.1-3 works as expected.

(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=512 -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:128:3: note: loop vectorized
test_fun.cpp:126:3: note: loop vectorized
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop with 3 iterations completely unrolled (header execution count 7087537)
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)

ghost commented 5 years ago

I just found that even the 256-bit version of the vpsraq instruction seems to come from the AVX-512 extensions (AVX512VL) rather than from AVX2.
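For context, this is the kind of operation where a 64-bit arithmetic right shift shows up in a log approximation: splitting a double into its exponent and a mantissa in [1, 2), so that log(x) = e * log(2) + log(m). The sketch below is an illustration with a hypothetical function name, not the code of logapprox_d; it assumes a positive, normal input and subtracts the bias explicitly.

```c
#include <stdint.h>
#include <string.h>

/* Split a positive, normal double x into e and m with x = m * 2^e, m in [1, 2). */
void split_for_log(double x, int64_t *e, double *m) {
    int64_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* 64-bit shift on a signed integer: the operation that needs a
       vectorized 64-bit arithmetic shift when the loop is vectorized. */
    *e = (bits >> 52) - 1023;
    /* Keep the 52 mantissa bits and force the exponent field to 1023. */
    bits = (bits & 0x000FFFFFFFFFFFFFLL) | 0x3FF0000000000000LL;
    memcpy(m, &bits, sizeof bits);
}
```

A polynomial in m then only has to approximate log on [1, 2), and the exponent e must be converted back to floating point, which is the other 64-bit operation mentioned earlier in the thread.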

jhjourdan commented 5 years ago

I still have a problem: logapprox_d cannot be vectorized with AVX2's 256-bit instructions.

It seems like godbolt did the vectorization, but the GCC you are using on your machine did not.

Which version of GCC are you using?

ghost commented 5 years ago

(base) [root@JD SIMD-math-prims]# git status
On branch master
Your branch is up to date with 'origin/master'.

(base) [root@JD SIMD-math-prims]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=512 -mavx512dq -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:128:3: note: loop vectorized // vectorized!
test_fun.cpp:126:3: note: loop vectorized
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop with 3 iterations completely unrolled (header execution count 7087537)
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)
(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=256 -mavx512dq -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:126:3: note: loop vectorized //line 128 is not vectorized.
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)

-v -Q --help=target -march=native output:

  1. This is from Godbolt using GCC 8.3.0

    
    The following options are target specific:
    
    -m128bit-long-double              [enabled]
    
    -m16                              [disabled]
    
    -m32                              [disabled]
    
    -m3dnow                           [disabled]
    
    -m3dnowa                          [disabled]
    
    -m64                              [enabled]
    
    -m80387                           [enabled]
    
    -m8bit-idiv                       [disabled]
    
    -m96bit-long-double               [disabled]
    
    -mabi=                            sysv
    
    -mabm                             [enabled]
    
    -maccumulate-outgoing-args        [disabled]
    
    -maddress-mode=                   long
    
    -madx                             [enabled]
    
    -maes                             [enabled]
    
    -malign-data=                     compat
    
    -malign-double                    [disabled]
    
    -malign-functions=                0
    
    -malign-jumps=                    0
    
    -malign-loops=                    0
    
    -malign-stringops                 [enabled]
    
    -mandroid                         [disabled]
    
    -march=                           skylake-avx512
    
    -masm=                            intel
    
    -mavx                             [enabled]
    
    -mavx2                            [enabled]
    
    -mavx256-split-unaligned-load     [disabled]
    
    -mavx256-split-unaligned-store    [disabled]
    
    -mavx5124fmaps                    [disabled]
    
    -mavx5124vnniw                    [disabled]
    
    -mavx512bitalg                    [disabled]
    
    -mavx512bw                        [enabled]
    
    -mavx512cd                        [enabled]
    
    -mavx512dq                        [enabled]
    
    -mavx512er                        [disabled]
    
    -mavx512f                         [enabled]
    
    -mavx512ifma                      [disabled]
    
    -mavx512pf                        [disabled]
    
    -mavx512vbmi                      [disabled]
    
    -mavx512vbmi2                     [disabled]
    
    -mavx512vl                        [enabled]
    
    -mavx512vnni                      [disabled]
    
    -mavx512vpopcntdq                 [disabled]
    
    -mbionic                          [disabled]
    
    -mbmi                             [enabled]
    
    -mbmi2                            [enabled]
    
    -mbranch-cost=<0,5>               3
    
    -mcall-ms2sysv-xlogues            [disabled]
    
    -mcet-switch                      [disabled]
    
    -mcld                             [disabled]
    
    -mclflushopt                      [enabled]
    
    -mclwb                            [enabled]
    
    -mclzero                          [disabled]
    
    -mcmodel=                         [default]
    
    -mcpu=                            
    
    -mcrc32                           [disabled]
    
    -mcx16                            [enabled]
    
    -mdispatch-scheduler              [disabled]
    
    -mdump-tune-features              [disabled]
    
    -mf16c                            [enabled]
    
    -mfancy-math-387                  [enabled]
    
    -mfentry                          [disabled]
    
    -mfma                             [enabled]
    
    -mfma4                            [disabled]
    
    -mforce-drap                      [disabled]
    
    -mforce-indirect-call             [disabled]
    
    -mfp-ret-in-387                   [enabled]
    
    -mfpmath=                         sse
    
    -mfsgsbase                        [enabled]
    
    -mfunction-return=                keep
    
    -mfused-madd                      
    
    -mfxsr                            [enabled]
    
    -mgeneral-regs-only               [disabled]
    
    -mgfni                            [disabled]
    
    -mglibc                           [enabled]
    
    -mhard-float                      [enabled]
    
    -mhle                             [enabled]
    
    -miamcu                           [disabled]
    
    -mieee-fp                         [disabled]
    
    -mincoming-stack-boundary=        0
    
    -mindirect-branch-register        [disabled]
    
    -mindirect-branch=                keep
    
    -minline-all-stringops            [disabled]
    
    -minline-stringops-dynamically    [disabled]
    
    -mintel-syntax                    
    
    -mlarge-data-threshold=<number>   65536
    
    -mlong-double-128                 [disabled]
    
    -mlong-double-64                  [disabled]
    
    -mlong-double-80                  [enabled]
    
    -mlwp                             [disabled]
    
    -mlzcnt                           [enabled]
    
    -mmemcpy-strategy=                
    
    -mmemset-strategy=                
    
    -mmitigate-rop                    [disabled]
    
    -mmmx                             [enabled]
    
    -mmovbe                           [enabled]
    
    -mmovdir64b                       [disabled]
    
    -mmovdiri                         [disabled]
    
    -mmpx                             [disabled]
    
    -mms-bitfields                    [disabled]
    
    -mmusl                            [disabled]
    
    -mmwaitx                          [disabled]
    
    -mno-align-stringops              [disabled]
    
    -mno-default                      [disabled]
    
    -mno-fancy-math-387               [disabled]
    
    -mno-push-args                    [disabled]
    
    -mno-red-zone                     [disabled]
    
    -mno-sse4                         [disabled]
    
    -mnop-mcount                      [disabled]
    
    -momit-leaf-frame-pointer         [disabled]
    
    -mpc32                            [disabled]
    
    -mpc64                            [disabled]
    
    -mpc80                            [disabled]
    
    -mpclmul                          [enabled]
    
    -mpcommit                         [disabled]
    
    -mpconfig                         [disabled]
    
    -mpku                             [enabled]
    
    -mpopcnt                          [enabled]
    
    -mprefer-avx128                   
    
    -mprefer-vector-width=            256
    
    -mpreferred-stack-boundary=       0
    
    -mprefetchwt1                     [disabled]
    
    -mprfchw                          [enabled]
    
    -mpush-args                       [enabled]
    
    -mrdpid                           [disabled]
    
    -mrdrnd                           [enabled]
    
    -mrdseed                          [enabled]
    
    -mrecip                           [disabled]
    
    -mrecip=                          
    
    -mrecord-mcount                   [disabled]
    
    -mred-zone                        [enabled]
    
    -mregparm=                        6
    
    -mrtd                             [disabled]
    
    -mrtm                             [enabled]
    
    -msahf                            [enabled]
    
    -msgx                             [disabled]
    
    -msha                             [disabled]
    
    -mshstk                           [disabled]
    
    -mskip-rax-setup                  [disabled]
    
    -msoft-float                      [disabled]
    
    -msse                             [enabled]
    
    -msse2                            [enabled]
    
    -msse2avx                         [disabled]
    
    -msse3                            [enabled]
    
    -msse4                            [enabled]
    
    -msse4.1                          [enabled]
    
    -msse4.2                          [enabled]
    
    -msse4a                           [disabled]
    
    -msse5                            
    
    -msseregparm                      [disabled]
    
    -mssse3                           [enabled]
    
    -mstack-arg-probe                 [disabled]
    
    -mstack-protector-guard-offset=   
    
    -mstack-protector-guard-reg=      
    
    -mstack-protector-guard-symbol=   
    
    -mstack-protector-guard=          tls
    
    -mstackrealign                    [disabled]
    
    -mstringop-strategy=              [default]
    
    -mstv                             [enabled]
    
    -mtbm                             [disabled]
    
    -mtls-dialect=                    gnu
    
    -mtls-direct-seg-refs             [enabled]
    
    -mtune-ctrl=                      
    
    -mtune=                           skylake-avx512
    
    -muclibc                          [disabled]
    
    -mvaes                            [disabled]
    
    -mveclibabi=                      [default]
    
    -mvect8-ret-in-mem                [disabled]
    
    -mvpclmulqdq                      [disabled]
    
    -mvzeroupper                      [enabled]
    
    -mwbnoinvd                        [disabled]
    
    -mx32                             [disabled]
    
    -mxop                             [disabled]
    
    -mxsave                           [enabled]
    
    -mxsavec                          [enabled]
    
    -mxsaveopt                        [enabled]
    
    -mxsaves                          [enabled]
    
    Known assembler dialects (for use with the -masm= option):
    
    att intel
    
    Known ABIs (for use with the -mabi= option):
    
    ms sysv
    
    Known code models (for use with the -mcmodel= option):
    
    32 kernel large medium small
    
    Valid arguments to -mfpmath=:
    
    387 387+sse 387,sse both sse sse+387 sse,387
    
    Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
    
    keep thunk thunk-extern thunk-inline
    
    Known data alignment choices (for use with the -malign-data= option):
    
    abi cacheline compat
    
    Known vectorization library ABIs (for use with the -mveclibabi= option):
    
    acml svml
    
    Known address mode (for use with the -maddress-mode= option):
    
    long short
    
    Known preferred register vector length (to use with the -mprefer-vector-width= option)
    
    128 256 512 none
    
    Known stack protector guard (for use with the -mstack-protector-guard= option):
    
    global tls
    
    Valid arguments to -mstringop-strategy=:
    
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
    
    Known TLS dialects (for use with the -mtls-dialect= option):
    
    gnu gnu2

Using built-in specs.

COLLECT_GCC=/opt/compiler-explorer/gcc-8.3.0/bin/gcc

Target: x86_64-linux-gnu

Configured with: ../gcc-8.3.0/configure --prefix=/opt/compiler-explorer/gcc-build/staging --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap --enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran,ada --enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto --enable-plugins --enable-threads=posix --with-pkgversion=Compiler-Explorer-Build

Thread model: posix

gcc version 8.3.0 (Compiler-Explorer-Build)

COLLECT_GCC_OPTIONS='-fdiagnostics-color=always' '-g' '-o' './output.s' '-masm=intel' '-S' '-v' '-O3' '-ffast-math' '-fopenmp' '-march=native' '-Q' '--help=target' '-pthread'

/opt/compiler-explorer/gcc-8.3.0/bin/../libexec/gcc/x86_64-linux-gnu/8.3.0/cc1 -v -imultiarch x86_64-linux-gnu -iprefix /opt/compiler-explorer/gcc-8.3.0/bin/../lib/gcc/x86_64-linux-gnu/8.3.0/ -D_REENTRANT help-dummy -march=skylake-avx512 -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mavx512f -mno-avx512er -mavx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mavx512dq -mavx512bw -mavx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mclwb -mno-mwaitx -mno-clzero -mpku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-movdiri -mno-movdir64b --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=33792 -mtune=skylake-avx512 -dumpbase help-dummy -masm=intel -auxbase-strip ./output.s -g -O3 -version -fdiagnostics-color=always -ffast-math -fopenmp --help=target -o ./output.s

GNU C17 (Compiler-Explorer-Build) version 8.3.0 (x86_64-linux-gnu)

compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4, MPC version 1.0.3, isl version isl-0.18-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072

Compiler returned: 0


2. This is from my VPS
```shell
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)
COLLECT_GCC_OPTIONS='-march=native' '-Q' '--help=target' '-v'
 /opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/cc1 -v help-dummy -march=skylake-avx512 -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mavx512f -mno-avx512er -mavx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mno-xsaves -mavx512dq -mavx512bw -mavx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mclwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-movdiri -mno-movdir64b --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=16384 -mtune=skylake-avx512 -dumpbase help-dummy -auxbase help-dummy -version --help=target -o /tmp/cc3URSRH.s
The following options are target specific:
  -m128bit-long-double                  [enabled]
  -m16                                  [disabled]
  -m32                                  [disabled]
  -m3dnow                               [disabled]
  -m3dnowa                              [disabled]
  -m64                                  [enabled]
  -m80387                               [enabled]
  -m8bit-idiv                           [disabled]
  -m96bit-long-double                   [disabled]
  -mabi=                                sysv
  -mabm                                 [enabled]
  -maccumulate-outgoing-args            [disabled]
  -maddress-mode=                       long
  -madx                                 [enabled]
  -maes                                 [enabled]
  -malign-data=                         compat
  -malign-double                        [disabled]
  -malign-functions=                    0
  -malign-jumps=                        0
  -malign-loops=                        0
  -malign-stringops                     [enabled]
  -mandroid                             [disabled]
  -march=                               skylake-avx512
  -masm=                                att
  -mavx                                 [enabled]
  -mavx2                                [enabled]
  -mavx256-split-unaligned-load         [disabled]
  -mavx256-split-unaligned-store        [disabled]
  -mavx5124fmaps                        [disabled]
  -mavx5124vnniw                        [disabled]
  -mavx512bitalg                        [disabled]
  -mavx512bw                            [enabled]
  -mavx512cd                            [enabled]
  -mavx512dq                            [enabled]
  -mavx512er                            [disabled]
  -mavx512f                             [enabled]
  -mavx512ifma                          [disabled]
  -mavx512pf                            [disabled]
  -mavx512vbmi                          [disabled]
  -mavx512vbmi2                         [disabled]
  -mavx512vl                            [enabled]
  -mavx512vnni                          [disabled]
  -mavx512vpopcntdq                     [disabled]
  -mbionic                              [disabled]
  -mbmi                                 [enabled]
  -mbmi2                                [enabled]
  -mbranch-cost=<0,5>                   3
  -mcall-ms2sysv-xlogues                [disabled]
  -mcet
  -mcet-switch                          [disabled]
  -mcld                                 [disabled]
  -mclflushopt                          [enabled]
  -mclwb                                [enabled]
  -mclzero                              [disabled]
  -mcmodel=                             [default]
  -mcpu=
  -mcrc32                               [disabled]
  -mcx16                                [enabled]
  -mdispatch-scheduler                  [disabled]
  -mdump-tune-features                  [disabled]
  -mf16c                                [enabled]
  -mfancy-math-387                      [enabled]
  -mfentry                              [disabled]
  -mfma                                 [enabled]
  -mfma4                                [disabled]
  -mforce-drap                          [disabled]
  -mforce-indirect-call                 [disabled]
  -mfp-ret-in-387                       [enabled]
  -mfpmath=                             sse
  -mfsgsbase                            [enabled]
  -mfunction-return=                    keep
  -mfused-madd
  -mfxsr                                [enabled]
  -mgeneral-regs-only                   [disabled]
  -mgfni                                [disabled]
  -mglibc                               [enabled]
  -mhard-float                          [enabled]
  -mhle                                 [enabled]
  -miamcu                               [disabled]
  -mieee-fp                             [enabled]
  -mincoming-stack-boundary=            0
  -mindirect-branch-register            [disabled]
  -mindirect-branch=                    keep
  -minline-all-stringops                [disabled]
  -minline-stringops-dynamically        [disabled]
  -mintel-syntax
  -mlarge-data-threshold=<number>       65536
  -mlong-double-128                     [disabled]
  -mlong-double-64                      [disabled]
  -mlong-double-80                      [enabled]
  -mlwp                                 [disabled]
  -mlzcnt                               [enabled]
  -mmemcpy-strategy=
  -mmemset-strategy=
  -mmitigate-rop                        [disabled]
  -mmmx                                 [enabled]
  -mmovbe                               [enabled]
  -mmovdir64b                           [disabled]
  -mmovdiri                             [disabled]
  -mmpx                                 [disabled]
  -mms-bitfields                        [disabled]
  -mmusl                                [disabled]
  -mmwaitx                              [disabled]
  -mno-align-stringops                  [disabled]
  -mno-default                          [disabled]
  -mno-fancy-math-387                   [disabled]
  -mno-push-args                        [disabled]
  -mno-red-zone                         [disabled]
  -mno-sse4                             [disabled]
  -mnop-mcount                          [disabled]
  -momit-leaf-frame-pointer             [disabled]
  -mpc32                                [disabled]
  -mpc64                                [disabled]
  -mpc80                                [disabled]
  -mpclmul                              [enabled]
  -mpcommit                             [disabled]
  -mpconfig                             [disabled]
  -mpku                                 [disabled]
  -mpopcnt                              [enabled]
  -mprefer-avx128
  -mprefer-vector-width=                256
  -mpreferred-stack-boundary=           0
  -mprefetchwt1                         [disabled]
  -mprfchw                              [enabled]
  -mpush-args                           [enabled]
  -mrdpid                               [disabled]
  -mrdrnd                               [enabled]
  -mrdseed                              [enabled]
  -mrecip                               [disabled]
  -mrecip=
  -mrecord-mcount                       [disabled]
  -mred-zone                            [enabled]
  -mregparm=                            6
  -mrtd                                 [disabled]
  -mrtm                                 [enabled]
  -msahf                                [enabled]
  -msgx                                 [disabled]
  -msha                                 [disabled]
  -mshstk                               [disabled]
  -mskip-rax-setup                      [disabled]
  -msoft-float                          [disabled]
  -msse                                 [enabled]
  -msse2                                [enabled]
  -msse2avx                             [disabled]
  -msse3                                [enabled]
  -msse4                                [enabled]
  -msse4.1                              [enabled]
  -msse4.2                              [enabled]
  -msse4a                               [disabled]
  -msse5
  -msseregparm                          [disabled]
  -mssse3                               [enabled]
  -mstack-arg-probe                     [disabled]
  -mstack-protector-guard-offset=
  -mstack-protector-guard-reg=
  -mstack-protector-guard-symbol=
  -mstack-protector-guard=              tls
  -mstackrealign                        [disabled]
  -mstringop-strategy=                  [default]
  -mstv                                 [enabled]
  -mtbm                                 [disabled]
  -mtls-dialect=                        gnu
  -mtls-direct-seg-refs                 [enabled]
  -mtune-ctrl=
  -mtune=                               skylake-avx512
  -muclibc                              [disabled]
  -mvaes                                [disabled]
  -mveclibabi=                          [default]
  -mvect8-ret-in-mem                    [disabled]
  -mvpclmulqdq                          [disabled]
  -mvzeroupper                          [enabled]
  -mwbnoinvd                            [disabled]
  -mx32                                 [disabled]
  -mxop                                 [disabled]
  -mxsave                               [enabled]
  -mxsavec                              [enabled]
  -mxsaveopt                            [enabled]
  -mxsaves                              [disabled]

  Known assembler dialects (for use with the -masm= option):
    att intel

  Known ABIs (for use with the -mabi= option):
    ms sysv

  Known code models (for use with the -mcmodel= option):
    32 kernel large medium small

  Valid arguments to -mfpmath=:
    387 387+sse 387,sse both sse sse+387 sse,387

  Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
    keep thunk thunk-extern thunk-inline

  Known data alignment choices (for use with the -malign-data= option):
    abi cacheline compat

  Known vectorization library ABIs (for use with the -mveclibabi= option):
    acml svml

  Known address mode (for use with the -maddress-mode= option):
    long short

  Known preferred register vector length (to use with the -mprefer-vector-width= option)
    128 256 512 none

  Known stack protector guard (for use with the -mstack-protector-guard= option):
    global tls

  Valid arguments to -mstringop-strategy=:
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop

  Known TLS dialects (for use with the -mtls-dialect= option):
    gnu gnu2

GNU C17 (GCC) version 8.3.1 20190311 (Red Hat 8.3.1-3) (x86_64-redhat-linux)
        compiled by GNU C version 8.3.1 20190311 (Red Hat 8.3.1-3), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version isl-0.16.1-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
COLLECT_GCC_OPTIONS='-march=native' '-Q' '--help=target' '-v'
 /opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/as -v --64 -o /tmp/ccBg1zy2.o /tmp/cc3URSRH.s
GNU assembler version 2.30 (x86_64-redhat-linux) using BFD version version 2.30-54.el7
```

jhjourdan commented 5 years ago

Alright, I found the issue. This should work with the current master. I was simply using bench_fun_f instead of bench_fun_d. This means that the code was computing the logarithm in double, but with float operands/results. This resulted in float<->double conversions, which apparently are not supported if the vector size is 256.
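
To illustrate the kind of mismatch described here, the following is a minimal sketch and not the repository's benchmark code. The names `kernel_d`, `bench_mixed` and `bench_matched` are hypothetical, and `std::log` stands in for the approximate double-precision logarithm. The first loop wraps a double kernel in float arrays, so every element needs a float->double and a double->float conversion; the second loop keeps everything in double.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a double-precision kernel; the real benchmark
// would call the library's approximate logarithm instead of std::log.
inline double kernel_d(double x) { return std::log(x); }

// Mismatched loop: float arrays around a double kernel. Each element is
// widened float->double on the way in and narrowed back on the way out.
void bench_mixed(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<float>(kernel_d(static_cast<double>(in[i])));
}

// Matched loop: double arrays around a double kernel, no conversions.
void bench_matched(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = kernel_d(in[i]);
}
```

Compiling both loops with `-fopt-info` at each vector width should show whether the conversions are what keeps the first one from being vectorized.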

ghost commented 5 years ago

Is it possible to circumvent this problem with some magic like this, for a limited range and little-endian CPUs only?

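// Rounds a double to the nearest int by adding a magic constant.
// Assumes a little-endian CPU and -2^31 <= d < 2^31.
inline int double2int( double d )
{
   union Cast
   {
      double d;
      int i;
   } c;
   // 1.5 * 2^52: after the addition, the integer part of d sits in the
   // low bits of the mantissa.
   const double magic = 1.5*(1LL<<52);
   c.d = d + magic;
   return c.i;   // low 32 bits of the sum on a little-endian CPU
}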

from: https://stackoverflow.com/a/429812
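
For reference, here is a variant of the same trick that copies the bits with `std::memcpy` instead of reading the inactive union member. It is only a sketch under a few assumptions: IEEE-754 doubles, the default round-to-nearest mode, and inputs in the range -2^31 <= d < 2^31. The name `double2int_memcpy` is made up for this example.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit doubles");

// Assumption: IEEE-754 doubles, round-to-nearest, -2^31 <= d < 2^31.
inline std::int32_t double2int_memcpy(double d) {
    const double magic = 6755399441055744.0;  // 1.5 * 2^52
    const double shifted = d + magic;         // integer part lands in the low mantissa bits
    std::uint64_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);  // bit copy, no union read
    // Keep the low 32 bits and reinterpret them as a signed value.
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
}
```

Because the addition rounds to nearest with ties to even, `double2int_memcpy(2.5)` gives 2 and `double2int_memcpy(3.5)` gives 4, which differs from the truncation a plain cast would do.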

jhjourdan commented 5 years ago

What problem are you speaking about?

ghost commented 5 years ago

Sorry, maybe my last reply is irrelevant to this issue.

Because the arithmetic right shift `>>52` is unavoidable and cannot be vectorized without avx512dq instructions, I cannot expect the double version to be accelerated as much as the float32 version on a non-avx512dq CPU, even if the CPU has the avx2 instruction set and 256-bit registers available. Right?

jhjourdan commented 5 years ago

Indeed, I don't see a way to do that. The right shift is not the only issue: conversions between 64-bit integers and doubles are not available either.
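
For readers following along, the two operations in question look roughly like this in scalar form. This is a sketch for illustration, not the code in simd_math_prims.h; the function names are hypothetical, and the input is assumed to be a positive, normal IEEE-754 double.

```cpp
#include <cstdint>
#include <cstring>

// Unbiased binary exponent of a positive, normal double (assumption:
// IEEE-754 binary64 layout with a 52-bit mantissa and bias 1023).
inline std::int64_t exponent_bits(double x) {
    std::int64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return (bits >> 52) - 1023;  // the 64-bit right shift discussed above
}

// Same exponent, converted back to floating point for further arithmetic.
inline double exponent_as_double(double x) {
    // the 64-bit integer -> double conversion discussed above
    return static_cast<double>(exponent_bits(x));
}
```

A vectorized double-precision logarithm needs both steps for every lane, so the compiler has to find vector forms of the shift and of the conversion for the chosen vector width.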