Closed ghost closed 5 years ago
If you want double precision, then I assume you want more than 24 bits of relative precision. You will then need a polynomial of higher degree, and the computation will slow down accordingly. Unless you are using vectorization, this might not be worth it.
Anyway, if you want more detail about the use of Sollya, you may want to read the following blog post:
As far as I can see, the code does not use the floor() function but uses some magic numbers:
float exp_cst1 = 2139095040.f; // 255 * 2^23 (integer bit pattern 0x7F800000)
float exp_cst2 = 0.f;
val2 = 12102203.1615614f*val + 1065353216.f;
// 12102203.1615614f == 2^23 / log(2)
// 1065353216.f == 127 * 2^23 (the bit pattern of 1.0f)
I am wondering how I can obtain such numbers for double precision as a substitute for the floor() function.
I want to use the vectorizable exp and log in chemical kinetics, so double precision is needed, as well as slightly higher accuracy than 1e-5, such as 1e-7. I am not sure I can get enough acceleration on CPUs with the AVX512 extension, but it is worth trying.
I installed Sollya and I can understand the floating-point tricks in the blog you mentioned. The polynomial must be built from basis functions that are zero at both ends, so you used the (x-1)(x-2)x**i basis to keep the function continuous.
Besides, for higher-degree polynomial evaluation, Estrin's scheme or another polynomial evaluation method might be faster than Horner's method. But I am not sure whether that is the case for SIMD-vectorized functions.
I've just pushed 8067be0 for you. This commit implements double-precision, more accurate versions of the mathematical functions, in the same fashion as the old ones.
The main pitfall is that the new functions for log and exp, logapprox and expapprox, can only be vectorized on very recent Intel processors featuring the AVX512DQ extension. The reason is that double->int64 casts and 64-bit shifts are only supported in a vectorized fashion by processors supporting this extension. I don't know which processor you are targeting, but I wouldn't bet it supports them.
I still have a little confusion about the magic numbers.
First, for a double number with a bit pattern like
|S|E_10, ... , E_0|M_51, M_50, ... , M_0|
the value r is r = (-1)^S * (1 + M/2^52) * 2^(E-1023).
Sorry for my poor C language. I thought you used some shift technique to get the integer and fractional parts of the number. Actually, you used val4i = (int64_t) val4; to convert double to int64_t. It looks like the floor() function (at least for positive numbers); it is a truncating type cast rather than a reinterpret_cast. And the fractional part is obtained by some shift and mask operations.
So the idea of the implementation of exp is the following: First, you reduce to computing the exponential in base 2 by dividing by log(2). Then, you compute the floor and use the resulting value as the floating-point exponent of the result. Finally, you approximate the exponential of the fractional part with a well-chosen polynomial. The product of the two parts gives the final result.
Now, how should we implement that in practice? We can of course not use the floor function of stdlib, because a library call would prevent vectorization. Casting to int directly would round toward 0, which is not what we want for negative inputs, and, moreover, we would need a subsequent costly int->float conversion to get the fractional part.
Instead, you convert the input value to a well-chosen fixed-point format (encoded in an integer), and use the high-order bits for the integral part of the input value and the low-order bits for the fractional part. Note that (in floats) exponents are biased by 127, so we will need to add 127 to the exponent somehow.
When using the right fixed-point format (i.e., 23 bits for the fractional part), we only need bitwise masking (no shifting) to synthesize the two floats of interest:
So, I hope you now understand the values of the magic constants:
- 12102203.1615614f is 2^23/log(2). Multiplying by this constant both transforms the problem into finding the base-2 exponential and prepares for the fixed-point format.
- 1065353216.f corresponds to adding 127 to the exponent of the first part of the result.
- Casting to int32_t finishes the fixed-point conversion.
Hope this helps...
BTW, will you be able to use my double version of the functions, in the end?
Thank you very much for the details of the implementation.
I see the fixed-point algorithm. You are constructing a binary fixed-point number with a bit pattern like 0bxxxxxx.yyyyyyyyyyyy using some floating-point algebra, and converting it to the int32_t type.
The steps:
1. b as a float is actually an integer, because the M part of the float a is 23 bits.
2. c as a float is still integer-valued; actually, it should be (a+127)*2^23.
3. With the vcvttpd2qq instruction, convert the integer-valued float to int32_t. Now d as an int32_t is an integer.
4. d as a bit pattern is a fixed-point binary number; the integer part of d can be extracted with an & operation.
I still have a problem: how are the upper and lower limits derived?
double exp_cst1_d = 9218868437227405312.;
double exp_cst2_d = 0.;
Anyway, exp_d works, and its accuracy of rtol = 1e-9 is satisfactory even over the whole range [-300, +200] of my parameter. But log_d does not accelerate as well as the single-precision version.
Here is the output:
(base) [root@JD SIMD-math-prims]# make
g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native
(base) [root@JD SIMD-math-prims]# ./test_fun
Sin functions:
--------------
Comparing the behavior of sinapprox against sinf, in the interval [-3.14159, 3.14159]:
Bias: -2.560597e-09
Mean absolute error: 3.676327e-06
RMS error: 4.109845e-06
Min difference: -5.960464e-06
Max difference: 5.938113e-06
Comparing the behavior of sinapprox_d against sin, in the interval [-3.14159, 3.14159]:
Bias: 1.151537e-12
Mean absolute error: 7.186951e-10
RMS error: 8.026634e-10
Min difference: -1.145714e-09
Max difference: 1.145729e-09
Benchmarking sinf... 243.8M/s
Benchmarking sinapprox... 7876.9M/s
Benchmarking sin... 341.3M/s
Benchmarking sinapprox_d... 2694.7M/s
Cos functions:
--------------
Comparing the behavior of cosapprox against cosf, in the interval [-3.14159, 3.14159]:
Bias: 5.895721e-06
Mean absolute error: 2.777401e-05
RMS error: 3.156854e-05
Min difference: -4.571676e-05
Max difference: 4.577637e-05
Comparing the behavior of cosapprox_d against cos, in the interval [-3.14159, 3.14159]:
Bias: 1.367417e-11
Mean absolute error: 7.584514e-11
RMS error: 8.525821e-11
Min difference: -1.227697e-10
Max difference: 1.227682e-10
Benchmarking cosf... 262.6M/s
Benchmarking cosapprox... 9309.1M/s
Benchmarking cos... 379.3M/s
Benchmarking cosapprox_d... 2694.7M/s
Log functions:
--------------
Comparing the behavior of logapprox against logf, in the interval [1e-10, 10]:
Bias: 7.545441e-08
Mean absolute error: 5.554401e-06
RMS error: 6.175767e-06
Min difference: -8.821487e-06
Max difference: 8.940697e-06
Comparing the behavior of icsi_log against logf, in the interval [1e-10, 10]:
Bias: 1.043380e-05
Mean absolute error: 1.756708e-04
RMS error: 2.058682e-04
Min difference: -4.689693e-04
Max difference: 4.639626e-04
Comparing the behavior of logapprox_d against log, in the interval [1e-10, 10]:
Bias: -1.875015e-10
Mean absolute error: 2.895113e-09
RMS error: 3.218112e-09
Min difference: -4.531327e-09
Max difference: 4.531155e-09
Benchmarking logf... 124.9M/s
Benchmarking icsi_log... 731.4M/s
Benchmarking logapprox... 4452.2M/s
Benchmarking log... 123.4M/s
Benchmarking logapprox_d... 307.5M/s
Exp functions:
--------------
Comparing the behavior of expapprox against expf, in the interval [-10, 10]:
Relative bias: -8.916760e-09
Mean relative error: 2.821630e-06
RMS relative error: 3.418452e-06
Min relative difference: -8.690718e-06
Max relative difference: 8.117646e-06
Comparing the behavior of expapprox_d against exp, in the interval [-10, 10]:
Relative bias: -1.935681e-11
Mean relative error: 1.417697e-09
RMS relative error: 1.573945e-09
Min relative difference: -2.218639e-09
Max relative difference: 2.218602e-09
Benchmarking expf... 131.3M/s
Benchmarking expapprox... 3413.3M/s
Benchmarking exp... 204.8M/s
Benchmarking expapprox_d... 1383.8M/s
(base) [root@JD SIMD-math-prims]# uname -a
Linux JD 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
(base) [root@JD SIMD-math-prims]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
Stepping: 4
CPU MHz: 3192.500
BogoMIPS: 6385.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ibrs ibpb fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat spec_ctrl
(base) [root@JD SIMD-math-prims]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)
(base) [root@JD SIMD-math-prims]# gcc -march=native -Q --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mandroid [disabled]
-march= skylake-avx512
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [disabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bitalg [disabled]
-mavx512bw [enabled]
-mavx512cd [enabled]
-mavx512dq [enabled]
-mavx512er [disabled]
-mavx512f [enabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [enabled]
-mavx512vnni [disabled]
-mavx512vpopcntdq [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet
-mcet-switch [disabled]
-mcld [disabled]
-mclflushopt [enabled]
-mclwb [enabled]
-mclzero [disabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [enabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-mintel-syntax
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128
-mprefer-vector-width= 256
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [enabled]
-msahf [enabled]
-msgx [disabled]
-msha [disabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [disabled]
-msse5
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtune-ctrl=
-mtune= skylake-avx512
-muclibc [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwbnoinvd [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [disabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387+sse 387,sse both sse sse+387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option)
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
I also tried to use Estrin's scheme from SLEEF; it does not accelerate things.
I just found that in order to obtain better performance, I need to use the -mavx512f -mavx512dq options in GCC.
With godbolt.org, I confirmed that zmm registers are used with those options.
(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -mavx512f -mavx512dq
(base) [root@JD SIMD-math-prims]# test_fun
-bash: test_fun: command not found
(base) [root@JD SIMD-math-prims]# ./test_fun
Sin functions:
--------------
Comparing the behavior of sinapprox against sinf, in the interval [-3.14159, 3.14159]:
Bias: -2.560597e-09
Mean absolute error: 3.676327e-06
RMS error: 4.109845e-06
Min difference: -5.960464e-06
Max difference: 5.938113e-06
Comparing the behavior of sinapprox_d against sin, in the interval [-3.14159, 3.14159]:
Bias: 1.151537e-12
Mean absolute error: 7.186951e-10
RMS error: 8.026634e-10
Min difference: -1.145714e-09
Max difference: 1.145729e-09
Benchmarking sinf... 243.8M/s
Benchmarking sinapprox... 14628.6M/s
Benchmarking sin... 341.3M/s
Benchmarking sinapprox_d... 4876.2M/s
Cos functions:
--------------
Comparing the behavior of cosapprox against cosf, in the interval [-3.14159, 3.14159]:
Bias: 5.895721e-06
Mean absolute error: 2.777401e-05
RMS error: 3.156854e-05
Min difference: -4.571676e-05
Max difference: 4.577637e-05
Comparing the behavior of cosapprox_d against cos, in the interval [-3.14159, 3.14159]:
Bias: 1.367417e-11
Mean absolute error: 7.584514e-11
RMS error: 8.525821e-11
Min difference: -1.227697e-10
Max difference: 1.227682e-10
Benchmarking cosf... 262.6M/s
Benchmarking cosapprox... 17066.7M/s
Benchmarking cos... 379.3M/s
Benchmarking cosapprox_d... 4876.2M/s
Log functions:
--------------
Comparing the behavior of logapprox against logf, in the interval [1e-10, 10]:
Bias: 7.545441e-08
Mean absolute error: 5.554401e-06
RMS error: 6.175767e-06
Min difference: -8.821487e-06
Max difference: 8.940697e-06
Comparing the behavior of icsi_log against logf, in the interval [1e-10, 10]:
Bias: 1.043380e-05
Mean absolute error: 1.756708e-04
RMS error: 2.058682e-04
Min difference: -4.689693e-04
Max difference: 4.639626e-04
Comparing the behavior of logapprox_d against log, in the interval [1e-10, 10]:
Bias: -1.875015e-10
Mean absolute error: 2.895113e-09
RMS error: 3.218112e-09
Min difference: -4.531327e-09
Max difference: 4.531155e-09
Benchmarking logf... 124.9M/s
Benchmarking icsi_log... 787.7M/s
Benchmarking logapprox... 7876.9M/s
Benchmarking log... 124.9M/s
Benchmarking logapprox_d... 1932.1M/s
Exp functions:
--------------
Comparing the behavior of expapprox against expf, in the interval [-10, 10]:
Relative bias: -8.916760e-09
Mean relative error: 2.821630e-06
RMS relative error: 3.418452e-06
Min relative difference: -8.690718e-06
Max relative difference: 8.117646e-06
Comparing the behavior of expapprox_d against exp, in the interval [-10, 10]:
Relative bias: 2.779902e-09
Mean relative error: 5.930965e-08
RMS relative error: 6.559258e-08
Min relative difference: -9.237607e-08
Max relative difference: 9.237933e-08
Benchmarking expf... 129.6M/s
Benchmarking expapprox... 6023.5M/s
Benchmarking exp... 209.0M/s
Benchmarking expapprox_d... 2560.0M/s
I still have a problem: how are the upper and lower limits derived?
double exp_cst1_d = 9218868437227405312.; double exp_cst2_d = 0.;
These values are chosen to prevent overflow when synthesizing the exponent floating point number.
Anyway, the exp_d works and its accuracy of rtol = 1e-9 is satisfactory even in the whole range [-300, +200] of my parameter.
Good. BTW, if you want an extra burst of performance, you can remove these bounds checks, which actually take quite a bit of time.
I just found that in order to obtain better performance, I need to use -mavx512f -mavx512dq option in GCC.
That's interesting, because your gcc -march=native -Q --help=target output seems to indicate that these flags are already activated...
Anyway, is there anything I can do now, or can I close the issue?
Comparing the behavior of expapprox_d against exp, in the interval [-10, 10]: Relative bias: 2.779902e-09 Mean relative error: 5.930965e-08 RMS relative error: 6.559258e-08 Min relative difference: -9.237607e-08 Max relative difference: 9.237933e-08
Have you changed the implementation of expapprox_d? On my computer, the error seems much better.
I still have a problem: logapprox_d cannot be vectorized with AVX2's 256-bit instructions.
g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=256 -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:126:3: note: loop vectorized
### bench_fun_f(logapprox_d, 1000000L); is at line no. 128. Not vectorized.
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)
# test result
Comparing the behavior of logapprox_d against log, in the interval [1e-10, 10]:
Bias: -1.875015e-10
Mean absolute error: 2.895113e-09
RMS error: 3.218112e-09
Min difference: -4.531327e-09
Max difference: 4.531155e-09
Benchmarking logf... 126.4M/s
Benchmarking icsi_log... 731.4M/s
Benchmarking logapprox... 4452.2M/s
Benchmarking log... 123.4M/s
Benchmarking logapprox_d... 308.4M/s
But I see a lot of ymm registers used in godbolt.
With the -mprefer-vector-width=512 option, gcc 8.3.1-3 works as expected.
(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=512 -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:128:3: note: loop vectorized
test_fun.cpp:126:3: note: loop vectorized
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop with 3 iterations completely unrolled (header execution count 7087537)
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)
I just found that, it seems, even the 256-bit version of the vpsraq instruction comes from AVX512DQ or AVX512VL.
I still have a problem: logapprox_d cannot be vectorized with AVX2's 256-bit instructions.
It seems like godbolt did the vectorization, but the GCC you are using on your machine did not.
Which version of GCC are you using?
(base) [root@JD SIMD-math-prims]# git status
On branch master
Your branch is up to date with 'origin/master'.
(base) [root@JD SIMD-math-prims]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)
(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=512 -mavx512dq -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:128:3: note: loop vectorized // vectorized!
test_fun.cpp:126:3: note: loop vectorized
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop with 3 iterations completely unrolled (header execution count 7087537)
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)
(base) [root@JD SIMD-math-prims]# g++ test_fun.cpp -o test_fun -Wall -W -O3 -std=c++11 -march=native -mprefer-vector-width=256 -mavx512dq -fopt-info
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algobase.h:753:13: note: Loop 2 distributed: split to 0 loops and 1 library calls.
test_fun.cpp:138:3: note: loop vectorized
test_fun.cpp:136:3: note: loop vectorized
test_fun.cpp:126:3: note: loop vectorized //line 128 is not vectorized.
test_fun.cpp:115:3: note: loop vectorized
test_fun.cpp:113:3: note: loop vectorized
test_fun.cpp:105:3: note: loop vectorized
test_fun.cpp:103:3: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:409:42: note: loop vectorized
/opt/rh/devtoolset-8/root/usr/include/c++/8/bits/random.tcc:401:32: note: loop vectorized
test_fun.cpp:90:5: note: loop with 2 iterations completely unrolled (header execution count 7087540)
-v -Q --help=target -march=native output:
1. This is from godbolt using gcc 8.3.0
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mandroid [disabled]
-march= skylake-avx512
-masm= intel
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [disabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bitalg [disabled]
-mavx512bw [enabled]
-mavx512cd [enabled]
-mavx512dq [enabled]
-mavx512er [disabled]
-mavx512f [enabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [enabled]
-mavx512vnni [disabled]
-mavx512vpopcntdq [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet-switch [disabled]
-mcld [disabled]
-mclflushopt [enabled]
-mclwb [enabled]
-mclzero [disabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [enabled]
-miamcu [disabled]
-mieee-fp [disabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-mintel-syntax
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [enabled]
-mpopcnt [enabled]
-mprefer-avx128
-mprefer-vector-width= 256
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [enabled]
-msahf [enabled]
-msgx [disabled]
-msha [disabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [disabled]
-msse5
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtune-ctrl=
-mtune= skylake-avx512
-muclibc [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwbnoinvd [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387+sse 387,sse both sse sse+387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option)
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop
vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
Using built-in specs.
COLLECT_GCC=/opt/compiler-explorer/gcc-8.3.0/bin/gcc
Target: x86_64-linux-gnu
Configured with: ../gcc-8.3.0/configure --prefix=/opt/compiler-explorer/gcc-build/staging --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap --enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran,ada --enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto --enable-plugins --enable-threads=posix --with-pkgversion=Compiler-Explorer-Build
Thread model: posix
gcc version 8.3.0 (Compiler-Explorer-Build)
COLLECT_GCC_OPTIONS='-fdiagnostics-color=always' '-g' '-o' './output.s' '-masm=intel' '-S' '-v' '-O3' '-ffast-math' '-fopenmp' '-march=native' '-Q' '--help=target' '-pthread'
/opt/compiler-explorer/gcc-8.3.0/bin/../libexec/gcc/x86_64-linux-gnu/8.3.0/cc1 -v -imultiarch x86_64-linux-gnu -iprefix /opt/compiler-explorer/gcc-8.3.0/bin/../lib/gcc/x86_64-linux-gnu/8.3.0/ -D_REENTRANT help-dummy -march=skylake-avx512 -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mavx512f -mno-avx512er -mavx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mavx512dq -mavx512bw -mavx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mclwb -mno-mwaitx -mno-clzero -mpku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-movdiri -mno-movdir64b --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=33792 -mtune=skylake-avx512 -dumpbase help-dummy -masm=intel -auxbase-strip ./output.s -g -O3 -version -fdiagnostics-color=always -ffast-math -fopenmp --help=target -o ./output.s
GNU C17 (Compiler-Explorer-Build) version 8.3.0 (x86_64-linux-gnu)
compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4, MPC version 1.0.3, isl version isl-0.18-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler returned: 0
```
2. This is from my VPS:
```shell
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)
COLLECT_GCC_OPTIONS='-march=native' '-Q' '--help=target' '-v'
/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/cc1 -v help-dummy -march=skylake-avx512 -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mavx512f -mno-avx512er -mavx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mno-xsaves -mavx512dq -mavx512bw -mavx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mclwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-movdiri -mno-movdir64b --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=16384 -mtune=skylake-avx512 -dumpbase help-dummy -auxbase help-dummy -version --help=target -o /tmp/cc3URSRH.s
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mandroid [disabled]
-march= skylake-avx512
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [disabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bitalg [disabled]
-mavx512bw [enabled]
-mavx512cd [enabled]
-mavx512dq [enabled]
-mavx512er [disabled]
-mavx512f [enabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [enabled]
-mavx512vnni [disabled]
-mavx512vpopcntdq [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet
-mcet-switch [disabled]
-mcld [disabled]
-mclflushopt [enabled]
-mclwb [enabled]
-mclzero [disabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [enabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-mintel-syntax
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128
-mprefer-vector-width= 256
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [enabled]
-msahf [enabled]
-msgx [disabled]
-msha [disabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [disabled]
-msse5
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtune-ctrl=
-mtune= skylake-avx512
-muclibc [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwbnoinvd [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [disabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387+sse 387,sse both sse sse+387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option)
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
GNU C17 (GCC) version 8.3.1 20190311 (Red Hat 8.3.1-3) (x86_64-redhat-linux)
compiled by GNU C version 8.3.1 20190311 (Red Hat 8.3.1-3), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version isl-0.16.1-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
COLLECT_GCC_OPTIONS='-march=native' '-Q' '--help=target' '-v'
/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/as -v --64 -o /tmp/ccBg1zy2.o /tmp/cc3URSRH.s
GNU assembler version 2.30 (x86_64-redhat-linux) using BFD version version 2.30-54.el7
```
Alright, I found the issue. This should work with the current master. I was simply using `bench_fun_f` instead of `bench_fun_d`. This means that the code was computing the logarithm in `double`, but with `float` operands/results. This resulted in `float`<->`double` conversions, which apparently are not supported if the vector size is 256.
Is it possible to circumvent this problem with some magic like this, for a limited range and little-endian CPUs only?
inline int double2int( double d )
{
    union Cast
    {
        double d;
        int i;   /* low 32 bits of the double on a little-endian target */
    } c;
    /* 1.5 * 2^52: adding this pushes the integer part of d into
       the low mantissa bits of the sum */
    const double magic = 1.5*(1LL<<52);
    c.d = d + magic;
    return c.i;  /* round-to-nearest(d), valid for |d| < 2^31 */
}
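For reference, the same magic-constant trick can be written without the union, which keeps it well-defined in C++ as well (union type-punning is only guaranteed in C). This is a minimal sketch, not taken from the repository; it assumes `|d| < 2^31`, and note that it rounds to nearest-even rather than truncating toward zero:

```c
#include <stdint.h>
#include <string.h>

/* Magic-constant double->int conversion. Adding 1.5*2^52 forces the
   sum into a binade where the ulp is 1, so the rounded integer value
   of d lands in the low mantissa bits; the low 32 bits, read back as
   a signed integer, give round-to-nearest-even(d) for |d| < 2^31. */
static inline int32_t double2int_portable(double d)
{
    const double magic = 1.5 * (double)(1LL << 52); /* 0x1.8p52 */
    double shifted = d + magic;
    int64_t bits;
    memcpy(&bits, &shifted, sizeof bits); /* bit-cast, well-defined */
    return (int32_t)bits;                 /* keep the low 32 bits */
}
```

Because the addition rounds to nearest-even, this is a substitute for `lround`-style rounding, not for `floor()`.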
What problem are you speaking about?
Sorry, maybe my last reply is irrelevant to this issue.
Because the arithmetic right shift operation `>>52` is unavoidable but cannot be vectorized without AVX512DQ instructions, I can never expect this `double` version of the code to be accelerated as much as the `float32` version on a non-AVX512DQ CPU, even if the CPU has the AVX2 instruction set and 256-bit registers available. Right?
Indeed, I don't see a way to do that. The right shift is not the only issue: conversions between 64-bit integers and doubles are not available either.
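To make the missing pieces concrete, here is a scalar sketch of the kind of operation involved: extracting the biased exponent of a double via a 64-bit reinterpretation and a `>>52` shift. The scalar code is trivial, but to my knowledge its vector counterparts (a 64-bit arithmetic shift and `int64`<->`double` conversions such as `_mm256_cvtepi64_pd`) only exist with the AVX512DQ/AVX512VL extensions:

```c
#include <stdint.h>
#include <string.h>

/* Extract the unbiased binary exponent of a finite, nonzero double.
   This is the scalar version of the step that blocks pre-AVX512DQ
   vectorization of the double-precision log/exp approximations. */
static inline int exponent_of(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);           /* bit-cast, well-defined */
    int biased = (int)((bits >> 52) & 0x7FF); /* 11-bit exponent field  */
    return biased - 1023;                     /* remove IEEE-754 bias   */
}
```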
Could you give some hints?
I installed sollya. The output is 156 bits, so it should be accurate enough for double precision. But I am not very familiar with the other parts of the functions in `simd_math_prims.h`.