mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Apache License 2.0
281 stars 15 forks source link

Exponential: Dual Xeon Gold 6154 result #11

Open mratsim opened 5 years ago

mratsim commented 5 years ago

cc @laurae2

Warmup: 0.9938 s, result 224 (displayed to avoid compiler optimizing warmup away)

A - tensor shape: [5000000]
Required number of operations:     5.000 millions
Required bytes:                   20.000 MB
Arithmetic intensity:              0.250 FLOP/byte
Theoretical peak single-core:     86.400 GFLOP/s
Theoretical peak multi:          172.800 GFLOP/s
a[0]: -9.999997138977051

Baseline <math.h>
Collected 300 samples in 5.625 seconds
Average time: 17.733 ms
Stddev  time: 0.054 ms
Min     time: 17.698 ms
Max     time: 18.148 ms
Perf:         0.282 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540005829767324e-05

SSE mathfun
Collected 300 samples in 2.094 seconds
Average time: 6.021 ms
Stddev  time: 0.043 ms
Min     time: 5.976 ms
Max     time: 6.544 ms
Perf:         0.830 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

SSE fast_exp_sse (low order polynomial)
Collected 300 samples in 1.211 seconds
Average time: 3.073 ms
Stddev  time: 0.062 ms
Min     time: 3.019 ms
Max     time: 3.734 ms
Perf:         1.627 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.545032061287202e-05

AVX2 fmath
Collected 300 samples in 1.060 seconds
Average time: 2.558 ms
Stddev  time: 0.067 ms
Min     time: 2.473 ms
Max     time: 3.056 ms
Perf:         1.955 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

AVX2 FMA Minimax
Collected 300 samples in 1.042 seconds
Average time: 2.492 ms
Stddev  time: 0.076 ms
Min     time: 2.383 ms
Max     time: 3.050 ms
Perf:         2.006 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.539992369245738e-05

AVX2 mathfun
Collected 300 samples in 1.307 seconds
Average time: 3.382 ms
Stddev  time: 0.067 ms
Min     time: 3.275 ms
Max     time: 3.906 ms
Perf:         1.478 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

AVX+FMA Schraudolph-approx
Collected 300 samples in 0.933 seconds
Average time: 2.128 ms
Stddev  time: 0.066 ms
Min     time: 2.062 ms
Max     time: 2.607 ms
Perf:         2.350 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.625692963600159e-05

Bench SIMD Math Prims
Collected 300 samples in 4.826 seconds
Average time: 15.079 ms
Stddev  time: 0.189 ms
Min     time: 14.945 ms
Max     time: 15.937 ms
Perf:         0.332 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.539986548479646e-05
c-blake commented 5 years ago

That's pretty good for exp. For log it is usually helpful to take advantage of the fact that most CPUs can usually domain/range reduce for you to just one multiple of two. Specifically, any IEEE float (single, double, probably even half-precision) has an exponent field which tells you the logarithm base 2 up to the mantissa field. So, if yo are willing to assume IEEE float representation then you only actually need a good fast approximation of log_2 on [1/2,1) or [1,2). (log base e of course is just re-scaled log_2).

In the past I've done the mantissa part with a mixture of Taylor series for near 1 and Padé approximants for further away. That got down to about 60 to 17 cycles on Intel core2 series from Nehalem to Skylake, respectively, for a 1e-7 relative error single precision log, anyway.

There may be a faster approach than Taylor/Padé, but it seemed very likely the "IEEE exponent field" trick would be applicable to whatever you do if you're willing to assume IEEE. I also perused your references and didn't see it mentioned. They seem more exp-focused, but log is generally more poorly optimized by various stdlibs/vendors in my experience.

(edit: clarified performance mention in time and accuracy)

mratsim commented 5 years ago

Log focus is coming, I didn't get to that yet.

c-blake commented 5 years ago

Well, when you do get to log, I suggest as a starting point these blog posts from some Ebay guy. {That material didn't exist back when I was figuring this out for myself :-) } :

https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-i-the-basics/

https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-ii-rounding-error/

https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-iii-the-formulas/

He never really mentions Padé approximants, though his rational approximations are similar. He argues for a [3/4, 6/4) interval. (As mentioned, any [t,2t) works - whatever gets you a fast, well approximated "shifted mantissa".) Seems like if you have all this Intel-specific SIMD code, assuming IEEE (even little-endian IEEE) is not much of a stretch.

Particularly with log, there can be unavoidable speed/accuracy trade-offs. So, you may want to have a Nim call that accepts a target precision (and conceivably even returns an error estimate), and then just default it to 23 bits for float32 and 51 bits for float64 and let the call-site use a larger error when faster performance is desired and in-context lower accuracy is acceptable.