niklas-heer / speed-comparison

A repo which compares the speed of different programming languages.
https://niklas-heer.github.io/speed-comparison
MIT License

AVX2 variation of c++ #74

Closed. joelandman closed this issue 2 years ago.

joelandman commented 2 years ago

Use -march=native -O to enable.

joelandman commented 2 years ago

Added an unrolled julia with simplified loops/execution.

niklas-heer commented 2 years ago

First, thank you @joelandman. The question for me is what you intend with the program code: is it to be added as separate runs alongside the rest, or to be executed instead of the current implementations?

joelandman commented 2 years ago

I would suggest a separate run. I was curious whether I could vectorize this code using AVX2; this is a naive first pass.

joelandman commented 2 years ago

Note that for the Julia code, I wanted to see if simplifying it and unrolling it would have an impact. It looks like the effect is noticeable, in part due to reducing the number of loop test conditionals.

This shouldn't replace the base code, rather sit alongside it.

niklas-heer commented 2 years ago

That is fair enough. To run your code, new entries have to be added to the Earthfile.

If you need help implementing that, let me know and I can do it for you. Otherwise, if you want to give it a shot, here are two relevant examples:

https://github.com/niklas-heer/speed-comparison/blob/769babc00014ac3456c877a196f7fb53b4140912/Earthfile#L109-L115

https://github.com/niklas-heer/speed-comparison/blob/769babc00014ac3456c877a196f7fb53b4140912/Earthfile#L201-L209
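For readers following along, a new Earthfile target would look roughly like this. This is a hypothetical sketch: the target name, base target, and final run step are assumptions, not the repo's actual Earthfile (see the two linked examples above for the real conventions).

```Earthfile
cpp-avx2:
    # Hypothetical base target providing a compiler toolchain.
    FROM +alpine-build
    COPY ./src/leibniz_avx2.cpp ./
    RUN g++ -O3 -march=native leibniz_avx2.cpp -o leibniz_avx2
    # The repo's real targets invoke its benchmark harness at this point.
    RUN ./leibniz_avx2
```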

joelandman commented 2 years ago

Ok, let me play with this, and if I can get it working here, I'll add this to the PR.

joelandman commented 2 years ago

Updated the Earthfile to include c++-avx2 and julia_ux4.

niklas-heer commented 2 years ago

There doesn't seem to be much of a difference. The C++ solution seems to be about the same, and the Julia solution seems a bit slower than the standard one.

[combined_results chart]

niklas-heer commented 2 years ago

@joelandman thank you for your contribution! :+1:

joelandman commented 2 years ago

Interesting. On my machine (AMD Epyc 7551, Zen 1), the AVX2 version is about 2x faster, and the unrolled Julia about 33% faster. I'll try to replicate this on the Alpine distro (I use Debian 11 on my machine).

Moelf commented 2 years ago

The Julia solution seems to be a bit slower than the standard solution.

The iteration count is too low, so startup time dominates.

niklas-heer commented 2 years ago

The Julia solution seems to be a bit slower than the standard solution.

The iteration count is too low, so startup time dominates.

That could be it. I don't know yet when I will get around to implementing #59.

joelandman commented 2 years ago

I could do this easily for Julia. It might be harder for C++ (as its startup is minimal). The reason I think the Julia unroll is faster is this:

Original:

julia> using BenchmarkTools
julia> struct SignVector <: AbstractVector{Float64}
           len::Int
       end
julia> Base.size(s::SignVector) = (s.len,)
julia> Base.getindex(::SignVector, i::Int) = Float64((-1)^iseven(i))
julia> function f(rounds)
           xs = SignVector(rounds + 2)
           pi = 1.0
           @simd for i in 2:(rounds + 2)
               x = xs[i]
               pi += x / (2 * i - 1)
           end
           return pi*4
       end
f (generic function with 1 method)
julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000
julia> f(rounds)
3.1415926435880532
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):  142.441 ms … 145.558 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     142.733 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   142.835 ms ± 493.724 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       █▃  ▁                                                     
  ▆▆▆▆▇██▆▁█▇▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  142 ms           Histogram: frequency by time          146 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

Unrolled:

julia> using BenchmarkTools
julia> function f(rounds)
           pi = 1.0
           x  = -1.0
           r2 = rounds + 2
           vend = r2 - r2 % 4
           @simd for i in 2:4:r2
               pi +=   x / (2.0 * i -  1.0) - 
                       x / (2.0 * i +  1.0) + 
                       x / (2.0 * i +  3.0) - 
                       x / (2.0 * i +  5.0) 
           end
           for i in vend+1:r2
                  pi += 1.0 / (2.0 * (i + 0.0) - 1.0)
               x = -x
           end
           return pi*4
       end
f (generic function with 1 method)
julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000
julia> print(f(rounds))
3.141592703611381
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
 Range (min … max):  88.304 ms …  90.359 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.341 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.379 ms ± 267.732 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅     ▂   █ ▅                                         
  ▅▁▁▅▁▅▁▅███▅▅▅█▁▁▁████▅▁▁███▅█▅███▁▅▅▁█▅▁▁▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▅ ▁
  88.3 ms         Histogram: frequency by time         88.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

The unrolled version is nearly 2x the speed of the plain one. Similarly for the C++ versions:

joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz.cpp -o l.x
joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz_avx2.cpp -o l_avx2.x
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l.x
3.1415926635893259
0.26user 0.00system 0:00.26elapsed 100%CPU (0avgtext+0avgdata 1388maxresident)k
1inputs+0outputs (4major+66minor)pagefaults 0swaps
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x
3.1415926635945883
0.11user 0.00system 0:00.11elapsed 98%CPU (0avgtext+0avgdata 1460maxresident)k
1inputs+0outputs (4major+70minor)pagefaults 0swaps

showing that the avx2 version was a little more than 2x the performance of the regular cpp version.

I'm not sure what the differences are. My machine runs Debian 11 (and therefore glibc), while the C++ test cases run on Alpine and musl. I've spun up an Alpine VM on that machine (using KVM) to see if I can compare and maybe generate some insight. I've heard anecdotal reports of performance differences between the two, but haven't measured them.

joelandman commented 2 years ago

Just moved the code over to my Zen 2 laptop (Ryzen 7 4800H) and got this for the C++:

3.1415926635893259
0.12user 0.00system 0:00.12elapsed 99%CPU (0avgtext+0avgdata 1540maxresident)k
0inputs+0outputs (0major+68minor)pagefaults 0swaps
joe@zap:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x 
3.1415926635945883
0.03user 0.00system 0:00.03elapsed 94%CPU (0avgtext+0avgdata 1576maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps

So it's roughly 4x faster with AVX2 (which makes sense).

And for the Julia:

Original

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 108 samples with 1 evaluation.
 Range (min … max):  45.974 ms …  49.822 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.423 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.606 ms ± 542.685 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▂ ▅ █                                     ▃       
  ▃▁▃▃▃▃▃▅▃▆▅██▆█▇█▇█▄▄▃▃▃▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▁▁▁▁▁██▁▁▁▃▃ ▃
  46 ms           Histogram: frequency by time         47.6 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

Unroll X4

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 171 samples with 1 evaluation.
 Range (min … max):  29.277 ms … 29.421 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.353 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.353 ms ± 24.620 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂           ▃ ▄█ ▅  ▂  ▂                   
  ▃▁▁▁▁▁▁▁▃▃▄▃▃▁▄▃▁▃█▄▄▃█▄▇▅▇█▇▆██████▆▅█▇▆█▅▇▄▅▄▇▁▁▁▃▁▄▁▃▄▁▄ ▃
  29.3 ms         Histogram: frequency by time        29.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

This shows the 30% or so better performance I was discussing. Also, the core C++ AVX2 code is about the same speed as the Julia unroll-by-4 code, which is in line with what many of us observe about Julia.

Moelf commented 2 years ago

julia> @benchmark f(rounds)

Yeah, but this is not how this repo does its timing. This repo calls

julia leibniz.jl

which includes roughly 200 ms of startup time plus first-run compile time. And the CI system is very noisy; all of the top languages should come out effectively identical.