Closed joelandman closed 2 years ago
Added an unrolled Julia version with simplified loops/execution.
First, thank you @joelandman. The question for me is: what do you intend with this code? Is it to be added as separate runs alongside the rest, or to be executed instead of the current implementations?
I would suggest a separate run. I was curious whether I could vectorize this code using AVX2; this is a naive first pass.
Note that for the Julia code, I wanted to see whether simplifying and unrolling it would have an impact. It looks like the effect is noticeable, in part due to the reduced number of loop-test conditionals.
This shouldn't replace the base code; rather, it should sit alongside it.
That is fair enough. To run your code, new entries have to be added to the Earthfile. If you need help implementing that, let me know and I can do it for you. Otherwise, if you want to give it a shot, here are two relevant examples:
Ok, let me play with this, and if I can get it working here, I'll add this to the PR.
Updated Earthfile to include c++-avx2 and julia_ux4.
There doesn't seem to be much difference. The C++ solution seems to be about the same, and the Julia solution seems to be a bit slower than the standard solution.
@joelandman thank you for your contribution! :+1:
Interesting. On my machine (AMD Epyc 7551, Zen 1), the AVX version is about 2x faster and the Julia version about 33% faster. I'll try to replicate this on the Alpine distro (I use Debian 11 on my machine).
> The Julia solution seems to be a bit slower than the standard solution.

The iteration count is too low, so start-up time dominates.
That could be it. I don't know yet when I will get around to implementing #59.
I could do this easily for Julia. It might be harder for C++ (as its startup is minimal). The reason I think the Julia unroll is faster is this:
Original:
```julia
julia> using BenchmarkTools

julia> struct SignVector <: AbstractVector{Float64}
           len::Int
       end

julia> Base.size(s::SignVector) = (s.len,)

julia> Base.getindex(::SignVector, i::Int) = Float64((-1)^iseven(i))

julia> function f(rounds)
           xs = SignVector(rounds + 2)
           pi = 1.0
           @simd for i in 2:(rounds + 2)
               x = xs[i]
               pi += x / (2 * i - 1)
           end
           return pi * 4
       end
f (generic function with 1 method)

julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000

julia> f(rounds)
3.1415926435880532

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):  142.441 ms … 145.558 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     142.733 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   142.835 ms ± 493.724 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃                                                          ▁
  ▆▆▆▆▇██▆▁█▇▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  142 ms          Histogram: frequency by time          146 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.
```
Unrolled:
```julia
julia> using BenchmarkTools

julia> function f(rounds)
           pi = 1.0
           x = -1.0
           r2 = rounds + 2
           vend = r2 - r2 % 4
           @simd for i in 2:4:r2
               pi += x / (2.0 * i - 1.0) -
                     x / (2.0 * i + 1.0) +
                     x / (2.0 * i + 3.0) -
                     x / (2.0 * i + 5.0)
           end
           for i in vend+1:r2
               pi += 1.0 / (2.0 * (i + 0.0) - 1.0)
               x = -x
           end
           return pi * 4
       end
f (generic function with 1 method)

julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000

julia> print(f(rounds))
3.141592703611381

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
 Range (min … max):  88.304 ms … 90.359 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.341 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.379 ms ± 267.732 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▅ ▂    █    ▅      ▂ ▂
  ▅▁▁▅▁▅▁▅███▅▅▅█▁▁▁████▅▁▁███▅█▅███▁▅▅▁█▅▁▁▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▅ ▁
  88.3 ms         Histogram: frequency by time        88.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.
```
The unrolled version is nearly 2x the speed of the plain one. Similarly for the C++ versions:
```
joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz.cpp -o l.x
joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz_avx2.cpp -o l_avx2.x
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l.x
3.1415926635893259
0.26user 0.00system 0:00.26elapsed 100%CPU (0avgtext+0avgdata 1388maxresident)k
1inputs+0outputs (4major+66minor)pagefaults 0swaps
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x
3.1415926635945883
0.11user 0.00system 0:00.11elapsed 98%CPU (0avgtext+0avgdata 1460maxresident)k
1inputs+0outputs (4major+70minor)pagefaults 0swaps
```
This shows the AVX2 version at a little more than 2x the performance of the regular C++ version.
I'm not sure what the differences are, though my machine runs Debian 11 (and therefore glibc), while the C++ test cases run on Alpine and musl. I've spun up an Alpine VM on that machine (using KVM) to see if I can compare, and maybe generate insight. I've heard anecdotally about performance differences, but haven't measured them.
Just moved the code over to my Zen 2 laptop (Ryzen 7 4800H) and got this for the C++:
```
3.1415926635893259
0.12user 0.00system 0:00.12elapsed 99%CPU (0avgtext+0avgdata 1540maxresident)k
0inputs+0outputs (0major+68minor)pagefaults 0swaps
joe@zap:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x
3.1415926635945883
0.03user 0.00system 0:00.03elapsed 94%CPU (0avgtext+0avgdata 1576maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps
```
So it's roughly 4x faster with AVX2 (which makes sense).
And for the Julia:

Original:
```julia
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 108 samples with 1 evaluation.
 Range (min … max):  45.974 ms … 49.822 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.423 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.606 ms ± 542.685 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▂ ▅  █ ▃
  ▃▁▃▃▃▃▃▅▃▆▅██▆█▇█▇█▄▄▃▃▃▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▁▁▁▁▁██▁▁▁▃▃ ▃
  46 ms           Histogram: frequency by time        47.6 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.
```
Unrolled x4:
```julia
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 171 samples with 1 evaluation.
 Range (min … max):  29.277 ms … 29.421 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.353 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.353 ms ± 24.620 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▂   ▃  ▄█ ▅       ▂         ▂
  ▃▁▁▁▁▁▁▁▃▃▄▃▃▁▄▃▁▃█▄▄▃█▄▇▅▇█▇▆██████▆▅█▇▆█▅▇▄▅▄▇▁▁▁▃▁▄▁▃▄▁▄ ▃
  29.3 ms         Histogram: frequency by time        29.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.
```
Which shows the 30% or so better performance I was discussing. Also, the core C++ AVX2 code is about the same performance as the Julia unroll-by-4 code, which is in line with what many of us observe about Julia.
> julia> @benchmark f(rounds)

Yeah, but this is not how this repo does timing. This repo calls `julia leibniz.jl`, which includes 200 ms or so of start-up time plus first-run compile time. And the CI system is very unstable; all of the top languages should be exactly the same.
Use `-march=native -O` to enable.