Closed svretina closed 5 months ago
Thanks for reaching out. I will look into it, as I do not have much insight into Polyester.jl.
I don't know if it helps, but in everything I have used so far, replacing @threads with @batch just worked, and you can usually get something non-allocating.
I forked the repo and added "using Polyester" before the function add_loop in the file /src/ParallelKernel/parallel.jl. I replaced @threads with @batch and it compiled just fine.
I used the acoustic2D.jl miniapp to test the code and it executed just fine.
I also got a slight performance improvement, from T_eff = 0.39 GB/s to T_eff = 0.41 GB/s.
The @threads version of the miniapp allocates 6 times, for a total of 688 bytes.
The @batch version does not allocate anything.
I pushed my changes to the forked repo. I know my implementation is not elegant; I just wanted to see whether it is as simple as replacing the macro call. I hope this is easy to implement and of interest to your project.
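For readers who want to try the swap outside of ParallelStencil, here is a minimal sketch of the idea, not the code from the fork. The function names axpy_threads!, axpy_batch! and count_allocs are made up for illustration, and it assumes Polyester.jl is installed in the active environment. It only compares the bytes allocated by one call of each loop after a warm-up.

```julia
using Polyester  # assumption: Polyester.jl is available in the active environment

# Same loop body twice; only the threading macro differs.
function axpy_threads!(z, x, y, a)
    Threads.@threads for i in eachindex(z, x, y)
        @inbounds z[i] = x[i] * a + y[i]
    end
    return nothing
end

function axpy_batch!(z, x, y, a)
    @batch for i in eachindex(z, x, y)
        @inbounds z[i] = x[i] * a + y[i]
    end
    return nothing
end

# Hypothetical helper: warm up once so compilation is not counted,
# then report the bytes allocated by a single call.
function count_allocs(f!, z, x, y, a)
    f!(z, x, y, a)
    return @allocated f!(z, x, y, a)
end

n = 64
z, x, y = zeros(n, n), rand(n, n), rand(n, n)
a = rand()

println("@threads: ", count_allocs(axpy_threads!, z, x, y, a), " bytes")
println("@batch:   ", count_allocs(axpy_batch!, z, x, y, a), " bytes")
```

The exact byte counts depend on the Julia version and the number of threads, so treat the printed values as indicative only.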
edit: forgot to mention that I was using 40 threads
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 48 × Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 40 default, 0 interactive, 20 GC (on 48 virtual cores)
Out of curiosity, how much do @time using ParallelStencil and the compile time of the @parallel kernels increase with Polyester.jl?
I will post some timings later for @time using ParallelStencil, but I don't know how to measure the second one.
bump?
Hi @svretina : thanks for the proposition! It definitely makes sense to add support for Polyester either replacing threads or as an extension.
> I will post some timings later about the @time using ParallelStencil, but I don't know how to measure the 2nd one.
What were your results?
If it increases the time non-negligibly, then we can do an extension for Polyester...
So for my naive implementation and the above mentioned machine I get:
(@v1.10) pkg> activate ../..
Activating project at `~/ParallelStencil.jl`
In [2]: @time using ParallelStencil
0.193122 seconds (110.22 k allocations: 8.498 MiB, 13.41% compilation time)
and without the addition of Polyester.jl I get:
(@v1.10) pkg> activate ../..
Activating project at `~/ParallelStencil.jl`
In [2]: @time using ParallelStencil
0.044522 seconds (18.59 k allocations: 1.644 MiB, 27.85% compilation time)
Did you wipe out the precompiled files and related folders in between? It looks kind of really different. Maybe one would need to create a tmp project and time it in there from scratch to get a reliable timing? (unless that's already what you've done and in this case Polyester significantly increases compile time).
So once I change the files, should I report the initial timing, or should I let it compile once, exit the Julia REPL, open it again, and then time it? Excuse my ignorance around these things.
The timings I reported were done as I described above.
Thanks for reporting. I guess there is no well-defined way to really measure the precompilation time unless one starts from a blank project and makes sure the Julia precompiled folder is empty.
I guess one way to enforce things would be to set JULIA_LOAD_PATH to, let's say, something different from the default .julia folder, and then time the precompilation (using Pkg: @time Pkg.precompile or similar). This could be repeated for the two versions of ParallelStencil, making sure to fully wipe out the folder holding the load path.
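One possible way to script such a from-scratch measurement is sketched below. It is an assumption on my side that the intent is to avoid reusing cached compiled files, so the sketch points JULIA_DEPOT_PATH (the variable that controls where Julia stores packages and compiled caches) at an empty temporary folder instead of touching the JULIA_LOAD_PATH mentioned above. The helper name fresh_precompile_cmd and the project path are hypothetical. JULIA_PKG_PRECOMPILE_AUTO=0 is set so that Pkg.instantiate() does not precompile before the timed call.

```julia
# Sketch: build a command that times precompilation of a project in a child
# Julia process using a brand-new, empty depot.
function fresh_precompile_cmd(project_dir::AbstractString;
                              depot::AbstractString = mktempdir())
    script = "using Pkg; Pkg.instantiate(); @time Pkg.precompile()"
    cmd = `$(Base.julia_cmd()) --project=$(project_dir) -e $(script)`
    return addenv(cmd,
                  "JULIA_DEPOT_PATH" => depot,          # empty depot: nothing cached
                  "JULIA_PKG_PRECOMPILE_AUTO" => "0")   # keep instantiate from precompiling
end

cmd = fresh_precompile_cmd("path/to/ParallelStencil.jl")  # hypothetical path
# run(cmd)  # uncomment to execute; a fresh depot has to download the registry and packages first
```

The reported time covers the whole dependency tree, so it would need to be run once per version (with and without Polyester) and compared.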
Thanks! I guess the overhead of Polyester is acceptable, especially as an extension. Could you also measure the time to run a @parallel kernel for the first time? For example, in a fresh REPL session (one for standard ParallelStencil, and one for the version with Polyester):
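using ParallelStencil
using ParallelStencil.FiniteDifferences2D  # provides @all for 2D arrays
@init_parallel_stencil(Threads, Float64, 2)

@parallel function axpy!(z, x, y, a)
    @all(z) = @all(x) * a + @all(y)
    return
end

n = 64
z, x, y = rand(n, n), rand(n, n), rand(n, n)
a = rand()
@time @parallel axpy!(z, x, y, a)  # first call, so the timing includes compilation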
@luraess @albert-de-montserrat thanks for the input. I will try to do those things
So I cloned the forked repo into a fresh folder and did the following (note that I am running the tests on a head node, which means that timings may vary due to use by other users):
(@v1.10) pkg> st
Status `~/.julia/environments/v1.10/Project.toml`
[c7e460c6] ArgParse v1.1.5
[6e4b80f9] BenchmarkTools v1.5.0
⌃ [f67ccb44] HDF5 v0.17.1
[4d7a3746] ImplicitGlobalGrid v0.15.0
[da04e1cc] MPI v0.20.19
[3da0fdf6] MPIPreferences v0.1.10
[5fb14364] OhMyREPL v0.5.24
[e4faabce] PProf v3.1.0
⌃ [91a5bcdd] Plots v1.40.2
⌃ [f517fe37] Polyester v0.7.9
[efd6af41] ProfileCanvas v0.1.6
[132c30aa] ProfileSVG v0.2.2
[c46f51b8] ProfileView v1.7.2
[295af30f] Revise v3.5.14
[a8a75453] StatProfilerHTML v1.6.0
[90137ffa] StaticArrays v1.9.3
[ddb6d928] YAML v0.4.9
Info Packages marked with ⌃ have new versions available and may be upgradable.
(@v1.10) pkg> up
Updating registry at `~/.julia/registries/General.toml`
Installed Pango_jll ───────── v1.52.2+0
Installed OpenSSL ─────────── v1.4.3
Installed HDF5_jll ────────── v1.14.3+3
Installed DataStructures ──── v0.18.20
Installed MPICH_jll ───────── v4.2.1+0
Installed HTTP ────────────── v1.10.6
Installed StableRNGs ──────── v1.0.2
Installed Plots ───────────── v1.40.4
Installed Latexify ────────── v0.16.3
Installed MPItrampoline_jll ─ v5.3.3+0
Installed HDF5 ────────────── v0.17.2
Downloaded artifact: Pango
Downloaded artifact: HDF5
Updating `~/.julia/environments/v1.10/Project.toml`
[f67ccb44] ↑ HDF5 v0.17.1 ⇒ v0.17.2
[91a5bcdd] ↑ Plots v1.40.2 ⇒ v1.40.4
[f517fe37] ↑ Polyester v0.7.9 ⇒ v0.7.13
Updating `~/.julia/environments/v1.10/Manifest.toml`
[4fba245c] ↑ ArrayInterface v7.9.0 ⇒ v7.10.0
[3da002f7] ↑ ColorTypes v0.11.4 ⇒ v0.11.5
[864edb3b] ↑ DataStructures v0.18.18 ⇒ v0.18.20
[f67ccb44] ↑ HDF5 v0.17.1 ⇒ v0.17.2
[cd3eb016] ↑ HTTP v1.10.5 ⇒ v1.10.6
[23fbe1c1] ↑ Latexify v0.16.2 ⇒ v0.16.3
[e1d29d7a] ↑ Missings v1.1.0 ⇒ v1.2.0
[4d8831e6] ↑ OpenSSL v1.4.2 ⇒ v1.4.3
[91a5bcdd] ↑ Plots v1.40.2 ⇒ v1.40.4
[f517fe37] ↑ Polyester v0.7.9 ⇒ v0.7.13
[860ef19b] ↑ StableRNGs v1.0.1 ⇒ v1.0.2
[7792a7ef] ↑ StrideArraysCore v0.5.2 ⇒ v0.5.6
[0234f1f7] ↑ HDF5_jll v1.14.3+2 ⇒ v1.14.3+3
[7cb0a576] ↑ MPICH_jll v4.2.0+0 ⇒ v4.2.1+0
[f1f71cc9] ↑ MPItrampoline_jll v5.3.2+0 ⇒ v5.3.3+0
[36c8627f] ↑ Pango_jll v1.52.1+0 ⇒ v1.52.2+0
Precompiling project...
✗ GtkObservables
✗ ProfileView
34 dependencies successfully precompiled in 96 seconds. 208 already precompiled.
2 dependencies errored.
For a report of the errors see `julia> err`. To retry use `pkg> precompile`
(@v1.10) pkg> activate .
Activating project at `~/MyParallelStencil.jl`
In [8]: using Pkg
In [9]: @time Pkg.precompile("ParallelStencil")
Updating registry at `~/.julia/registries/General.toml`
Updating `~/MyParallelStencil.jl/Project.toml`
[d35fcfd7] + CellArrays v0.2.1
[1914dd2f] + MacroTools v0.5.13
[f517fe37] + Polyester v0.7.13
[90137ffa] + StaticArrays v1.9.3
Updating `~/MyParallelStencil.jl/Manifest.toml`
[79e6a3ab] + Adapt v4.0.4
[4fba245c] + ArrayInterface v7.10.0
[62783981] + BitTwiddlingConvenienceFunctions v0.1.5
[2a0fbf3d] + CPUSummary v0.2.4
[d35fcfd7] + CellArrays v0.2.1
[fb6a15b2] + CloseOpenIntervals v0.1.12
[34da2185] + Compat v4.14.0
[adafc99b] + CpuId v0.3.1
[615f187c] + IfElse v0.1.1
[10f19ff3] + LayoutPointers v0.1.15
[1914dd2f] + MacroTools v0.5.13
[d125e4d3] + ManualMemory v0.1.8
[f517fe37] + Polyester v0.7.13
[1d0040c9] + PolyesterWeave v0.2.1
[aea7be01] + PrecompileTools v1.2.1
[21216c6a] + Preferences v1.4.3
[ae029012] + Requires v1.3.0
[94e857df] + SIMDTypes v0.1.0
[aedffcd0] + Static v0.8.10
[0d7ed370] + StaticArrayInterface v1.5.0
[90137ffa] + StaticArrays v1.9.3
[1e83bf80] + StaticArraysCore v1.4.2
[7792a7ef] + StrideArraysCore v0.5.6
[8290d209] + ThreadingUtilities v0.5.2
[56f22d72] + Artifacts
[2a0f44e3] + Base64
[ade2ca70] + Dates
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[d6f4376e] + Markdown
[de0858da] + Printf
[9a3f8284] + Random
[ea8e919c] + SHA v0.7.0
[9e88b42a] + Serialization
[2f01184e] + SparseArrays v1.10.0
[4607b0f0] + SuiteSparse
[fa267f1f] + TOML v1.0.3
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
[e66e0078] + CompilerSupportLibraries_jll v1.1.0+0
[4536629a] + OpenBLAS_jll v0.3.23+4
[bea87d4a] + SuiteSparse_jll v7.2.1+1
[8e850b90] + libblastrampoline_jll v5.8.0+1
Precompiling ParallelStencil
1 dependency successfully precompiled in 3 seconds. 31 already precompiled.
3.604517 seconds (557.00 k allocations: 60.079 MiB, 0.75% gc time, 1.75% compilation time)
@time Pkg.precompile("ParallelStencil")
0.142622 seconds (42.95 k allocations: 6.031 MiB)
Then I cloned ParallelStencil.jl into a fresh folder:
(@v1.10) pkg> activate .
Activating project at `~/ParallelStencil.jl`
In [2]: using Pkg
In [3]: @time Pkg.precompile("ParallelStencil")
Updating registry at `~/.julia/registries/General.toml`
Updating `~/ParallelStencil.jl/Project.toml`
[d35fcfd7] + CellArrays v0.2.1
[1914dd2f] + MacroTools v0.5.13
[90137ffa] + StaticArrays v1.9.3
Updating `~/ParallelStencil.jl/Manifest.toml`
[79e6a3ab] + Adapt v4.0.4
[d35fcfd7] + CellArrays v0.2.1
[1914dd2f] + MacroTools v0.5.13
[aea7be01] + PrecompileTools v1.2.1
[21216c6a] + Preferences v1.4.3
[ae029012] + Requires v1.3.0
[90137ffa] + StaticArrays v1.9.3
[1e83bf80] + StaticArraysCore v1.4.2
[56f22d72] + Artifacts
[2a0f44e3] + Base64
[ade2ca70] + Dates
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[d6f4376e] + Markdown
[de0858da] + Printf
[9a3f8284] + Random
[ea8e919c] + SHA v0.7.0
[fa267f1f] + TOML v1.0.3
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
[e66e0078] + CompilerSupportLibraries_jll v1.1.0+0
[4536629a] + OpenBLAS_jll v0.3.23+4
[8e850b90] + libblastrampoline_jll v5.8.0+1
Precompiling ParallelStencil
1 dependency successfully precompiled in 3 seconds. 9 already precompiled.
7.199189 seconds (4.37 M allocations: 359.373 MiB, 5.64% gc time, 52.31% compilation time: 45% of which was recompilation)
@time Pkg.precompile("ParallelStencil")
0.158071 seconds (12.55 k allocations: 1.745 MiB)
The precompile output says 3 seconds and @time says 7 seconds, which does not make sense. But I guess both of them took about 3 seconds.
But I think Polyester was already precompiled. Should I remove Polyester from the Julia environment and rerun these?
For the precompilation time of @parallel:
@init_parallel_stencil(Threads, Float64, 3)
@parallel function axpy!(z, x, y, a)
@all(z) = @all(x) * a + @all(y)
return
end
ERROR: LoadError: UndefVarError: `Polyester` not defined
in expression starting at /home/svretinaris/MyParallelStencil.jl/src/ParallelKernel/parallel.jl:476
(ParallelStencil) pkg> st
Project ParallelStencil v0.12.0
Status `~/MyParallelStencil.jl/Project.toml`
[d35fcfd7] CellArrays v0.2.1
[1914dd2f] MacroTools v0.5.13
[f517fe37] Polyester v0.7.13
[90137ffa] StaticArrays v1.9.3
[9a3f8284] Random
using Polyester
@parallel function axpy!(z, x, y, a)
@all(z) = @all(x) * a + @all(y)
return
end
axpy! (generic function with 1 method)
n = 64;
z, x, y = rand(n, n), rand(n, n), rand(n, n);
a = rand();
@time @parallel axpy!(z, x, y, a)
0.536370 seconds (684.17 k allocations: 45.835 MiB, 2.87% gc time, 99.97% compilation time)
and for ParallelStencil.jl:
(@v1.10) pkg> activate .
Activating project at `~/ParallelStencil.jl`
using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float64, 3)
@parallel function axpy!(z, x, y, a)
@all(z) = @all(x) * a + @all(y)
return
end
axpy! (generic function with 1 method)
n = 64;
z, x, y = rand(n, n), rand(n, n), rand(n, n);
a = rand();
@time @parallel axpy!(z, x, y, a)
0.135649 seconds (125.64 k allocations: 8.546 MiB, 7.90% gc time, 99.83% compilation time)
I hope I did the timings correctly.
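As a side note on reading such numbers: the first call of a kernel includes compilation, the second does not, so timing both gives a rough estimate of the compile latency. The sketch below shows this with plain Julia only; axpy_plain! is a hypothetical stand-in for a generated kernel, not ParallelStencil code.

```julia
# Hypothetical stand-in kernel (serial, Base only).
function axpy_plain!(z, x, y, a)
    @inbounds for i in eachindex(z, x, y)
        z[i] = x[i] * a + y[i]
    end
    return nothing
end

n = 64
z, x, y = rand(n, n), rand(n, n), rand(n, n)
a = rand()

t_first  = @elapsed axpy_plain!(z, x, y, a)  # first call: includes JIT compilation
t_second = @elapsed axpy_plain!(z, x, y, a)  # second call: compiled code only

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
println("approx. compile latency: ", t_first - t_second, " s")
```

The same pattern applies to the @time @parallel axpy!(...) calls above when they are run twice in the same fresh session.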
@svretina I'm working on adding an extension for polyester
So the initial timings I reported at the start of the thread (not the compile timings) were wrong.
I reran the acoustic3D.jl miniapp with a slight modification: the function compute_V! is rewritten with @parallel_indices instead of @parallel.
For the Base.Threads version I get
include("miniapps/acoustic3D.jl")
Total steps=1000, time=3.185e+01 sec (@ T_eff = 33.00 GB/s)
and for the Polyester.@batch version I get
include("/home/svretinaris/acoustic3D.jl")
Total steps=1000, time=1.961e+01 sec (@ T_eff = 54.00 GB/s)
That is a 61% performance increase!
This is using 40 out of 48 cores on the machine.
For 48 cores it goes up to 59 GB/s; the theoretical memory bandwidth of the machine I ran it on is around 70 GB/s.
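To put these numbers side by side, the short snippet below just redoes the arithmetic with the values reported above (a speedup of roughly 1.6x, and a bit over 80% of the quoted theoretical bandwidth with 48 cores). The variable names are made up for illustration.

```julia
# Values copied from the runs reported above.
t_threads, teff_threads = 31.85, 33.0   # Base.Threads, 40 threads: seconds, GB/s
t_batch,   teff_batch   = 19.61, 54.0   # Polyester.@batch, 40 threads: seconds, GB/s
teff_batch_48  = 59.0                   # GB/s with 48 cores
bw_theoretical = 70.0                   # GB/s, approximate figure quoted above

speedup_time = t_threads / t_batch            # ratio of run times
speedup_teff = teff_batch / teff_threads      # ratio of effective throughputs
frac_40 = teff_batch / bw_theoretical         # fraction of theoretical bandwidth, 40 threads
frac_48 = teff_batch_48 / bw_theoretical      # same with 48 cores

println("speedup (time):  ", round(speedup_time, digits = 2), "x")
println("speedup (T_eff): ", round(speedup_teff, digits = 2), "x")
println("fraction of theoretical bandwidth, 40 threads: ", round(100 * frac_40, digits = 1), " %")
println("fraction of theoretical bandwidth, 48 cores:   ", round(100 * frac_48, digits = 1), " %")
```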
It would be interesting to report how close one would get to the T_peak limit in the first figure of the README.md.
I could squeeze out a bit more performance and get to 60 GB/s on the same machine by adding stride=true to the @batch. More info about this option can be found here
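For illustration, this is roughly what passing the option looks like on a standalone loop. It is a sketch under the assumption that Polyester.jl is installed, and axpy_stride! is a made-up name; it is not the loop generated inside ParallelStencil.

```julia
using Polyester  # assumption: Polyester.jl is available in the active environment

# Same axpy loop as before, with the stride option passed to @batch.
function axpy_stride!(z, x, y, a)
    @batch stride=true for i in eachindex(z, x, y)
        @inbounds z[i] = x[i] * a + y[i]
    end
    return nothing
end

n = 64
z, x, y = zeros(n, n), rand(n, n), rand(n, n)
a = rand()
axpy_stride!(z, x, y, a)
```

Whether it helps depends on the kernel and the machine, so it is worth benchmarking both variants.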
I finally had time to play around interactively on a node (no interference from other users).
The node info:
versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 80 × Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 40 default, 0 interactive, 20 GC (on 80 virtual cores)
Environment:
LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64
According to WikiChip, this processor has a maximum memory bandwidth of 119.21 GB/s.
In [3]: include("/home/svretinaris/acoustic3D.jl")
Total steps=1000, time=9.280e+00 sec (@ T_eff = 113.000 GB/s)
In [5]: 113/119.21
0.9479070547772839
In [6]: acoustic3D()
Total steps=1000, time=1.004e+01 sec (@ T_eff = 105.000 GB/s)
I don't know how to calculate T_peak, but it looks very close to optimal.
Indeed - fairly close 🚀
You could get T_peak by doing only a memcpy-like operation (saxpy or similar), reading from one array and writing into another, making sure the memory operation actually happens (e.g. by adding a scalar).
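A rough sketch of such a measurement with Base.Threads only is given below. The names copy_add! and measure_copy_bandwidth, the array size and the number of repetitions are all arbitrary choices for illustration; the throughput counts one read and one write per element and uses 1 GB = 1e9 bytes.

```julia
using Base.Threads

# Copy-like kernel: read A, write B. Adding a scalar makes sure the
# memory operation actually happens.
function copy_add!(B, A, s)
    @threads for i in eachindex(A, B)
        @inbounds B[i] = A[i] + s
    end
    return nothing
end

# Returns the best measured throughput in GB/s over `nrep` repetitions.
function measure_copy_bandwidth(n; nrep = 5)
    A = rand(n)
    B = similar(A)
    copy_add!(B, A, 1.0)  # warm-up, so compilation is not timed
    t = minimum([@elapsed(copy_add!(B, A, 1.0)) for _ in 1:nrep])
    bytes = 2 * n * sizeof(eltype(A))  # one read + one write per element
    return bytes / t / 1e9
end

println("measured copy throughput: ", measure_copy_bandwidth(2^22), " GB/s")
```

The arrays should be much larger than the last-level cache for the result to reflect main-memory bandwidth, so the size here may need to be increased on large machines.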
Amazing! Looking forward to using this!
How difficult would it be to add support for Polyester's @batch? It is faster than Base.Threads.@threads and most of the time it is non-allocating as well, in contrast to some allocations of @threads.