omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Add support for Polyester's `@batch` #148

Closed svretina closed 5 months ago

svretina commented 7 months ago

How difficult would it be to add support for Polyester's @batch? It is generally faster than Base.Threads.@threads and is usually non-allocating, whereas @threads incurs some allocations.

luraess commented 7 months ago

Thanks for reaching out. I will look into it, as I do not have much insight into Polyester.jl.

svretina commented 7 months ago

I don't know if it helps, but in everything I have used so far, replacing @threads with @batch just worked, and the result is usually non-allocating.

svretina commented 7 months ago

I forked the repo and I added

using Polyester

before the function add_loop in the file /src/ParallelKernel/parallel.jl. I replaced @threads with @batch and it compiled just fine. I used the acoustic2D.jl miniapp to test the code and it ran without problems. I also got a slight performance improvement, from T_eff = 0.39 GB/s to T_eff = 0.41 GB/s.

The @threads version of the miniapp allocates 6 times, for a total of 688 bytes. The @batch version does not allocate at all.
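For anyone who wants to reproduce this kind of allocation count, here is a minimal sketch using only Base. The kernel `axpy_threads!` is a hypothetical stand-in and is not taken from ParallelStencil or the miniapp; the allocation figure it prints will depend on the Julia version and thread count, so no expected value is given.

```julia
using Base.Threads

# Hypothetical kernel (not from ParallelStencil): z = a*x + y, element by element.
function axpy_threads!(z, x, y, a)
    @threads for i in eachindex(z, x, y)
        @inbounds z[i] = a * x[i] + y[i]
    end
    return nothing
end

x, y = rand(10^6), rand(10^6)
z = similar(x)

axpy_threads!(z, x, y, 2.0)                      # warm-up call so compilation is not counted
bytes = @allocated axpy_threads!(z, x, y, 2.0)   # bytes allocated by one threaded call
println("@threads version allocated ", bytes, " bytes")

# Assumption: with Polyester.jl installed, the swap discussed in this thread
# amounts to replacing the loop macro, i.e.
#     using Polyester
#     @batch for i in eachindex(z, x, y) ... end
```

Measuring the `@batch` variant the same way (warm-up call first, then `@allocated`) should give a like-for-like comparison.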

I pushed my changes to the forked repo. I know my implementation is not elegant; I just wanted to see if it is as simple as replacing the macro call. I hope this is easy to implement and of interest to your project.

Edit: I forgot to mention that I was using 40 threads.

Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 40 default, 0 interactive, 20 GC (on 48 virtual cores)

albert-de-montserrat commented 7 months ago

Out of curiosity, how much does @time using ParallelStencil and the compile time of the @parallel kernels increase with Polyester.jl?

svretina commented 7 months ago

> Out of curiosity, how much does @time using ParallelStencil and the compile time of the @parallel kernels increase with Polyester.jl?

I will post some timings for @time using ParallelStencil later, but I don't know how to measure the second one.

svretina commented 7 months ago

bump?

omlins commented 7 months ago

Hi @svretina: thanks for the proposal! It definitely makes sense to add support for Polyester, either as a replacement for threads or as an extension.

> > Out of curiosity, how much does @time using ParallelStencil and the compile time of the @parallel kernels increase with Polyester.jl?
>
> I will post some timings later about the @time using ParallelStencil, but I don't know how to measure the 2nd one.

What were your results?

If it increases the time non-negligibly, then we can provide Polyester support as an extension.

svretina commented 7 months ago

So for my naive implementation on the above-mentioned machine I get:

(@v1.10) pkg> activate ../..
  Activating project at `~/ParallelStencil.jl`

In [2]: @time using ParallelStencil
  0.193122 seconds (110.22 k allocations: 8.498 MiB, 13.41% compilation time)

and without the addition of Polyester.jl I get:

(@v1.10) pkg> activate ../..
  Activating project at `~/ParallelStencil.jl`

In [2]: @time using ParallelStencil
  0.044522 seconds (18.59 k allocations: 1.644 MiB, 27.85% compilation time)

luraess commented 6 months ago

Did you wipe the precompiled files and related folders in between? The two timings look very different. Maybe one would need to create a temporary project and time it there from scratch to get a reliable timing (unless that's already what you did, in which case Polyester significantly increases the compile time).

svretina commented 6 months ago

So once I change the files, should I report the initial timing, or should I let it compile once, exit the Julia REPL, open it again and then time it? Excuse my ignorance around these things.

The timings I reported were done as I described above.

luraess commented 6 months ago

Thanks for reporting. I guess there is no well-defined way to really measure the precompilation time unless one starts from a blank project and makes sure the Julia precompilation cache is empty.

I guess one way to enforce things would be to

  1. Temporarily set a depot path (JULIA_DEPOT_PATH) that differs from the default .julia folder, so that no previously precompiled files are found.
  2. Then disable automatic precompilation (JULIA_PKG_PRECOMPILE_AUTO=0, see the Pkg documentation) to avoid triggering precompilation when adding the package.
  3. Then manually trigger precompilation and time it (using Pkg; @time Pkg.precompile() or similar).

This could be repeated for the two versions of ParallelStencil, making sure to fully wipe the folder used as the depot in between.
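The steps above can be sketched in Julia as follows. This is only a sketch under stated assumptions: the precompile cache lives in the Julia depot (`~/.julia` by default), which `JULIA_DEPOT_PATH` controls, so the command points that variable at an empty temporary directory. The function name `fresh_precompile_cmd` and the path `path/to/ParallelStencil.jl` are hypothetical. On some Julia versions an empty depot may also force standard libraries to be precompiled again, which would inflate the timing.

```julia
# Build a command that precompiles a project inside a throwaway depot.
# `project_dir` is a hypothetical path to a clone of the package under test.
function fresh_precompile_cmd(project_dir::AbstractString, depot::AbstractString)
    # Install dependencies first, then time only the explicit precompile step.
    code = "using Pkg; Pkg.instantiate(); @time Pkg.precompile()"
    cmd = `$(Base.julia_cmd()) --project=$project_dir -e $code`
    # Empty depot: no cached precompile files are reused.
    # Auto-precompile off: the only precompilation is the one being timed.
    return addenv(cmd,
                  "JULIA_DEPOT_PATH" => depot,
                  "JULIA_PKG_PRECOMPILE_AUTO" => "0")
end

cmd = fresh_precompile_cmd("path/to/ParallelStencil.jl", mktempdir())
# run(cmd)   # uncomment to execute; needs network access to download dependencies
```

Running the same command once per version of the package, each time with a new `mktempdir()`, keeps the two measurements from sharing any cache.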

albert-de-montserrat commented 6 months ago

> So for my naive implementation and the above mentioned machine I get:
>
> (@v1.10) pkg> activate ../..
>   Activating project at `~/ParallelStencil.jl`
>
> In [2]: @time using ParallelStencil
>   0.193122 seconds (110.22 k allocations: 8.498 MiB, 13.41% compilation time)
>
> and without the addition of Polyester.jl I get:
>
> (@v1.10) pkg> activate ../..
>   Activating project at `~/ParallelStencil.jl`
>
> In [2]: @time using ParallelStencil
>   0.044522 seconds (18.59 k allocations: 1.644 MiB, 27.85% compilation time)

Thanks! I guess the overhead of Polyester is acceptable, especially as an extension. Could you also measure the time to run a @parallel kernel for the first time? For example, in a fresh REPL session (one for standard ParallelStencil, and one for the Polyester version):

@parallel function axpy!(z, x, y, a)
   @all(z) = @all(x) * a + @all(y)
   return
end

n = 64
z, x, y = rand(n, n), rand(n, n), rand(n, n)
a = rand()

@time @parallel axpy!(z, x, y, a)

svretina commented 6 months ago

@luraess @albert-de-montserrat thanks for the input. I will try to do those things

svretina commented 6 months ago

So I cloned the forked repo into a fresh folder and did the following (note that I am running the tests on a head node, so timings may vary due to load from other users):

(@v1.10) pkg> st
Status `~/.julia/environments/v1.10/Project.toml`
  [c7e460c6] ArgParse v1.1.5
  [6e4b80f9] BenchmarkTools v1.5.0
⌃ [f67ccb44] HDF5 v0.17.1
  [4d7a3746] ImplicitGlobalGrid v0.15.0
  [da04e1cc] MPI v0.20.19
  [3da0fdf6] MPIPreferences v0.1.10
  [5fb14364] OhMyREPL v0.5.24
  [e4faabce] PProf v3.1.0
⌃ [91a5bcdd] Plots v1.40.2
⌃ [f517fe37] Polyester v0.7.9
  [efd6af41] ProfileCanvas v0.1.6
  [132c30aa] ProfileSVG v0.2.2
  [c46f51b8] ProfileView v1.7.2
  [295af30f] Revise v3.5.14
  [a8a75453] StatProfilerHTML v1.6.0
  [90137ffa] StaticArrays v1.9.3
  [ddb6d928] YAML v0.4.9
Info Packages marked with ⌃ have new versions available and may be upgradable.

(@v1.10) pkg> up
    Updating registry at `~/.julia/registries/General.toml`
   Installed Pango_jll ───────── v1.52.2+0
   Installed OpenSSL ─────────── v1.4.3
   Installed HDF5_jll ────────── v1.14.3+3
   Installed DataStructures ──── v0.18.20
   Installed MPICH_jll ───────── v4.2.1+0
   Installed HTTP ────────────── v1.10.6
   Installed StableRNGs ──────── v1.0.2
   Installed Plots ───────────── v1.40.4
   Installed Latexify ────────── v0.16.3
   Installed MPItrampoline_jll ─ v5.3.3+0
   Installed HDF5 ────────────── v0.17.2
  Downloaded artifact: Pango
  Downloaded artifact: HDF5
    Updating `~/.julia/environments/v1.10/Project.toml`
  [f67ccb44] ↑ HDF5 v0.17.1 ⇒ v0.17.2
  [91a5bcdd] ↑ Plots v1.40.2 ⇒ v1.40.4
  [f517fe37] ↑ Polyester v0.7.9 ⇒ v0.7.13
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [4fba245c] ↑ ArrayInterface v7.9.0 ⇒ v7.10.0
  [3da002f7] ↑ ColorTypes v0.11.4 ⇒ v0.11.5
  [864edb3b] ↑ DataStructures v0.18.18 ⇒ v0.18.20
  [f67ccb44] ↑ HDF5 v0.17.1 ⇒ v0.17.2
  [cd3eb016] ↑ HTTP v1.10.5 ⇒ v1.10.6
  [23fbe1c1] ↑ Latexify v0.16.2 ⇒ v0.16.3
  [e1d29d7a] ↑ Missings v1.1.0 ⇒ v1.2.0
  [4d8831e6] ↑ OpenSSL v1.4.2 ⇒ v1.4.3
  [91a5bcdd] ↑ Plots v1.40.2 ⇒ v1.40.4
  [f517fe37] ↑ Polyester v0.7.9 ⇒ v0.7.13
  [860ef19b] ↑ StableRNGs v1.0.1 ⇒ v1.0.2
  [7792a7ef] ↑ StrideArraysCore v0.5.2 ⇒ v0.5.6
  [0234f1f7] ↑ HDF5_jll v1.14.3+2 ⇒ v1.14.3+3
  [7cb0a576] ↑ MPICH_jll v4.2.0+0 ⇒ v4.2.1+0
  [f1f71cc9] ↑ MPItrampoline_jll v5.3.2+0 ⇒ v5.3.3+0
  [36c8627f] ↑ Pango_jll v1.52.1+0 ⇒ v1.52.2+0
Precompiling project...
  ✗ GtkObservables
  ✗ ProfileView
  34 dependencies successfully precompiled in 96 seconds. 208 already precompiled.
  2 dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`

(@v1.10) pkg> activate .
  Activating project at `~/MyParallelStencil.jl`

In [8]: using Pkg

In [9]: @time Pkg.precompile("ParallelStencil")
    Updating registry at `~/.julia/registries/General.toml`
    Updating `~/MyParallelStencil.jl/Project.toml`
  [d35fcfd7] + CellArrays v0.2.1
  [1914dd2f] + MacroTools v0.5.13
  [f517fe37] + Polyester v0.7.13
  [90137ffa] + StaticArrays v1.9.3
    Updating `~/MyParallelStencil.jl/Manifest.toml`
  [79e6a3ab] + Adapt v4.0.4
  [4fba245c] + ArrayInterface v7.10.0
  [62783981] + BitTwiddlingConvenienceFunctions v0.1.5
  [2a0fbf3d] + CPUSummary v0.2.4
  [d35fcfd7] + CellArrays v0.2.1
  [fb6a15b2] + CloseOpenIntervals v0.1.12
  [34da2185] + Compat v4.14.0
  [adafc99b] + CpuId v0.3.1
  [615f187c] + IfElse v0.1.1
  [10f19ff3] + LayoutPointers v0.1.15
  [1914dd2f] + MacroTools v0.5.13
  [d125e4d3] + ManualMemory v0.1.8
  [f517fe37] + Polyester v0.7.13
  [1d0040c9] + PolyesterWeave v0.2.1
  [aea7be01] + PrecompileTools v1.2.1
  [21216c6a] + Preferences v1.4.3
  [ae029012] + Requires v1.3.0
  [94e857df] + SIMDTypes v0.1.0
  [aedffcd0] + Static v0.8.10
  [0d7ed370] + StaticArrayInterface v1.5.0
  [90137ffa] + StaticArrays v1.9.3
  [1e83bf80] + StaticArraysCore v1.4.2
  [7792a7ef] + StrideArraysCore v0.5.6
  [8290d209] + ThreadingUtilities v0.5.2
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [d6f4376e] + Markdown
  [de0858da] + Printf
  [9a3f8284] + Random
  [ea8e919c] + SHA v0.7.0
  [9e88b42a] + Serialization
  [2f01184e] + SparseArrays v1.10.0
  [4607b0f0] + SuiteSparse
  [fa267f1f] + TOML v1.0.3
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll v1.1.0+0
  [4536629a] + OpenBLAS_jll v0.3.23+4
  [bea87d4a] + SuiteSparse_jll v7.2.1+1
  [8e850b90] + libblastrampoline_jll v5.8.0+1
Precompiling ParallelStencil
  1 dependency successfully precompiled in 3 seconds. 31 already precompiled.
  3.604517 seconds (557.00 k allocations: 60.079 MiB, 0.75% gc time, 1.75% compilation time)
@time Pkg.precompile("ParallelStencil")
  0.142622 seconds (42.95 k allocations: 6.031 MiB)

Then I cloned ParallelStencil.jl into a fresh folder:

(@v1.10) pkg> activate .
  Activating project at `~/ParallelStencil.jl`

In [2]: using Pkg

In [3]: @time Pkg.precompile("ParallelStencil")

    Updating registry at `~/.julia/registries/General.toml`
    Updating `~/ParallelStencil.jl/Project.toml`
  [d35fcfd7] + CellArrays v0.2.1
  [1914dd2f] + MacroTools v0.5.13
  [90137ffa] + StaticArrays v1.9.3
    Updating `~/ParallelStencil.jl/Manifest.toml`
  [79e6a3ab] + Adapt v4.0.4
  [d35fcfd7] + CellArrays v0.2.1
  [1914dd2f] + MacroTools v0.5.13
  [aea7be01] + PrecompileTools v1.2.1
  [21216c6a] + Preferences v1.4.3
  [ae029012] + Requires v1.3.0
  [90137ffa] + StaticArrays v1.9.3
  [1e83bf80] + StaticArraysCore v1.4.2
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [d6f4376e] + Markdown
  [de0858da] + Printf
  [9a3f8284] + Random
  [ea8e919c] + SHA v0.7.0
  [fa267f1f] + TOML v1.0.3
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll v1.1.0+0
  [4536629a] + OpenBLAS_jll v0.3.23+4
  [8e850b90] + libblastrampoline_jll v5.8.0+1
Precompiling ParallelStencil
  1 dependency successfully precompiled in 3 seconds. 9 already precompiled.
  7.199189 seconds (4.37 M allocations: 359.373 MiB, 5.64% gc time, 52.31% compilation time: 45% of which was recompilation)

@time Pkg.precompile("ParallelStencil")
  0.158071 seconds (12.55 k allocations: 1.745 MiB)

The precompile output says 3 seconds and @time says 7 seconds, which does not make sense. But I guess both versions took about 3 seconds to precompile.

But I think Polyester was already precompiled. Should I remove Polyester from the Julia environment and rerun these?

svretina commented 6 months ago

For the compilation time of @parallel:

@init_parallel_stencil(Threads, Float64, 3)
@parallel function axpy!(z, x, y, a)
    @all(z) = @all(x) * a + @all(y)
    return
end
ERROR: LoadError: UndefVarError: `Polyester` not defined
in expression starting at /home/svretinaris/MyParallelStencil.jl/src/ParallelKernel/parallel.jl:476

(ParallelStencil) pkg> st
Project ParallelStencil v0.12.0
Status `~/MyParallelStencil.jl/Project.toml`
  [d35fcfd7] CellArrays v0.2.1
  [1914dd2f] MacroTools v0.5.13
  [f517fe37] Polyester v0.7.13
  [90137ffa] StaticArrays v1.9.3
  [9a3f8284] Random

using Polyester

@parallel function axpy!(z, x, y, a)
    @all(z) = @all(x) * a + @all(y)
    return
end
axpy! (generic function with 1 method)

n = 64;

z, x, y = rand(n, n), rand(n, n), rand(n, n);

a = rand();

@time @parallel axpy!(z, x, y, a)
  0.536370 seconds (684.17 k allocations: 45.835 MiB, 2.87% gc time, 99.97% compilation time)

And for the unmodified ParallelStencil.jl:

(@v1.10) pkg> activate .
  Activating project at `~/ParallelStencil.jl`

using ParallelStencil

using ParallelStencil.FiniteDifferences3D

@init_parallel_stencil(Threads, Float64, 3)

@parallel function axpy!(z, x, y, a)
    @all(z) = @all(x) * a + @all(y)
    return
end
axpy! (generic function with 1 method)

n = 64;

z, x, y = rand(n, n), rand(n, n), rand(n, n);

a = rand();

@time @parallel axpy!(z, x, y, a)
  0.135649 seconds (125.64 k allocations: 8.546 MiB, 7.90% gc time, 99.83% compilation time)

I hope I did the timings correctly.

omlins commented 6 months ago

@svretina I'm working on adding an extension for polyester

svretina commented 6 months ago

So the initial timings I reported at the start of the thread (not the compile timings) were wrong.

I reran the acoustic3D.jl miniapp with a slight modification: the function compute_V! is rewritten with @parallel_indices instead of @parallel.

For the Base.Threads version I get:

include("miniapps/acoustic3D.jl")
Total steps=1000, time=3.185e+01 sec (@ T_eff = 33.00 GB/s) 

and for the Polyester.@batch version I get:

include("/home/svretinaris/acoustic3D.jl")
Total steps=1000, time=1.961e+01 sec (@ T_eff = 54.00 GB/s) 

That is a performance increase of roughly 62% (31.85 s vs. 19.61 s), using 40 of the 48 cores on the machine. With 48 cores it goes up to 59 GB/s; the theoretical memory bandwidth of the machine I ran it on is around 70 GB/s.
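The figures in this comment can be checked with a few lines of arithmetic. The sketch below uses only the wall times and T_eff values reported here, together with the approximate 70 GB/s theoretical bandwidth quoted for this machine; the variable names are illustrative.

```julia
# Wall times reported for the 40-thread runs of the acoustic3D miniapp.
t_threads = 31.85   # seconds, Base.Threads version
t_batch   = 19.61   # seconds, Polyester.@batch version

speedup = t_threads / t_batch
println("speedup: ", round(speedup; digits = 2), "x (",
        round(100 * (speedup - 1); digits = 1), "% more throughput)")

# Fraction of the approximate theoretical memory bandwidth reached by each run.
peak = 70.0   # GB/s, approximate figure quoted in the comment
runs = (("Base.Threads, 40 threads", 33.0),
        ("@batch, 40 threads", 54.0),
        ("@batch, 48 threads", 59.0))
for (label, t_eff) in runs
    println(label, ": ", round(100 * t_eff / peak; digits = 1), "% of peak")
end
```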

It would be interesting to report how close one would get to the T_peak limit in the first figure of the README.md.

svretina commented 6 months ago

I could squeeze out a bit more performance and reach 60 GB/s on the same machine by adding

stride=true

to the @batch call. More information about this option can be found in the Polyester.jl documentation.

svretina commented 6 months ago

I finally had time to play around interactively on a compute node (no interference from other users).

The node info:

versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 40 default, 0 interactive, 20 GC (on 80 virtual cores)
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64

According to WikiChip, this processor has a maximum memory bandwidth of 119.21 GB/s.

In [3]: include("/home/svretinaris/acoustic3D.jl")
Total steps=1000, time=9.280e+00 sec (@ T_eff = 113.000 GB/s) 

In [5]: 113/119.21
0.9479070547772839

In [6]: acoustic3D()
Total steps=1000, time=1.004e+01 sec (@ T_eff = 105.000 GB/s) 

I don't know how to calculate T_peak, but this looks very close to optimal.

luraess commented 6 months ago

Indeed - fairly close 🚀

You could measure T_peak with a memcpy-like kernel (saxpy or similar) that reads from one array and writes into another, making sure the memory operation actually happens (for example by adding a scalar).
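A minimal sketch of such a measurement, using only Base threads, could look like this. The names `copy_add!` and `measure_throughput` are hypothetical, and the result is only an estimate: it depends on thread count, thread pinning, and on the arrays being much larger than the caches.

```julia
using Base.Threads

# Copy-like kernel: read `src`, add a scalar so the work cannot be skipped, write `dst`.
function copy_add!(dst, src, s)
    @threads for i in eachindex(dst, src)
        @inbounds dst[i] = src[i] + s
    end
    return dst
end

# Estimate memory throughput in GB/s from the fastest of `nrep` runs.
function measure_throughput(n; nrep = 10)
    src = rand(Float64, n)
    dst = zeros(Float64, n)
    copy_add!(dst, src, 1.0)               # warm-up call so compilation is not timed
    t = Inf
    for _ in 1:nrep
        dt = @elapsed copy_add!(dst, src, 1.0)
        t = min(t, dt)
    end
    bytes = 2 * n * sizeof(Float64)        # one read and one write per element
    return bytes / t / 1e9
end

println("Estimated T_peak: ", measure_throughput(2^22), " GB/s")
```

For a realistic number, `n` should be increased until both arrays are far larger than the last-level cache, and Julia should be started with the same number of threads as the run being compared.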

omlins commented 5 months ago

@svretina : https://github.com/omlins/ParallelStencil.jl/releases/tag/v0.13.0

svretina commented 5 months ago

Amazing! Looking forward to using this!