omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Multithreaded array initialization #68

Open carstenbauer opened 1 year ago

carstenbauer commented 1 year ago

This PR initializes arrays with multiple threads, for better performance on systems with multiple NUMA domains. See my extensive comment on Discourse.
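A minimal sketch of the first-touch idea behind multithreaded initialization, not the PR's actual code. On Linux, a memory page is typically placed in the NUMA domain of the thread that first writes to it, so filling an array in parallel spreads its pages across domains. The helper name `threaded_zeros` is hypothetical and not part of ParallelStencil.jl, and the benefit assumes the later compute loops split the index range across threads in the same way.

```julia
using Base.Threads

# Hypothetical helper (not part of ParallelStencil.jl): allocate without
# touching the memory, then fill in parallel so that each thread performs
# the first write to the chunk it is assigned.
function threaded_zeros(::Type{T}, dims::Integer...) where {T}
    A = Array{T}(undef, dims...)          # allocation only, no page is written yet
    @threads :static for i in eachindex(A) # static schedule: fixed chunk per thread
        @inbounds A[i] = zero(T)           # first touch happens here
    end
    return A
end

A = threaded_zeros(Float64, 64, 64, 64)
```

By contrast, `zeros(Float64, 64, 64, 64)` writes every element from the calling thread, so all pages may end up in a single NUMA domain. The sketch only pays off when Julia is started with several threads, and pinning those threads keeps them near the memory they touched.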

With this PR, I get about 40% speedup for this example (with USE_GPU=false) when using a full AMD Zen3 CPU (64 cores, 4 NUMA domains) of Noctua 2.

Timings (s) before

╭───────────┬─────────┬─────────┬─────────╮
│ # Threads │       1 │       8 │      64 │
├───────────┼─────────┼─────────┼─────────┤
│   compact │ 12.8708 │ 2.42357 │ 2.43713 │
│    spread │ 12.8708 │ 2.38331 │  3.3897 │
╰───────────┴─────────┴─────────┴─────────╯

Timings (s) after

╭───────────┬─────────┬─────────┬─────────╮
│ # Threads │       1 │       8 │      64 │
├───────────┼─────────┼─────────┼─────────┤
│   compact │ 12.8762 │ 2.41895 │ 1.51899 │
│    spread │ 12.8762 │ 2.35042 │ 2.08579 │
╰───────────┴─────────┴─────────┴─────────╯

Speedup in %

╭───────────┬─────┬─────┬──────╮
│ # Threads │   1 │   8 │   64 │
├───────────┼─────┼─────┼──────┤
│   compact │ 0.0 │ 0.0 │ 38.0 │
│    spread │ 0.0 │ 1.0 │ 38.0 │
╰───────────┴─────┴─────┴──────╯
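The percentages in the last table are consistent with the relative reduction in runtime, `(before - after) / before`, rounded to whole percent. A quick check with the timings copied from the tables above (the variable and function names are only illustrative):

```julia
# Timings in seconds for 1, 8 and 64 threads, copied from the tables above.
before_compact = [12.8708, 2.42357, 2.43713]
after_compact  = [12.8762, 2.41895, 1.51899]
before_spread  = [12.8708, 2.38331, 3.3897]
after_spread   = [12.8762, 2.35042, 2.08579]

# Relative runtime reduction in percent, rounded to the nearest integer.
speedup_pct(before, after) = round.(100 .* (before .- after) ./ before)

compact_pct = speedup_pct(before_compact, after_compact)
spread_pct  = speedup_pct(before_spread, after_spread)
```

With these inputs the 64-thread entries come out at 38 for both pinning strategies, and the 1-thread and 8-thread entries stay within about 1% of zero.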

NOTES:

cc @luraess @omlins

PS: Working on it at the GPU4GEO Hackathon in the Schwarzwald 😉

luraess commented 1 year ago

Thanks for the contribution. I guess having something in PS for the Threads backend to control pinning and thread-to-core mapping (or to provide a close-to-optimal default) would be great! Especially for AMD CPUs with many NUMA regions, where this becomes significant.

carstenbauer commented 1 year ago

BTW, @omlins, depending on how easy/difficult it would be to give me test access to Piz Daint I could run some benchmarks there as well.

omlins commented 1 year ago

@carstenbauer, as Ludovic has probably already told you, Piz Daint does not have any AMD CPUs. Thus, for testing this, Superzack, Ludovic's cluster, will be better.

carstenbauer commented 1 year ago

I quickly tested another example, namely https://github.com/omlins/ParallelStencil.jl/blob/main/miniapps/acoustic3D.jl (with the visualization/animation part commented out). Same configuration as above, i.e. a 64-core node of Noctua 2 with 64 Julia threads that I pinned compactly. Below are the timings of the acoustic3D() function before and with this PR.

# Before PR: 44.315157 seconds (779.52 k allocations: 840.038 MiB, 1.09% gc time)
# With PR: 18.557505 seconds (791.20 k allocations: 840.475 MiB, 2.71% gc time)

This corresponds to about a 2.4x speedup. (cc @luraess)

omlins commented 1 year ago

This also relates to https://github.com/omlins/ParallelStencil.jl/issues/53#issuecomment-1086978245

carstenbauer commented 1 year ago

What's holding back merging this?

ranocha commented 11 months ago

Bump

luraess commented 10 months ago

@omlins bump