carstenbauer opened 1 year ago
Yeah, I never came back to these benchmarks after the first run, I'm definitely interested in trying again with your suggestions!
How do these benchmarks look? https://github.com/giordano/julia-on-fugaku/tree/cd5795aa746ec83286dc5d82aefdde50c56f74a3/benchmarks/bandwidthbenchmarkjl :slightly_smiling_face: (These were run on Ookami, it's much quicker to get a node there.) I guess a direct comparison would be https://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/Fujitsu-FX700-S1-M4-C48, except that one doesn't show SDaxpy scaling, which is what we have here instead.
Yes, that looks much better. Also, it seems to give O(700 GB/s), which is reasonable I guess (is it really GB or GiB?).
cc @vchuravy ☝️ Not sure why you only got ~250 GB/s in your webinar live demo. OTOH, looking at the RRZE benchmarks, there seems to be quite some variation between different kernels.
> (is it really GB or GiB?)
Whatever BandwidthBenchmark.jl spits out, which according to the output is MB/s, not MiB/s.
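For what it's worth, the gap between decimal and binary units is about 7% at these magnitudes, so the distinction matters when comparing numbers. A quick sketch (the rate is only illustrative, roughly the ~700 GB/s figure above):

```python
# Decimal (MB, GB) vs. binary (MiB, GiB) units.
MB = 10**6    # 1 megabyte = 1,000,000 bytes
GB = 10**9    # 1 gigabyte = 1,000,000,000 bytes
GiB = 2**30   # 1 gibibyte = 1,073,741,824 bytes

# A reported rate of 700,000 MB/s corresponds to:
rate_mb_s = 700_000
print(rate_mb_s * MB / GB)   # 700.0 (GB/s)
print(rate_mb_s * MB / GiB)  # ~651.9 (GiB/s)
```

So if the tool really reports MB/s, the same measurement is noticeably smaller when expressed in GiB/s.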
> OTOH, looking at the RRZE benchmarks there seems to be quite some variation between different kernels.
Is there a way to choose the kernel for the scaling plots in BandwidthBenchmark.jl? We only have sdaxpy, which is the only one not reported by RRZE, so making a comparison is hard. Also, they group processors by NUMA domains, which I guess avoids the ups and downs.
> Is there a way to choose the kernel for the scaling plots in BandwidthBenchmark.jl?
No, it's currently hard-coded but could easily be extended in this direction. Feel free to open an issue / a PR.
Not sure if you still care about comments but here we go 😄:
For the bandwidth benchmark you have used

```
export JULIA_EXCLUSIVE=1
```

and thus compact pinning for the Julia threads. This explains (most of) the character of the flops plot (and probably also the bandwidth plot). It also almost certainly underestimates the achievable flops/bandwidth for a given number of threads (see https://github.com/JuliaPerf/BandwidthBenchmark.jl#compact-vs-scattered-pinning). Instead, I'd recommend choosing a scattered pinning that respects the NUMA structure. You can use ThreadPinning.jl for this (which you already have in the `Project.toml`), specifically `pinthreads(:scatter; places=:numa)`.
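A minimal sketch of the suggested setup, using the `pinthreads` call quoted above (keyword names may differ between ThreadPinning.jl versions, so treat this as an illustration rather than the exact incantation):

```julia
using ThreadPinning

# Pin Julia threads round-robin across NUMA domains ("scattered" pinning),
# instead of the compact pinning implied by JULIA_EXCLUSIVE=1.
pinthreads(:scatter; places=:numa)

# Inspect the resulting thread-to-core mapping.
threadinfo()
```

Run this at the top of the benchmark script (with `JULIA_EXCLUSIVE` unset), started with the desired number of threads, e.g. `julia -t 48`.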