carstenbauer opened 1 year ago
Yeah, I never came back to these benchmarks after the first run, I'm definitely interested in trying again with your suggestions!
How do these benchmarks look? https://github.com/giordano/julia-on-fugaku/tree/cd5795aa746ec83286dc5d82aefdde50c56f74a3/benchmarks/bandwidthbenchmarkjl :slightly_smiling_face: (These were run on Ookami, it's much quicker to get a node there.) I guess a direct comparison would be https://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/Fujitsu-FX700-S1-M4-C48, except that one doesn't show SDaxpy scaling, which is what we have here instead.
Yes, that looks much better. Also, it seems to give O(700 GB/s), which is reasonable I guess (is it really GB or GiB?).
cc @vchuravy ☝️ Not sure why you only got ~250 GB/s in your webinar live demo. OTOH, looking at the RRZE benchmarks, there seems to be quite some variation between different kernels.
> (is it really GB or GiB?)
Whatever BandwidthBenchmark.jl spits out, which according to the output is MB/s, not MiB/s.
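For what it's worth, the gap between decimal and binary units is about 7% at these magnitudes, so the distinction matters when comparing numbers. A quick sketch (the rate is only illustrative, roughly the ~700 GB/s figure above):

```python
# Decimal (MB, GB) vs. binary (MiB, GiB) units.
MB = 10**6    # 1 megabyte = 1,000,000 bytes
GB = 10**9    # 1 gigabyte = 1,000,000,000 bytes
GiB = 2**30   # 1 gibibyte = 1,073,741,824 bytes

# A reported rate of 700,000 MB/s corresponds to:
rate_mb_s = 700_000
print(rate_mb_s * MB / GB)   # 700.0 (GB/s)
print(rate_mb_s * MB / GiB)  # ~651.9 (GiB/s)
```

So if the tool really reports MB/s, the same measurement is noticeably smaller when expressed in GiB/s.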
> OTOH, looking at the RRZE benchmarks there seems to be quite some variation between different kernels.
Is there a way to choose the kernel for the scaling plots in BandwidthBenchmark.jl? We only have sdaxpy, which is the only one not reported by RRZE, so making a comparison is hard. Also, they group processors by NUMA domains, which I guess avoids the ups and downs.
> Is there a way to choose the kernel for the scaling plots in BandwidthBenchmark.jl?
No, it's currently hard-coded but could easily be extended in this direction. Feel free to open an issue / a PR.
Not sure if you still care about comments but here we go 😄:
For the bandwidth benchmark you have used

```
export JULIA_EXCLUSIVE=1
```

and thus compact pinning for the Julia threads. This explains (most of) the character of the flops plot (and probably also the bandwidth plot). It also almost certainly underestimates the achievable flops/bandwidth for a given number of threads (see https://github.com/JuliaPerf/BandwidthBenchmark.jl#compact-vs-scattered-pinning). Instead, I'd recommend choosing a scattered pinning that respects the NUMA structure. You can use ThreadPinning.jl for this (which you already have in the `Project.toml`), specifically `pinthreads(:scatter; places=:numa)`.
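A minimal sketch of the suggested setup, using the `pinthreads` call quoted above (keyword names may differ between ThreadPinning.jl versions, so treat this as an illustration rather than the exact incantation):

```julia
using ThreadPinning

# Pin Julia threads round-robin across NUMA domains ("scattered" pinning),
# instead of the compact pinning implied by JULIA_EXCLUSIVE=1.
pinthreads(:scatter; places=:numa)

# Inspect the resulting thread-to-core mapping.
threadinfo()
```

Run this at the top of the benchmark script (with `JULIA_EXCLUSIVE` unset), started with the desired number of threads, e.g. `julia -t 48`.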