npadmana / DistributedFFT


Gather some InfiniBand timings #49

Closed ronawho closed 4 years ago

ronawho commented 4 years ago

Spawned off from https://github.com/npadmana/DistributedFFT/issues/48#issuecomment-541276144 -- I haven't run this on our local cluster yet (and I don't have enough nodes to do a real stress test), but it would be interesting to see how the code does under gasnet (for a cray but also for ib).

All our timings so far have been with ugni on Cray-XCs. It'd be interesting to gather some IB results too.

I can do some smaller-scale IB runs: 32 nodes for sure, possibly 64. I would bet performance currently lags on IB by ~2x. That's something we're actively working on. IB hardware has good bulk comm performance, so hopefully IB performance will be on par within the next release or two.

We run gasnet in 2 modes: segment large and segment fast. Segment large dynamically registers memory, so it gets good NUMA affinity, but comm performance suffers. Segment fast statically registers memory at startup, so NUMA affinity will be wrong, but comm will be fast. See the "InfiniBand" and "Performance Portability" sections of https://chapel-lang.org/releaseNotes/1.20/06-perf-opt.pdf. ISx performance will most closely match this application (in that we're doing lots of bulk comm as well as some local computations where NUMA affinity matters). On similar per-node hardware, ISx takes 9s with ugni, 16s with gasnet-ibv segment fast, and 55s with gasnet-ibv segment large.

edit:

More details and a better summary of gn-ibv perf is captured in https://github.com/chapel-lang/chapel/issues/14438

ronawho commented 4 years ago

Here are some runs from a Cray-CS that has similar per-node hardware to the main Cray-XC runs we did (36-core 2.1 GHz Broadwell CPUs with 128 GB RAM). My runs are with Chapel 1.20, intel 19.0.3.199, and cray-fftw 3.3.8.4 (so similar software as well):

Size D:

| nodes | ugni | gn-ib-lrg | gn-ib-fast |
|---|---|---|---|
| 8 | 32.63s | 148.76s | 37.93s |
| 16 | 16.99s | 83.42s | 20.10s |
| 32 | 10.40s | 24.69s | 10.69s |

Size E:

| nodes | ugni | gn-ib-lrg | gn-ib-fast |
|---|---|---|---|
| 16 | 162.79s | 526.58s | 204.00s |
| 32 | 88.04s | 295.76s | 103.86s |

gn-ib-lrg suffers, but gn-ib-fast performance is relatively close to ugni, which means that comm performance is more important than NUMA affinity for this benchmark (which we already knew).

(This matches what I was expecting and roughly tracks ISx ratios at 16 nodes.)
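To make the "relatively close to ugni" and "roughly tracks ISx ratios" remarks concrete, here is a minimal sketch that turns the timings above into slowdown factors relative to ugni. The numbers are copied from the tables and from the ISx figures quoted earlier in the thread; the function and variable names are made up for illustration.

```python
# Runtimes in seconds, copied from the Size D and Size E tables above.
# Each entry maps nodes -> (ugni, gn-ib-lrg, gn-ib-fast).
SIZE_D = {
    8: (32.63, 148.76, 37.93),
    16: (16.99, 83.42, 20.10),
    32: (10.40, 24.69, 10.69),
}
SIZE_E = {
    16: (162.79, 526.58, 204.00),
    32: (88.04, 295.76, 103.86),
}
# ISx runtimes quoted earlier in the thread, in the same column order.
ISX = (9.0, 55.0, 16.0)


def slowdown_vs_ugni(ugni, large, fast):
    """Return (large / ugni, fast / ugni); 1.0 means parity with ugni."""
    return large / ugni, fast / ugni


def report(name, table):
    """Print one line of slowdown factors per node count."""
    for nodes, times in sorted(table.items()):
        lrg, fast = slowdown_vs_ugni(*times)
        print(f"{name} {nodes:>2} nodes: gn-ib-lrg {lrg:.2f}x, gn-ib-fast {fast:.2f}x")


if __name__ == "__main__":
    report("Size D", SIZE_D)
    report("Size E", SIZE_E)
    isx_lrg, isx_fast = slowdown_vs_ugni(*ISX)
    print(f"ISx: segment large {isx_lrg:.2f}x, segment fast {isx_fast:.2f}x")
```

In every row, segment fast stays within roughly 1.3x of ugni, while segment large is between about 2.4x and 4.9x slower.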

I don't think there's anything for you to do here. Improving IB performance is one of the things I'm working on for the next release, and I expect improvements here will fall out of that work. This code is mostly doing bulk comm, and IB hardware is great at that, so we should be able to match Aries here. Note that this is 56 Gb/s FDR InfiniBand (newer hardware should be able to do even better).

ronawho commented 4 years ago

Nightly results are available here

For now, I recommend using CHPL_GASNET_SEGMENT=fast for gasnet ibv.
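As a rough sketch of what that recommendation looks like in practice, the helper below builds an environment with the Chapel settings for gasnet over ibv. `CHPL_COMM`, `CHPL_COMM_SUBSTRATE`, and `CHPL_GASNET_SEGMENT` are Chapel's own configuration variables; the helper name `gasnet_ibv_env` and the idea of driving the build from Python are assumptions made for illustration. These settings are part of the Chapel configuration, so the runtime and the program need to be built with them in place.

```python
import os


def gasnet_ibv_env(segment="fast", base=None):
    """Return a copy of the environment configured for gasnet over ibv.

    Hypothetical helper: only the two segment modes discussed in this
    thread ("fast" and "large") are accepted here.
    """
    if segment not in ("fast", "large"):
        raise ValueError(f"unexpected segment mode: {segment!r}")
    env = dict(os.environ if base is None else base)
    env.update(
        CHPL_COMM="gasnet",
        CHPL_COMM_SUBSTRATE="ibv",
        CHPL_GASNET_SEGMENT=segment,
    )
    return env


if __name__ == "__main__":
    env = gasnet_ibv_env("fast")
    for key in ("CHPL_COMM", "CHPL_COMM_SUBSTRATE", "CHPL_GASNET_SEGMENT"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` for whatever build command you use.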

ronawho commented 3 years ago

There have been some recent improvements to InfiniBand performance (and I'm expecting we'll have a 1.24.1 release that includes these).

Reminder that we have 2 modes we can run gasnet ibv with: segment large, and segment fast. Segment large dynamically registers memory so it gets good NUMA affinity, but comm performance historically suffered. Segment fast statically registers at startup so NUMA affinity will be wrong and startup is slow, but comm will be fast.

https://github.com/chapel-lang/chapel/pull/17405 improves the startup time and slightly improves NUMA affinity for segment fast. https://github.com/chapel-lang/chapel/pull/17505 significantly improves comm performance for segment large.

The nightly results are available here. On 03/18 we see slight FT improvements for gn-ibv-fast due to slightly better NUMA affinity and significant reductions in startup time. On 04/01 we see significant FT improvements for gn-ibv-large due to comm improvements.

I no longer have access to a machine with per-node hardware similar to an XC, so I can't gather ugni comparisons, but here are some large vs fast segment scaling comparisons on that same nightly machine:

Size D:

| nodes | gn-ibv-large | gn-ibv-fast |
|---|---|---|
| 8 | 34.76s | 36.02s |
| 16 | 19.29s | 20.86s |
| 32 | 11.24s | 10.51s |
| 64 | 5.82s | 5.84s |

Size E:

| nodes | gn-ibv-large | gn-ibv-fast |
|---|---|---|
| 16 | 206.37s | 164.57s |
| 32 | 106.59s | 88.11s |
| 64 | 59.49s | 45.78s |

For size D the runtimes are about the same (large is slightly ahead at 8 and 16 nodes, likely because of better NUMA affinity). Segment large lags at the larger problem size, which isn't surprising since there is more work to do here, but it is significantly better than before. I'd probably still recommend CHPL_GASNET_SEGMENT=fast for this: startup times aren't bad anymore, and you'll still get better performance at larger problem sizes.
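The comparison can be checked with a little arithmetic on the tables above. This is a minimal sketch with made-up helper names: it computes the large/fast runtime ratio per node count and a simple strong-scaling efficiency relative to the smallest node count in each table.

```python
# Runtimes in seconds, copied from the tables above.
# Each entry maps nodes -> (gn-ibv-large, gn-ibv-fast).
SIZE_D = {8: (34.76, 36.02), 16: (19.29, 20.86), 32: (11.24, 10.51), 64: (5.82, 5.84)}
SIZE_E = {16: (206.37, 164.57), 32: (106.59, 88.11), 64: (59.49, 45.78)}
LARGE, FAST = 0, 1  # column indices


def large_over_fast(table):
    """Ratio of segment-large to segment-fast runtime; above 1.0 means fast wins."""
    return {nodes: times[LARGE] / times[FAST] for nodes, times in table.items()}


def scaling_efficiency(table, column):
    """Strong-scaling efficiency relative to the smallest node count (1.0 is ideal)."""
    base_nodes = min(table)
    base_time = table[base_nodes][column]
    return {
        nodes: (base_time * base_nodes) / (times[column] * nodes)
        for nodes, times in table.items()
    }


if __name__ == "__main__":
    for name, table in (("Size D", SIZE_D), ("Size E", SIZE_E)):
        ratios = large_over_fast(table)
        eff = scaling_efficiency(table, FAST)
        for nodes in sorted(table):
            print(f"{name} {nodes:>2} nodes: large/fast {ratios[nodes]:.2f}, "
                  f"fast efficiency {eff[nodes]:.2f}")
```

For size D the ratio stays within about 8% of 1.0 at every node count, while for size E segment large is roughly 20% to 30% slower than segment fast.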