Gather some InfiniBand timings

ronawho commented 4 years ago

Spawned off from https://github.com/npadmana/DistributedFFT/issues/48#issuecomment-541276144: "I haven't run this on our local cluster yet (and I don't have enough nodes to do a real stress test), but it would be interesting to see how the code does under gasnet (for a cray but also for ib)."

All our timings so far have been with ugni on Cray-XCs. It'd be interesting to gather some IB results too.

I can do some smaller-scale IB runs. For sure 32 nodes, possibly 64. I would bet performance currently lags on IB by ~2x. That's something we're actively working on. IB hardware has good bulk comm performance, so hopefully within the next release or two IB performance should be on par.

We run gasnet in 2 modes: segment large and segment fast. Segment large dynamically registers memory so it gets good NUMA affinity, but comm performance suffers. Segment fast statically registers at startup so NUMA affinity will be wrong, but comm will be fast. See the "InfiniBand" and "Performance Portability" sections of https://chapel-lang.org/releaseNotes/1.20/06-perf-opt.pdf. ISx performance will most closely match this application (in that we're doing lots of bulk comm as well as some local computations where NUMA affinity matters). On similar per-node hardware, ugni takes 9s, gasnet-ibv segment fast takes 16s, and gasnet-ibv segment large takes 55s for ISx.

edit:

More details and a better summary of gn-ibv perf is captured in https://github.com/chapel-lang/chapel/issues/14438

ronawho commented 4 years ago

Here are some runs from a Cray-CS that has similar per-node hardware to the main Cray-XC runs we did (36-core 2.1 GHz Broadwell CPUs with 128 GB RAM). My runs are with Chapel 1.20, intel 19.0.3.199, and cray-fftw 3.3.8.4 (so similar software as well):

Size D:

nodes	ugni	gn-ib-lrg	gn-ib-fast
8	32.63s	148.76s	37.93s
16	16.99s	83.42s	20.10s
32	10.40s	24.69s	10.69s

Size E:

nodes	ugni	gn-ib-lrg	gn-ib-fast
16	162.79s	526.58s	204.00s
32	88.04s	295.76s	103.86s

gn-ib-lrg suffers, but gn-ib-fast performance is relatively close to ugni, which means that comm performance is more important than NUMA affinity for this benchmark (which we knew already).

(This matches what I was expecting and roughly tracks ISx ratios at 16 nodes.)
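To make the comparison concrete, here is a minimal Python sketch that turns the two tables above into slowdown factors relative to ugni. The runtimes are copied from the tables; the names `TIMINGS` and `slowdown_vs_ugni` are made up for illustration and are not part of DistributedFFT or Chapel.

```python
# Runtimes in seconds, copied from the Size D and Size E tables above.
# Layout: problem size -> node count -> configuration -> seconds.
TIMINGS = {
    "D": {
        8: {"ugni": 32.63, "gn-ib-lrg": 148.76, "gn-ib-fast": 37.93},
        16: {"ugni": 16.99, "gn-ib-lrg": 83.42, "gn-ib-fast": 20.10},
        32: {"ugni": 10.40, "gn-ib-lrg": 24.69, "gn-ib-fast": 10.69},
    },
    "E": {
        16: {"ugni": 162.79, "gn-ib-lrg": 526.58, "gn-ib-fast": 204.00},
        32: {"ugni": 88.04, "gn-ib-lrg": 295.76, "gn-ib-fast": 103.86},
    },
}


def slowdown_vs_ugni(row):
    """Return each gasnet configuration's runtime divided by the ugni runtime."""
    base = row["ugni"]
    return {name: t / base for name, t in row.items() if name != "ugni"}


if __name__ == "__main__":
    for size, by_nodes in TIMINGS.items():
        for nodes, row in sorted(by_nodes.items()):
            ratios = slowdown_vs_ugni(row)
            parts = ", ".join(f"{name} {r:.2f}x" for name, r in ratios.items())
            print(f"size {size}, {nodes:>2} nodes: {parts}")
```

For size D at 16 nodes this gives roughly 1.18x for gn-ib-fast and 4.91x for gn-ib-lrg. The ISx numbers quoted earlier in this thread (9s, 16s, 55s) correspond to about 1.8x and 6.1x.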

I don't think there's anything for you to do here. Improving IB performance is one of the things I'm working on for the next release, and I expect improvements here will fall out of that work. This code is mostly doing bulk comm, and IB hardware is great at that, so we should be able to match Aries here. Note that this is 56 Gb/s FDR InfiniBand (newer hardware should be able to do even better).

ronawho commented 4 years ago

Nightly results are available here

For now, I recommend using CHPL_GASNET_SEGMENT=fast for gasnet ibv.
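As a rough illustration of that recommendation, the Python sketch below builds an environment with the segment setting applied. Only `CHPL_GASNET_SEGMENT` comes from this thread; `CHPL_COMM=gasnet` and `CHPL_COMM_SUBSTRATE=ibv` are the usual Chapel settings for selecting gasnet over InfiniBand, and the helper name `gasnet_ibv_env` is hypothetical. This assumes the variables need to be in effect when the Chapel runtime and the program are built, so check the Chapel documentation for your version.

```python
import os


def gasnet_ibv_env(segment="fast"):
    """Return a copy of the current environment configured for gasnet ibv.

    Hypothetical helper: only the two segment modes discussed in this
    thread are accepted.
    """
    if segment not in ("fast", "large"):
        raise ValueError(f"unexpected segment mode: {segment!r}")
    env = dict(os.environ)
    env.update(
        {
            "CHPL_COMM": "gasnet",           # assumption: not stated in the thread
            "CHPL_COMM_SUBSTRATE": "ibv",    # assumption: not stated in the thread
            "CHPL_GASNET_SEGMENT": segment,  # the setting recommended above
        }
    )
    return env


if __name__ == "__main__":
    env = gasnet_ibv_env("fast")
    # Print shell-style export lines that could be pasted into a job script.
    for key in ("CHPL_COMM", "CHPL_COMM_SUBSTRATE", "CHPL_GASNET_SEGMENT"):
        print(f"export {key}={env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` when driving a build or a run from a script.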

ronawho commented 3 years ago

There have been some recent improvements to InfiniBand performance (and I'm expecting we'll have a 1.24.1 release that includes these.)

Reminder that we have 2 modes we can run gasnet ibv with: segment large, and segment fast. Segment large dynamically registers memory so it gets good NUMA affinity, but comm performance historically suffered. Segment fast statically registers at startup so NUMA affinity will be wrong and startup is slow, but comm will be fast.

https://github.com/chapel-lang/chapel/pull/17405 improves the startup time and slightly improves NUMA affinity for segment fast. https://github.com/chapel-lang/chapel/pull/17505 significantly improves comm performance for segment large.

The nightly results are available here. On 03/18 we see slight FT improvements for gn-ibv-fast due to slightly better NUMA affinity and significant reductions in startup time. On 04/01 we see significant FT improvements for gn-ibv-large due to comm improvements.

I no longer have access to a machine that has similar per-node hardware to an XC, so I can't gather ugni comparisons, but here are some large vs fast segment scaling comparisons on that same nightly machine:

Size D:

nodes	gn-ibv-large	gn-ibv-fast
8	34.76s	36.02s
16	19.29s	20.86s
32	11.24s	10.51s
64	5.82s	5.84s

Size E:

nodes	gn-ibv-large	gn-ibv-fast
16	206.37s	164.57s
32	106.59s	88.11s
64	59.49s	45.78s

For size D the runtimes are about the same (large is slightly ahead at 8 and 16 nodes because of better NUMA affinity). Segment large lags at the larger problem size (this isn't surprising, there is more work to do here), but it is significantly better than before. I'd probably still recommend CHPL_GASNET_SEGMENT=fast for this: startup times aren't bad anymore and you'll still get better performance at larger problem sizes.
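A small Python sketch of the same comparison, using the runtimes from the two tables above. It computes the large/fast runtime ratio at each node count and the strong-scaling efficiency relative to the smallest node count in each table. The function names are illustrative only.

```python
# Runtimes in seconds from the tables above: nodes -> (gn-ibv-large, gn-ibv-fast).
TIMINGS = {
    "D": {8: (34.76, 36.02), 16: (19.29, 20.86), 32: (11.24, 10.51), 64: (5.82, 5.84)},
    "E": {16: (206.37, 164.57), 32: (106.59, 88.11), 64: (59.49, 45.78)},
}
LARGE, FAST = 0, 1  # column indices into the tuples above


def large_over_fast(size):
    """Ratio of segment-large to segment-fast runtime; above 1 means large is slower."""
    return {n: t[LARGE] / t[FAST] for n, t in sorted(TIMINGS[size].items())}


def scaling_efficiency(size, column):
    """Strong-scaling efficiency relative to the smallest node count measured."""
    rows = sorted(TIMINGS[size].items())
    n0, t0 = rows[0][0], rows[0][1][column]
    return {n: (t0 * n0) / (t[column] * n) for n, t in rows}


if __name__ == "__main__":
    for size in TIMINGS:
        for n, ratio in large_over_fast(size).items():
            print(f"size {size}, {n:>2} nodes: large/fast = {ratio:.2f}")
        for name, col in (("large", LARGE), ("fast", FAST)):
            eff = scaling_efficiency(size, col)
            print(f"size {size}, {name} efficiency:",
                  ", ".join(f"{n}: {e:.2f}" for n, e in eff.items()))
```

For size E the large/fast ratio stays between about 1.2 and 1.3 across 16 to 64 nodes, while for size D it stays within roughly 8% of 1.0 in either direction.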
