NVSHMEM is an implementation of OpenSHMEM for Nvidia GPUs:
https://developer.nvidia.com/nvshmem
https://docs.nvidia.com/hpc-sdk/nvshmem/api/docs/index.html
It is essentially an alternative to MPI that allows the GPUs to communicate directly with the interconnect, instead of going through the CPU for MPI communications. The API is very similar to MPI but with slightly different terminology (init, finalize, PEs, teams, put/get, collective ops). Additionally, the memory model is slightly different.
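To make the terminology mapping concrete, here is a minimal sketch of the SHMEM-style API, assuming an NVIDIA GPU and the NVSHMEM library are available (compile with nvcc and link -lnvshmem); the ring-neighbor exchange is just an illustration, not from any real codebase:

```c
/* Minimal NVSHMEM sketch: each PE writes its rank into the symmetric
 * buffer of its right neighbor with a one-sided put. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>

int main(void) {
    nvshmem_init();                    /* analogous to MPI_Init      */
    int mype = nvshmem_my_pe();        /* analogous to MPI_Comm_rank */
    int npes = nvshmem_n_pes();        /* analogous to MPI_Comm_size */

    /* Symmetric allocation: every PE allocates the same-sized buffer
     * on the symmetric heap, so remote PEs can address it directly.  */
    int *dst = (int *)nvshmem_malloc(sizeof(int));

    int peer = (mype + 1) % npes;
    nvshmem_int_p(dst, mype, peer);    /* one-sided put to the peer  */
    nvshmem_barrier_all();             /* collective sync            */

    /* dst lives in device memory, so copy it back before printing.  */
    int received;
    cudaMemcpy(&received, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d of %d received %d\n", mype, npes, received);

    nvshmem_free(dst);
    nvshmem_finalize();                /* analogous to MPI_Finalize  */
    return 0;
}
```

The key memory-model difference shows up in nvshmem_malloc: allocations are "symmetric", meaning every PE performs the same allocation, which is what lets a put target a remote PE's buffer without any matching receive on the other side.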
This would be a great way to optimize the boundary exchanges, which currently account for the majority of communication overhead in the multi-GPU scenario. A big downside is that mixing MPI and NVSHMEM in the same program is awkward: NVSHMEM can be bootstrapped on top of an existing MPI communicator, but the two models don't compose cleanly beyond that. You might be able to write a wrapper library that defers to either MPI or NVSHMEM depending on whether GPUs are enabled, but more likely you will need separate binaries for CPU and GPU.
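The wrapper idea could be sketched as a single halo-exchange entry point selected at build time; halo_exchange() and its buffer layout are hypothetical names for illustration, not an existing interface:

```c
/* Hypothetical wrapper: one halo-exchange entry point that compiles
 * down to either NVSHMEM or MPI depending on a build flag, so the
 * solver code stays the same in both binaries. */
#include <stddef.h>

#ifdef USE_NVSHMEM
#include <nvshmem.h>

/* GPU path: one-sided put straight into the neighbor's symmetric
 * halo buffer, then a barrier so every PE sees its new ghost cells. */
void halo_exchange(double *remote_halo, const double *local_edge,
                   size_t n, int neighbor)
{
    nvshmem_double_put(remote_halo, local_edge, n, neighbor);
    nvshmem_barrier_all();
}

#else
#include <mpi.h>

/* CPU path: classic two-sided exchange through the MPI stack. */
void halo_exchange(double *recv_halo, const double *local_edge,
                   size_t n, int neighbor)
{
    MPI_Sendrecv(local_edge, (int)n, MPI_DOUBLE, neighbor, 0,
                 recv_halo,  (int)n, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
#endif
```

Note the asymmetry this sketch papers over: the NVSHMEM path is one-sided (the target PE does nothing), while the MPI path needs both ranks to call in, which is one reason a clean common wrapper is harder than it looks and separate binaries may end up simpler.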