After consultation with Rolf vandeVaart, I now understand this is more complicated than I first thought.
OpenMPI's implementation of this does not allow for mixed GPU topologies, i.e., you cannot have a system where some GPUs support peer-to-peer between each other and some do not. If OpenMPI were used on such a system, undefined behaviour would result.
MVAPICH, on the other hand, does support this, but at the expense of having to set the CUDA device before MPI is initialized. For this to work, we have to ensure that initQuda() is called before MPI is initialized. Justin / Balint, is there any problem with this for MILC / Chroma respectively? This will likely require some modification to what initQuda() does, perhaps splitting initQuda into two, since it currently expects that the machine's network topology has already been set.
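(For illustration, here is a minimal sketch of the ordering this would imply, assuming initQuda were split into a device-binding stage and a topology-dependent stage. The stub function names below are illustrative placeholders, not an existing QUDA interface.)

    // Illustrative only: a hypothetical two-stage split of initQuda(), so that the
    // CUDA device is bound before MPI_Init (and hence before MVAPICH creates its
    // default CUDA context). The stub names are placeholders, not QUDA's API.
    #include <mpi.h>
    #include <cuda_runtime.h>

    static void initQudaDeviceStub(int device)
    {
      cudaSetDevice(device);   // stage 1: bind this process to its GPU
    }

    static void initQudaMemoryStub()
    {
      // stage 2: the remainder of initQuda, which may assume the network
      // topology / process grid has already been established
    }

    int main(int argc, char **argv)
    {
      int device = 0;               // in practice: this process's local rank on the node

      initQudaDeviceStub(device);   // before MPI_Init, as MVAPICH requires
      MPI_Init(&argc, &argv);       // MVAPICH's default context lands on the chosen device
      // ... establish the communication grid ...
      initQudaMemoryStub();         // topology-dependent initialization

      MPI_Finalize();
      return 0;
    }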
(Copying this message from Balint so we don't forget it)
This is probably not relevant until the new version 0.5 is out, but I thought it may be useful for the future:
While looking at documentation about threading, I came across this snippet in the MVAPICH2 user guide, which may be relevant when you want to use MPI calls to communicate directly between GPUs (with a single MPI call):
GPU Affinity: When multiple GPUs are present on a node, users might want to set the MPI process affinity to a particular GPU using cuda calls like cudaSetDevice(). This can be done after MPI_Init based on MPI rank of the process. But MVAPICH2 performs some cuda operations like buffer registration and others during MPI_Init which result in default context creation. Hence, setting GPU Affinity after MPI_Init could create issues due to the context switch. To avoid this, MVAPICH2 provides an environment variable called MV2_COMM_WORLD_LOCAL_RANK to get the local rank of a process on its node before MPI_Init is called. This local rank information can be used to set GPU affinity before MPI_Init is called as given in the following code example:

    ...
    int local_rank = atoi(getenv("MV2_COMM_WORLD_LOCAL_RANK"));
    cudaSetDevice(local_rank % num_devices);
    ...
    MPI_Init()
    ...
and slightly before:
For example, before MVAPICH2-1.8, a typical user might be using the following sequence of commands to move data from one device memory to another device memory:

    ...
    cudaMemcpy(host_buf, device_buf, size, cudaMemcpyDeviceToHost);
    MPI_Isend(host_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, req);
    ...

With the support provided in MVAPICH2-1.8 and CUDA 4.0 (and later), the user can achieve the same data movement operation by explicitly specifying MPI calls on device memory:

    ...
    MPI_Isend(device_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, req);
    ...

This support can be enabled by configuring MVAPICH2 with --enable-cuda and setting the environment variable MV2_USE_CUDA (11.121) to 1 at run time.
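For completeness, here is a hedged sketch of what a full device-pointer exchange might look like with such a CUDA-aware MPI build; it assumes the library was configured with --enable-cuda and MV2_USE_CUDA=1 is set at run time, and the buffer size and tag are arbitrary:

    // Hedged sketch: two ranks swap device-resident buffers with no explicit
    // host staging, relying on a CUDA-aware MPI (e.g. MVAPICH2-1.8 built with
    // --enable-cuda and run with MV2_USE_CUDA=1). No error checking.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int count = 1 << 20;
      char *send_buf, *recv_buf;
      cudaMalloc(reinterpret_cast<void **>(&send_buf), count);
      cudaMalloc(reinterpret_cast<void **>(&recv_buf), count);

      if (rank < 2) {                     // ranks 0 and 1 exchange their buffers
        int peer = 1 - rank;
        MPI_Request req[2];
        MPI_Irecv(recv_buf, count, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(send_buf, count, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
      }

      cudaFree(send_buf);
      cudaFree(recv_buf);
      MPI_Finalize();
      return 0;
    }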
This creates an interesting issue: If we build MVAPICH2-1.8 with
--enable-cuda
then we will have to call initQuda() before we call MPI_Init()/MPI_Init_thread() or QMP_initialize(), since MPI_Init() will "result in default context creation".
The problem will be getting the (logical) local rank on the node before calling MPI_Init(). MVAPICH provides a way, but it is not uniformly portable across all MPIs.
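One possible workaround is to probe several launcher-specific environment variables and fall back gracefully. The sketch below is only a hedged illustration; the Open MPI and SLURM variable names (OMPI_COMM_WORLD_LOCAL_RANK, SLURM_LOCALID) are assumptions not taken from this thread:

    // Hedged sketch: guess the node-local rank before MPI_Init by checking
    // launcher-specific environment variables. None of these is guaranteed by
    // the MPI standard; availability depends on the MPI library and launcher.
    #include <cstdio>
    #include <cstdlib>

    static int guessLocalRank()
    {
      const char *candidates[] = { "MV2_COMM_WORLD_LOCAL_RANK",   // MVAPICH2 (see above)
                                   "OMPI_COMM_WORLD_LOCAL_RANK",  // Open MPI (assumed)
                                   "SLURM_LOCALID" };             // SLURM (assumed)
      for (int i = 0; i < 3; i++) {
        const char *value = getenv(candidates[i]);
        if (value) return atoi(value);
      }
      return -1;   // unknown: caller must fall back to some default device mapping
    }

    int main()
    {
      printf("guessed local rank: %d\n", guessLocalRank());
      return 0;
    }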
Issue #84 blocks this issue.
An update on my thoughts here. Given the flaky state of GPU-aware MPI libraries (both OpenMPI and MVAPICH have their issues), I think the best approach is first to support the cudaIPC API, which would give peer-to-peer functionality within multi-GPU nodes and optimal performance; MPI would continue to be used for inter-node communication. I think this should be fairly easy to achieve, will work with both OpenMPI and MVAPICH, and gives the code the flexibility to work on machines with mixed IOH topologies.
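As a rough illustration of the cudaIPC approach (a hedged sketch, not QUDA code: it assumes ranks 0 and 1 live on the same node with peer-capable GPUs, and it omits all error checking):

    // Hedged sketch of intra-node exchange via the cudaIPC API: rank 0 exports a
    // handle to its device buffer, rank 1 opens it and copies device-to-device,
    // and MPI only carries the small host-side handle.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const size_t bytes = 1 << 20;
      void *buf;
      cudaMalloc(&buf, bytes);

      if (rank == 0) {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, buf);            // export the local device buffer
        MPI_Send(&handle, static_cast<int>(sizeof(handle)), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, static_cast<int>(sizeof(handle)), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        void *peer_buf;
        cudaIpcOpenMemHandle(&peer_buf, handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(buf, peer_buf, bytes, cudaMemcpyDeviceToDevice);  // direct GPU-to-GPU copy
        cudaIpcCloseMemHandle(peer_buf);
      }

      MPI_Barrier(MPI_COMM_WORLD);   // keep rank 0's buffer alive until rank 1 is done
      cudaFree(buf);
      MPI_Finalize();
      return 0;
    }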
I have begun work on this. One thing that came up is that we will have to change where the FaceBuffer class lives. To support GPUDirect-like functionality, the FaceBuffer class needs direct access to the fields that constitute the cudaColorSpinorField class. Moreover, in order to use persistent sends/receives (which is a design requirement), we effectively need to be able to construct a different FaceBuffer for each cudaColorSpinorField, rather than a common one which presently lives in the Dirac operator class.
My proposed solution is to move the FaceBuffer class into the cudaColorSpinorField class. This can mean either that an instance of FaceBuffer lives inside each cudaColorSpinorField instance, or that we move the functionality into the field and deprecate FaceBuffer. Either way, each "FaceBuffer" will have direct access to the cudaColorSpinorField data. This will not increase the device memory footprint, since the input buffer will be the static ghost buffer and the output buffer is the ghost zone that is already allocated for every cudaColorSpinorField. When not using GPUDirect, we will use static buffers on the host to prevent host memory bloat.
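A rough structural sketch of what this could look like (class layout and member names are illustrative, not QUDA's actual code):

    // Illustrative sketch only: each cudaColorSpinorField owns its own "FaceBuffer",
    // so persistent sends/receives can be bound directly to that field's ghost zones.
    struct FaceBuffer {
      void *send_ghost[4];   // outgoing faces: the field's existing ghost zones
      void *recv_ghost[4];   // incoming faces: the static ghost buffer
      // persistent communication requests (e.g. MPI_Send_init/MPI_Recv_init handles)
      // would live here, one set per field
    };

    class cudaColorSpinorField {
    public:
      void *v;               // device spinor data
      void *ghost[4];        // per-field ghost zones, already allocated today
      FaceBuffer face;       // proposed: one FaceBuffer per field (alternatively,
                             // fold its functionality into the field and retire the class)
      // exchangeGhost() would post the persistent transfers directly from/into
      // ghost[], with no host staging when GPU-aware communication is enabled
    };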
I note that FaceBuffer functionality is also required for the gauge fields, but if we finish the merge of the cudaColorSpinorField and LatticeField derivatives, then this becomes a non-issue and will give the code a big cleanup.
Any thoughts?
I've always thought it would be more logical for FaceBuffer to reside in cudaColorSpinorField, so no objections here :-).
Here are the MVAPICH env variables that one should use in order to have cudaSetDevice() called before MPI_Init:
13.1 MV2_COMM_WORLD_LOCAL_RANK
Class: Run time. Applicable interface(s): All.
The local rank of a process on a node within its job. The local rank ranges from 0, 1, ..., N-1 on a node with N processes running on it.
13.2 MV2_COMM_WORLD_LOCAL_SIZE
Class: Run time. Applicable interface(s): All.
The number of ranks from this job that are running on this node.
13.3 MV2_COMM_WORLD_RANK
Class: Run time. Applicable interface(s): All.
The MPI rank of this process in the current MPI job.
13.4 MV2_COMM_WORLD_SIZE
Class: Run time. Applicable interface(s): All.
The number of processes in this MPI job's MPI_COMM_WORLD.
This feature is now present in the quda-0.7 branch, with support currently limited to the dslash and extended gauge field exchange. Thus GPUDirect RDMA is now supported, though performance is actually slower than without it. To enable this feature, the --enable-gpu-comms option should be specified at configure time (equivalently, set GPU_COMMS=yes in make.inc).
Leaving this issue open, since half precision is not yet supported (requires some further thought and clean up).
Closing this issue since half precision support is now included. There is some further optimization ongoing with respect to this (issue #172).
The latest OpenMPI and MVAPICH libraries support GPU pointers directly. This is something we should support in QUDA as soon as possible, as it has huge potential for better scaling:
I propose that this be a configure-time setting, with something like --enable-gpu-mpi, which will set a C preprocessor flag that removes the host <-> device memory copies and passes the GPU pointers directly to the communications interface (QMP and MPI).
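A hedged sketch of how such a switch might look; GPU_COMMS is an illustrative macro name for whatever --enable-gpu-mpi would define, and send_face() and its arguments are hypothetical:

    // Illustrative sketch: a configure-time preprocessor flag selects between
    // staging through a host buffer and handing the device pointer straight to MPI.
    // GPU_COMMS and send_face() are hypothetical names, not QUDA's interface.
    #include <mpi.h>
    #include <cuda_runtime.h>

    void send_face(void *device_buf, void *host_buf, int count, int dest,
                   MPI_Comm comm, MPI_Request *request)
    {
    #ifdef GPU_COMMS
      // GPU-aware MPI: the library reads the device buffer directly, no copy
      MPI_Isend(device_buf, count, MPI_BYTE, dest, 0, comm, request);
    #else
      // Fallback: stage the face through the host, then send the host buffer
      cudaMemcpy(host_buf, device_buf, count, cudaMemcpyDeviceToHost);
      MPI_Isend(host_buf, count, MPI_BYTE, dest, 0, comm, request);
    #endif
    }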