lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Allow user to specify logical topology for multi-GPU communications #46

Closed (rbabich closed this issue 11 years ago)

rbabich commented 12 years ago

At present, to properly run an application built with QUDA over QMP, it's necessary to specify "-geom Px Py Pz Pt" on the command-line. This is awkward in cases where the application has built-in logic to determine the best layout and is also incompatible with QDP/C, as summarized by James Osborn:

One issue with interfacing multi-GPU to QDP at the moment will be that QDP isn't currently setting the logical topology. This was changed to support multi-lattice in QDP, since one might not want the same node mapping on each lattice, and QMP didn't have communicator support. Now that QMP does, I could create a new communicator for each lattice and set each one's topology, but my concern is that MPI communicators could be expensive in memory and I don't want to rely on this. I'm planning to add some sort of light-weight communicators to QMP to address this.

Another issue is that the QMP topology has its own fixed mapping of the ranks to the logical topology, which may not be optimal. Right now QDP is using a different mapping which was a little better in some cases. I am also planning to allow the QMP mapping to be more flexible, but haven't gotten to this yet.

Anyway, the main point is that it would be nice if QUDA didn't rely on the QMP topology, but instead allowed the user to pass in a function (or functions) that specified the rank->coords and coords->rank mappings. That would allow much greater flexibility for the applications using QUDA. Additionally, allowing a QMP communicator to be specified would be even better. You said that some groups may want to port QMP and not use communicators, but it should be possible for those ports to still keep the same API (with the communicator structure) and just have it always be the same one (basically make QMP_comm_split always fail).

At this stage, I'd suggest not going so far as to rely on QMP communicators, which are still an "alpha" feature, but allowing the user to pass in a mapping function seems like a nice solution. This would also add much-needed flexibility to the MPI code path, which currently assumes a simple lexicographical ordering when assigning logical grid coordinates to MPI ranks.

To summarize, I propose replacing this declaration:

 void initCommsQuda(int argc, char **argv, const int *X, const int nDim);

with:

 typedef int (*QudaCommsMap)(const int *x, void *fdata);
 void initCommsQuda(const int *X, const int nDim, QudaCommsMap func, void *fdata);

Here fdata points to any auxiliary data required by the user-supplied mapping function func(). Passing NULL for fdata is perfectly valid. As an implementation detail, note that since we'll no longer be able to assume the existence of a QMP logical topology, we'll have to eliminate the use of "relative" sends and receives in face_qmp.cpp. This is a minor inconvenience; again quoting James Osborn:

The relative sends were just a cached version of the calculation of (get my coords) -> (add 1 mod length) -> (get rank). They aren't necessary (and were never used by QDP), since you can just create the neighbor table yourself and use the regular send.
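For concreteness, here is a minimal sketch of what a user-supplied mapping function might look like under the proposed interface. The lexicographical ordering and the name my_rank_from_coords are purely illustrative placeholders; an application would plug in whatever layout logic it already has.

 /* Example coords->rank mapping: plain lexicographical ordering with the
    last ("t") index varying fastest.  fdata carries the process grid
    dimensions {Px, Py, Pz, Pt}. */
 static int my_rank_from_coords(const int *x, void *fdata)
 {
   const int *P = (const int *) fdata;
   return ((x[0] * P[1] + x[1]) * P[2] + x[2]) * P[3] + x[3];
 }

 /* Hypothetical call site, using the proposed declaration above:
      int P[4] = {Px, Py, Pz, Pt};
      initCommsQuda(X, 4, my_rank_from_coords, (void *) P);
 */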

Comments?

jpfoley commented 12 years ago

I recently became aware of this issue when trying to integrate QUDA with MILC. I think the mapping function you suggest would be really helpful. J.

maddyscientist commented 12 years ago

All sounds reasonable to me. How much work is this?

rbabich commented 12 years ago

This is easy, I think, but I want Balint and Guochun to sign off first, since it requires corresponding changes to Chroma and MILC.

rbabich commented 12 years ago

Note that the user-supplied func() can be a simple wrapper around QMP_get_node_number_from(coords) for anyone who wants to keep doing things the old way. This might be a good option for Chroma.
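Such a wrapper might look like the following sketch (the function name is hypothetical; it assumes the application has already declared a QMP logical topology):

 #include <qmp.h>

 /* Reproduce the old behavior: defer to the QMP logical topology that
    the application has already declared. */
 static int qmp_rank_from_coords(const int *coords, void *fdata)
 {
   (void) fdata;   /* no auxiliary data needed */
   return QMP_get_node_number_from(coords);
 }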

maddyscientist commented 12 years ago

Should this be 0.4.0? Wouldn't 0.4.1 be more appropriate?

maddyscientist commented 12 years ago

Adding another reason for this: it makes multi-GPU in QUDA for BQCD much less hacky (issue 73). To enable support for it currently in BQCD, I have to add a comm_set_gridsize interface to the outside world so that BQCD can communicate its MPI topology to QUDA.

rbabich commented 11 years ago

I'm about to push a commit that implements this. From quda.h:

/**
 * initCommsGridQuda() takes an optional "rank_from_coords" argument that
 * should be a pointer to a user-defined function with this prototype.  
 *
 * @param coords  Node coordinates
 * @param fdata   Any auxiliary data needed by the function
 * @return        MPI rank or QMP node ID corresponding to the node coordinates
 *
 * @see initCommsGridQuda
 */
typedef int (*QudaCommsMap)(const int *coords, void *fdata);

/**
 * Declare the grid mapping ("logical topology" in QMP parlance)
 * used for communications in a multi-GPU grid.  This function
 * should be called prior to initQuda().  The only case in which
 * it's optional is when QMP is used for communication and the
 * logical topology has already been declared by the application.
 *
 * @param nDim   Number of grid dimensions.  "4" is the only supported
 *               value currently.
 *
 * @param dims   Array of grid dimensions.  dims[0]*dims[1]*dims[2]*dims[3]
 *               must equal the total number of MPI ranks or QMP nodes.
 *
 * @param func   Pointer to a user-supplied function that maps coordinates
 *               in the communication grid to MPI ranks (or QMP node IDs).
 *               If the pointer is NULL, the default mapping depends on
 *               whether QMP or MPI is being used for communication.  With
 *               QMP, the existing logical topology is used if it's been
 *               declared.  With MPI or as a fallback with QMP, the default
 *               ordering is lexicographical with the fourth ("t") index
 *               varying fastest.
 *
 * @param fdata  Pointer to any data required by "func" (may be NULL)               
 *
 * @see QudaCommsMap
 */
void initCommsGridQuda(int nDim, const int *dims, QudaCommsMap func, void *fdata);
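For reference, a minimal usage sketch of the new interface (the application-side names grid_info and rank_from_coords are purely illustrative and not part of the QUDA API):

 #include <quda.h>

 /* Auxiliary data handed to the mapping function via fdata. */
 struct grid_info {
   int ndim;
   int dims[4];   /* process grid dimensions {Px, Py, Pz, Pt} */
 };

 /* Lexicographical coords->rank mapping with the fourth ("t") index
    varying fastest, i.e. the documented default ordering. */
 static int rank_from_coords(const int *coords, void *fdata)
 {
   struct grid_info *g = (struct grid_info *) fdata;
   int rank = coords[0];
   for (int i = 1; i < g->ndim; i++) rank = rank * g->dims[i] + coords[i];
   return rank;
 }

 /* In the application's initialization code, after MPI/QMP setup and
    before initQuda():
      struct grid_info g = { 4, {Px, Py, Pz, Pt} };
      initCommsGridQuda(g.ndim, g.dims, rank_from_coords, (void *) &g);
 */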