paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0
149 stars 106 forks

Splitting grids #119

Closed paboyle closed 6 years ago

paboyle commented 7 years ago

feature/dwf-multirhs

Adds the ability to "split" grids, so that subsets of nodes can each do independent pieces of work, using MPI communicator splitting internally.

Presently we bounce data between them through the file system, using SciDAC I/O records; see tests/solver/Test_dwf_mrhs_cg.cc.

On my laptop this subdivides 2 ranks into two independent CG solves with different stopping conditions, both of which converge with different, decoupled iteration counts.

This is a precursor to getting valence production jobs on CORI-2 running much more efficiently, and to decoupling from the network.

The key idea is that you can pass a "Grid" object into the constructor. If the requested number of processors is smaller than the parent's, it must subdivide the original; MPI_Comm_split is called as appropriate.

Rank mapping is not constrained to match between the old and new communicators in any way. We are at the mercy of MPI here, and can (presently) only communicate between them via the filesystem. Since parallel I/O is now fast, this might remain a long-term option.

(However, we also have the option of passing entire sub-volumes from the big decomposition to the smaller one over the MPI network if this is required; we just need to query what MPI did after the split is done, since the standard doesn't guarantee anything about rank mappings.)

I've emphasised to USQCD/ECP the importance of remaining lean & flexible and assumption free on the communication interface. In addition to supporting SHMEM, this is yet ANOTHER example of why it is important. If I had adopted QMP we would have been completely unable to do this.

Works only under the --enable-comms=mpi target.

chulwoo1 commented 7 years ago

Hi,

I think I'll have to disagree on the last part: there was a considerable effort in QMP to handle multiple communicators, first motivated by the need to 'bundle' different jobs, with the first solution implemented in CPS. There may be validity in Peter's general point, but I don't think it really holds on this particular one.

paboyle commented 7 years ago

Hi Chulwoo,

yes, it looks like there is a lot of work on communicators in QMP of which I was unaware; it postdates the period I remember, when QMP_COMM_WORLD was simply used everywhere.

I still think that having a separate and large library leaves much bigger barriers to such changes, though, and we wouldn't have been able to make this change in 24 hours.

paboyle commented 7 years ago

I looked through QMP again; I had reacted because the send primitives, etc., do not take a communicator argument to specify which communicator to use.

However, since you can split and set the default communicator, you could manage in QMP by switching the "mode" of the background library, swapping the default communicator to the split one and back again.

I prefer having multiple Grid objects with the communicators contained within them, though, since there is no global modality to the library; but the mode switch wouldn't need to be fine-grained for the use case I'm advocating here.

paboyle commented 7 years ago

Running it on Cori now; a first look suggests it worked, but I need to check the I/O performance.

It's probably defaulting to the POSIX driver, and $HOME is limited to 100MB/s (which is what I got).

http://www.nersc.gov/users/computational-systems/cori/file-storage-and-i-o/

erinaldi commented 7 years ago

Hi Peter, I was going to add a similar feature request. I had suggested this to Guido today, who told me to open an issue here on GitHub, so I am glad to discover you had already been thinking about it.

It is true that this is also accomplished (in some form) by QMP: parts of the MILC RHMC executables can do this to bundle smaller jobs into large partitions. I have not looked into how it is implemented, but at the command-line level there is an extra parameter, similar to qmp-geom, for the individual jobs to be bundled.

My interest was in adding serialization of the input files, so that a single Grid instance could take several input files (for different parameters but fixed geometry) and split the work across subsets of nodes. (I see the point about I/O performance, though...) Thank you.

paboyle commented 6 years ago

Closing as the feature is implemented.