Closed mpiforumbot closed 8 years ago
Originally by dougmill AT us DOT ibm DOT COOM on 2010-03-17 12:45:55 -0500
This document is very rough and almost certainly needs help to conform to forum document standards.
Originally by dougmill AT us DOT ibm DOT COOM on 2010-04-12 08:17:58 -0500
v0.4: Fixed a typo in examples, and added several paragraphs preceding the PROPOSAL API section to help clarify the concept.
Originally by dougmill AT us DOT ibm DOT COOM on 2010-06-15 08:14:55 -0500
Attachment added: mpi3_hybrid_20100614.pdf
(33.2 KiB)
Topics for discussion/resolution at next meeting
Originally by dougmill AT us DOT ibm DOT COOM on 2010-10-26 13:05:48 -0500
Attachment added: mpi3_helperthreads.pdf
(80.8 KiB)
Proposal for Helper Threads, V0.7
Originally by dougmill on 2011-02-16 11:11:53 -0600
The proposal is currently being maintained in the "Extended Interfaces" section of the MPI Spec. See the attachments (ei-2-vX.Y.pdf) on the MPI3 Hybrid main wiki for the latest version.
Originally by dougmill on 2011-03-02 09:55:35 -0600
I've got a sample implementation ready now. It is based on BG/P, DCMF, and MPICH2. This code only takes advantage of parallelism provided by the BG/P Collective Network device. For that reason, the test program only uses MPI_Allreduce to demonstrate performance differences. This code includes some changes to DCMF that expose hooks into the BG/P Collective Network parallelism features, and as such is provided as-is and is not intended for production use.
The MPI implementation is provided as MPIX extensions to the DCMFd component of MPICH2. Examining the patch file in comm/lib/mpich2/mpix_helperthreads.patch is useful in gaining an overview of what changed and what was added.
To get the code and view it (if you don't intend to build and run), you can use this script:
#!/bin/sh
set -e
DIR=${HOME}/mpix # where you want to put the code
DCMF_TAR=${DIR}_downloads # where external pkgs are downloaded
GITREPO=http://dcmf.anl-external.org/dcmf.git
BRANCH=MPI3-Forum-ticket217-bgp
mpich_url=http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.1/mpich2-1.1.tar.gz
export DCMF_SUBDIRS=mpich2
export DCMF_TAR
rm -rf ${DIR}
mkdir -p ${DIR}
if [ ! -d "${DCMF_TAR}" ]; then
mkdir -p ${DCMF_TAR}
wget --directory-prefix=${DCMF_TAR} ${mpich_url}
fi
cd ${DIR}
git clone -n ${GITREPO} comm
cd comm
git checkout -b ${BRANCH} origin/${BRANCH}
cd lib
./configure
patch -p1 --force --directory=dev < mpich2/mpix_helperthreads.patch
If you have access to a BG/P FEN and want to build/run the code and/or test program, the following script should extract the code and build the libraries:
#!/bin/sh
set -e
DIR=${HOME}/mpix_helperthreads # where you want to build
FLOOR=/bgsys/drivers/ppcfloor # BG/P installation
DCMF_TAR=/bgsys/downloads/comm # where external pkgs are downloaded
GITREPO=http://dcmf.anl-external.org/dcmf.git
BRANCH=MPI3-Forum-ticket217-bgp
mpich_url=http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.1/mpich2-1.1.tar.gz
export DCMF_SUBDIRS=mpich2
export DCMF_TAR
rm -rf ${DIR}
mkdir -p ${DIR}
if [ ! -d "${DCMF_TAR}" ]; then
mkdir -p ${DCMF_TAR}
wget --directory-prefix=${DCMF_TAR} ${mpich_url}
fi
cd ${DIR}
ln -s ${FLOOR}/arch .
ln -s ${FLOOR}/runtime .
git clone -n ${GITREPO} comm
cd comm
git checkout -b ${BRANCH} origin/${BRANCH}
ln -sf Make.rules.floor Make.rules
make autoconf
cd lib
./configure
patch -p1 --force --directory=dev < mpich2/mpix_helperthreads.patch
cd ..
make lib
Both of the above scripts require that Git is installed. The second (build) script requires a system that has the BG/P software environment installed.
To make the test program and run it (after building libraries), use the following commands:
cd build/mpich2/dcmf-8aint/test/mpid/dcmfd
make
cp mpix_helper_test_pthread /bgp/run/dir
cd /bgp/run/dir
mpirun -mode SMP ... -cwd ${PWD} -exe ${PWD}/mpix_helper_test_pthread
Add the command-line option "--nojoin" to disable use of MPIX_Helper_team_join/_leave. Attached is a graph of the results with and without JOIN/LEAVE.
Originally by dougmill on 2011-03-02 09:56:31 -0600
Attachment added: mpix_helper.png
(5.5 KiB)
Graph of results from test program
Originally by dougmill on 2011-07-22 12:58:00 -0500
Attachment added: mpi3_ticket217_helper.pdf
(68.0 KiB)
Proposal for MPI Helper Threads
Originally by dougmill on 2011-07-22 12:58:15 -0500
Updated the proposed document to include motivation and a more extensive example/tutorial.
Originally by dougmill on 2011-08-26 07:07:43 -0500
Updated doc to clean up wording and add another paragraph to the motivation.
Originally by dougmill on 2011-08-26 07:09:33 -0500
Attachment added: ei-2.tex
(57.9 KiB)
External Interfaces chapter LaTeX with ticket 217 changes
Originally by dougmill on 2011-08-30 14:07:51 -0500
Updated doc with revised example for latest endpoints proposal details.
Originally by dougmill on 2011-09-09 08:26:35 -0500
I've done some more cleanup of the proposal, spelling corrections etc.
I also added an alternate proposal that allows MPI_TEAM_JOIN to specify the "working size" of the team while MPI_TEAM_CREATE specifies the "maximum team size". This works better with OpenMP, since OpenMP does not guarantee the number of threads that will participate until the parallel block is actually entered at run time. This way the team can be created using some maximum size (omp_get_thread_limit()) and then each JOIN-LEAVE block selects the actual number of threads participating. However, it does require that all (matching) calls to JOIN specify the same working team size - or it creates overhead to error-check that.
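To make the alternate proposal concrete, here is a rough sketch of how an OpenMP code might use it. Note that MPI_Team_create, MPI_Team_join, and MPI_Team_leave are the proposed API from this ticket, not standard MPI, and their exact signatures here are assumed:

```
/* Sketch only: the MPI_Team_* calls are the proposed API from this
 * ticket, not standard MPI; signatures are assumed for illustration. */
MPI_Team team;
/* Maximum team size is fixed at create time... */
MPI_Team_create(omp_get_thread_limit(), MPI_INFO_NULL, &team);

#pragma omp parallel
{
    /* ...while under the alternate proposal each JOIN specifies the
     * working size, which is only known once the parallel region
     * is actually entered. */
    MPI_Team_join(omp_get_num_threads(), team);
    /* ... work, MPI calls ... */
    MPI_Team_leave(team);
}
MPI_Team_free(&team);
```

Every thread in the parallel region must pass the same working size to JOIN, matching the constraint described above.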
Originally by dougmill on 2011-09-13 07:50:36 -0500
Since we have not been able to discuss this in a meeting, I've set up a Straw Vote to decide if we should carry this proposal forward.
Please review the latest attachment (mpi3-ticket217-ei-2.pdf) and vote accordingly. Especially if you vote "No", please post why you object or what it is you object to. Thanks.
Note, you will need to login to the wiki before you can see, or cast, votes.
[[Poll(Should the "Teams" (Helper Threads) proposal be carried forward to a Forum reading and vote?; Yes; No)]]
Also, please take a look at the "alternate" proposal that makes the JOIN team size dynamic, and give a quick Yes/No on that as well.
[[Poll(Is the dynamic team size proposal worth considering?; Yes; No)]]
Originally by dougmill on 2011-09-16 15:32:41 -0500
I've started working with a prototype implementation for ticket #288 (MPI Endpoints) and this. The test program I'm currently using shows why it is highly desirable to have something like MPI_TEAM_SYNC, which is functionally equivalent to MPI_TEAM_LEAVE followed immediately by MPI_TEAM_JOIN. I'd like to add that back into the proposal, and have updated the docs accordingly.
Originally by dougmill on 2011-09-16 15:33:36 -0500
Attachment added: mpi3-ticket217-ei-2-alt.pdf
(238.0 KiB)
Alternate proposal for dynamic team size
Originally by dougmill on 2011-09-16 15:55:52 -0500
The situation that led to the desire for MPI_TEAM_SYNC was some OpenMP code like this:
#pragma omp parallel
{
MPI_Team_join(team);
...
do_function(...);
...
MPI_Team_leave(team);
}
void do_function(...)
{
...
# pragma omp barrier
# pragma omp master
{
MPI_Barrier(MPI_COMM_WORLD);
}
# pragma omp barrier
...
}
In this situation, especially with the MPI_Info key balanced=true, the non-master threads need to participate in the MPI_Barrier, even though they are not officially part of it. This could be accomplished using:
...
void do_function(...)
{
...
# pragma omp barrier
# pragma omp master
{
MPI_Barrier(MPI_COMM_WORLD);
}
MPI_Team_leave(team);
MPI_Team_join(team);
...
}
But it is cleaner looking, easier to follow, and more efficient if instead:
...
void do_function(...)
{
...
# pragma omp barrier
# pragma omp master
{
MPI_Barrier(MPI_COMM_WORLD);
}
MPI_Team_sync(team);
...
}
Note that the second omp barrier becomes redundant with MPI_TEAM_LEAVE or MPI_TEAM_SYNC.
Originally by dougmill on 2011-10-06 07:59:03 -0500
Pavan and Jim gave a "yes" vote to proceed, but have reservations about the actual interface. We need to get more detail on what about the interface is objectionable, and what a better interface would look like.
Please post to this ticket what you don't like about the interface, and/or what you think a better interface would be.
Originally by dougmill on 2011-10-18 15:50:54 -0500
I've added a newer version of the document. This includes the "alternate" proposal for MPI_TEAM_JOIN as well, separated and in blue.
This is the version we should review on Friday, to have a first reading at the Forum next week (Oct 24-26).
Originally by dougmill on 2011-10-21 16:36:23 -0500
I've updated the document based on Pavan's feedback today. This is what I expect to read at the Forum on Wednesday. I've also attached some slides to be used as introduction and motivation.
Originally by moody20 on 2011-10-26 10:26:16 -0500
A bunch of Ticket 0 issues:
p16, 5 Does "hands-over" really need the '-'?
p16, 7 "outsome" --> "outcome"
p16, 12 "The Maximum" --> "The maximum"
p16, 24 "(ever)" --> "ever"
p17, 24 "(or MPI_TEAM_BREAK)" --> "or MPI_TEAM_BREAK"
p18, 24 MPI_TEAM_SYNC seems to be another way to LEAVE, so this call should be added to text that lists the ways out of a JOIN.
p18, 39 expand "PEs per process" to "hardware processing elements per MPI process"
p19, 28 add MPI_INFO_FREE call to not leak info object
p20, 20 "it's" --> "its"
be sure that any info keys are listed in A.1.5
Originally by jsquyres on 2011-10-26 11:11:44 -0500
From the reading of mpi3-ticket217-motivation.pdf on 26 Oct 2011:
Originally by moody20 on 2011-10-26 11:33:06 -0500
Since "balanced" really means "a promise not to call break", something like "no_break" is a better info key name. "balanced" does not really mean the work will be evenly distributed among the threads anyway, so the existing name is misleading, and "no_break" is more descriptive for the user.
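Under this suggestion, team creation would look like the following (a sketch only; MPI_Team_create is the proposed API from this ticket, and only the info key name changes relative to the earlier examples):

```
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "no_break", "true");  /* was "balanced" */
MPI_Team_create(N, info, &team);
MPI_Info_free(&info);
```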
Originally by dougmill on 2011-10-27 11:32:02 -0500
I have uploaded a new version of the document.
I have temporarily demoted the ticket 217 text/section to "blue" so that the recent changes can be seen. I have tagged all recent changes as "ticket 0", even though we may need to review whether that is in fact the case.
I have picked up most of the changes discussed, although we still have a few points to settle which we can discuss at our next meeting.
Originally by dougmill on 2011-10-28 07:49:42 -0500
Since it was brought up in the 1st reading, here's an example of how TBB might be used with MPI_TEAM functions:
class ThreadedAllreduce {
    const MPI_Team _team;
    const int _size;
public:
    int result;
    void operator() (const blocked_range<int> &r) {
        int t = r.begin();
        // assert(r.end() - r.begin() == 1) ?
        MPI_Thread_attach(t, MPI_COMM_CLIQUE);
        MPI_Team_join(_size, _team);
        if (t == 0) {
            // sendbuf, recvbuf, count, datatype, op assumed declared elsewhere
            MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, MPI_COMM_WORLD);
        } else {
            // The remaining threads go directly to MPI_Team_leave
        }
        MPI_Team_leave(_team);
        MPI_Thread_attach(0, MPI_COMM_CLIQUE);
        result = 0;
    }
    ThreadedAllreduce(ThreadedAllreduce &x, split) :
        _team(x._team),
        _size(x._size),
        result(1)
    {}
    void join(ThreadedAllreduce &y) {
        result |= y.result;
    }
    ThreadedAllreduce(MPI_Team team, int size) :
        _team(team),
        _size(size),
        result(1)
    {}
};

int main(int argc, char **argv) {
    int provided, N;
    MPI_Team team;
    MPI_Info info;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_size(MPI_COMM_CLIQUE, &N);
    MPI_Info_create(&info);
    MPI_Info_set(info, "balanced", "true");
    MPI_Team_create(N, info, &team);
    MPI_Info_free(&info);
    ThreadedAllreduce thar(team, N);
    parallel_for(blocked_range<int>(0, N, 1), thar,
                 simple_partitioner());
    MPI_Team_free(&team);
    MPI_Finalize();
    return thar.result;
}
Originally by dougmill on 2011-10-28 07:53:47 -0500
Attachment added: mpi3-ticket217-motivation.pdf
(59.8 KiB)
Motivational Slides for Ticket 217
Originally by dougmill on 2011-10-28 07:55:38 -0500
Updated motivation slides, in case they become relevant to discussions.
Originally by dougmill on 2011-10-28 08:16:26 -0500
Here are items from the 1st reading that need further discussion (potential agenda items for next meeting):
Originally by dougmill on 2011-11-03 09:43:04 -0500
I have committed this new section into a side subdir:
https://svn.mpi-forum.org/svn/mpi-forum-docs/trunk/working-groups/mpi-3/ex-intfc/ticket-217
If you check out that tree, "cd chap-ei", and type "make", you will get a PDF of the External Interfaces chapter that contains this proposal. This is the document version as posted on 10/27/2011, svn revision 794.
Originally by dougmill on 2011-11-03 09:48:01 -0500
Regarding the discussion of the synchronization details in the info arg 'balanced': I still don't see how it makes sense to move that paragraph above the info arg, since the function being documented there is MPI_TEAM_CREATE, which does not synchronize, and the JOIN and LEAVE functions have their own synchronization information. The paragraph does seem pertinent to the balanced key, but perhaps is not necessary. I think the more important information is that MPI_TEAM_BREAK is erroneous (not used) in that case, and there should be some indication that the user is converging all threads in each JOIN-LEAVE block, and that the performance of a block depends on all threads arriving promptly at the JOIN - effectively thinking of the JOIN as synchronizing.
Originally by dougmill on 2011-12-12 10:42:58 -0600
I've attached the latest version of the External Interfaces chapter including this proposal for ticket #217. There are two versions; the "-ep" version includes an example that is only relevant if Ticket #288 (Endpoints) is also accepted.
Originally by jhammond on 2011-12-24 09:08:04 -0600
Hi Doug,
The script you provided to build on BGP fails at the MPICH2 build step on Surveyor.
I wouldn't mind testing this, especially for adverse side effects in threaded applications.
Jeff
Originally by dougmill on 2011-12-27 11:27:01 -0600
I'm not familiar with the build process on surveyor, and don't have access to that machine. You should be able to use any "normal" MPICH2 build process, as the code should just be inserted/integrated into the normal MPICH2 makefiles.
Originally by RolfRabenseifner on 2012-01-10 09:53:13 -0600
Some proposal that should be integrated for better readability, based on attachment:mpi3-ticket217-ei-2-ep.pdf
MPI_TEAM_JOIN, page 17, line 35, please add to the Advice to users:
To use the helper threads in collective algorithms, an MPI library is allowed to block all joined threads within the next call to MPI_TEAM_LEAVE or any other MPI call (which should be supported by this team) until all joined threads have called an MPI routine, with one exception: a call to MPI_TEAM_BREAK must not be blocked.
p19:35 reads
{
but should read
{ /* or pragma omp single nowait */
After p19:38, one should add:
/* There must not be an omp barrier between the calls to MPI_Allreduce (by one thread of the team) and MPI_Team_leave (by all other threads of the team). Such a barrier may cause a deadlock because the MPI library is allowed to block until all members of the team have called an MPI routine. */
p20:26 reads
if (t=0) {
but should read
if (t=0) /* or pragma omp master or pragma omp single nowait */ {
Originally by RolfRabenseifner on 2012-01-10 11:46:23 -0600
Text update to previous comment:
MPI_TEAM_JOIN, page 17, line 35, please add to the Advice to users:
To use the helper threads in collective algorithms, an MPI library is allowed to block all joined threads within an MPI process in the next subsequent call to MPI_TEAM_LEAVE (which defines the calling thread as a helper thread) or any other MPI call (which defines the calling thread as one to be supported by the helper threads) until all joined threads have called an MPI routine, with one exception: a call to MPI_TEAM_BREAK must not be blocked. Note that several threads may call MPI routines that should be supported by the same team of helper threads. If the team size includes the supported threads, then they have to call MPI_TEAM_LEAVE, too, after the supported MPI routine returns. Also note that the helper threads must not be prevented from calling MPI_TEAM_LEAVE, because otherwise a deadlock may occur; see Example 12.3.
Originally by jsquyres on 2012-01-10 15:08:33 -0600
Doug says the implementation is completed.
Originally by dougmill on 2012-01-11 08:39:42 -0600
I updated the proposal doc with the latest feedback. I renamed "balanced" to "nobreak". The proposal text is now BLUE with changes made today in purple/red for delete/add. I did not rename JOIN/LEAVE yet.
Originally by dougmill on 2012-02-01 15:35:08 -0600
Attachment added: mpi3-ticket217-examples.pdf
(37.0 KiB)
Some diagrams to help discussions
Originally by dougmill on 2012-02-01 15:36:51 -0600
Updated the proposal with a few small changes. Added some slides for possible discussions.
Originally by dougmill on 2012-02-03 16:58:08 -0600
Attachment added: mpi3-ticket217-ei-2.2.pdf
(227.2 KiB)
Ticket 217 additions to External Interfaces chapter
Originally by dougmill on 2012-02-03 16:59:26 -0600
Updated the document to reflect the "two-function" design, in which MPI_TEAM_LEAVE takes an "option" argument that specifies the semantics of leaving the team epoch.
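As an illustration of the two-function design (a sketch only; the MPI_Team_* calls and the option names MPI_TEAM_REJOIN/MPI_TEAM_DEFAULT are the proposed API from this ticket, not standard MPI, and the exact argument order is assumed), a SYNC becomes simply a LEAVE with a rejoin option:

```
/* Sketch of the proposed two-function interface; not standard MPI. */
MPI_Team_join(size, team);
/* ... first team epoch ... */
MPI_Team_leave(team, MPI_TEAM_REJOIN);   /* leave and immediately rejoin */
/* ... second team epoch ... */
MPI_Team_leave(team, MPI_TEAM_DEFAULT);  /* final leave; option name assumed */
```

This folds the earlier separate MPI_TEAM_SYNC call into MPI_TEAM_LEAVE as one of its option values.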
Originally by dougmill on 2012-02-13 09:15:24 -0600
Uploaded the wrong file previously. This one has the latest changes for the two-function interface.
Originally by dougmill on 2012-02-13 14:46:38 -0600
reviewed current text, recently changed to define "team epoch" and convert to the two-function interface.
Originally by dougmill on 2012-02-15 16:29:40 -0600
Attached new version of doc, with above changes.
Originally by moody20 on 2012-03-05 17:31:34 -0600
ticket 0:
Originally by RolfRabenseifner on 2012-03-05 18:34:21 -0600
As discussed during the March 2012 meeting, related to attachment:mpi3-ticket217-ei-2.pdf :
In the definition of MPI_TEAM_LEAVE p18:10-11, the major functionalities should be presented, i.e., the sentence p16:16-18
Unless the option MPI_TEAM_BREAK is used, a thread can exit from the MPI_TEAM_LEAVE call only after all threads participating in the team epoch have called MPI_TEAM_LEAVE.
should be repeated.
It should be clear that
"One team may support several threads issuing MPI calls."
One should modify this sentence for the case that #310 and #311 are accepted:
"One team may support several threads issuing MPI calls. In the case of several MPI processes belonging to the same address space, the threads supported by one team must belong to the same MPI process."
I have one additional question:
Originally by dougmill on 2012-03-06 08:55:34 -0600
Attachment added: mpi3-ticket217-ei-2.pdf
(228.1 KiB)
Ticket 217 additions to External Interfaces chapter
Originally by dougmill on 2012-03-06 08:56:42 -0600
Updated document with changes from Formal Reading. I think these are all ticket-0 changes.
Originally by dougmill on 2012-03-06 09:11:13 -0600
I still maintain that Helper Teams should not place any restrictions on the use of (multiple) MPI processes. Multiple MPI processes have semantics that are independent of this; a team should be allowed to use that feature of their MPI implementation if desired.
Re: Rolf's question, MPI_TEAM_REJOIN provides synchronization between the team members while making progress on MPI calls. Consider this code:
if (thread == master) MPI_Barrier(MPI_COMM_WORLD);
pthread_barrier_wait(&sync_point1);
The above code may hang if the MPI_Barrier depends on help from other threads. Using the following instead avoids the hang.
if (thread == master) MPI_Barrier(MPI_COMM_WORLD);
MPI_Team_leave(MPI_TEAM_REJOIN);
Regarding adding more text to explain the capabilities, I don't believe this is necessary. The opening paragraph states that the team handles multiple MPI operations initiated by one or more threads in the team, so that should be sufficient. The idea is that a team epoch may involve any valid set of MPI operations. More examples could be added if the Forum decides that is helpful, but I don't think that should hold up this version of the proposal. It is always the case that additional, clarifying examples can be added later; but if adding them would delay voting on this proposal, then I would resist.
Originally by dougmill AT us DOT ibm DOT COOM on 2010-03-17 12:42:50 -0500
This is an outgrowth of Ticket #214 et al. It is now an independent proposal for consideration, possibly even for pre-MPI3 versions of the standard. See the attached document for the proposal.