mpi-forum / mpi-forum-historic

Migration of old MPI Forum Trac Tickets to GitHub. New issues belong on mpi-forum/mpi-issues.
http://www.mpi-forum.org

MPI3 Hybrid Programming: Proposal for Helper Threads #217

Closed mpiforumbot closed 8 years ago

mpiforumbot commented 8 years ago

Originally by dougmill AT us DOT ibm DOT COOM on 2010-03-17 12:42:50 -0500


This is an outgrowth of Ticket #214 et al. This is now an independent proposal for consideration, possibly even for pre-MPI3 versions of the standard. See attached document for proposal.

mpiforumbot commented 8 years ago

Originally by dougmill AT us DOT ibm DOT COOM on 2010-03-17 12:45:55 -0500


This document is very rough and almost certainly needs help to conform to forum document standards.

mpiforumbot commented 8 years ago

Originally by dougmill AT us DOT ibm DOT COOM on 2010-04-12 08:17:58 -0500


v0.4: Fixed a typo in examples, and added several paragraphs preceding the PROPOSAL API section to help clarify the concept.

mpiforumbot commented 8 years ago

Originally by dougmill AT us DOT ibm DOT COOM on 2010-06-15 08:14:55 -0500


Attachment added: mpi3_hybrid_20100614.pdf (33.2 KiB) Topics for discussion/resolution at next meeting

mpiforumbot commented 8 years ago

Originally by dougmill AT us DOT ibm DOT COOM on 2010-10-26 13:05:48 -0500


Attachment added: mpi3_helperthreads.pdf (80.8 KiB) Proposal for Helper Threads, V0.7

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-02-16 11:11:53 -0600


The proposal is currently being maintained in the "Extended Interfaces" section of the MPI Spec. See attachments (ei-2-vX.Y.pdf) on the MPI3 Hybrid main wiki for latest.

MPI3 Hybrid Wiki

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-03-02 09:55:35 -0600


I've got a sample implementation ready now. It is based on BG/P, DCMF, and MPICH2. This code only takes advantage of parallelism provided by the BG/P Collective Network device. For that reason, the test program only uses MPI_Allreduce to demonstrate performance differences. This code includes some changes to DCMF that expose hooks into the BG/P Collective Network parallelism features, and as such is provided as-is and is not intended for production use.

The MPI implementation is provided as MPIX extensions to the DCMFd component of MPICH2. Examining the patch file in comm/lib/mpich2/mpix_helperthreads.patch is useful in gaining an overview of what changed and what was added.

To get the code and view it (if you don't intend to build and run), you can use this script:

#!/bin/sh
set -e

DIR=${HOME}/mpix           # where you want to put the code
DCMF_TAR=${DIR}_downloads  # where external pkgs are downloaded

GITREPO=http://dcmf.anl-external.org/dcmf.git
BRANCH=MPI3-Forum-ticket217-bgp

mpich_url=http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.1/mpich2-1.1.tar.gz

export DCMF_SUBDIRS=mpich2
export DCMF_TAR
rm -rf ${DIR}
mkdir -p ${DIR}
if [ ! -d "${DCMF_TAR}" ]; then
        mkdir -p ${DCMF_TAR}
        wget --directory-prefix=${DCMF_TAR} ${mpich_url}
fi
cd ${DIR}
git clone -n ${GITREPO} comm
cd comm
git checkout -b ${BRANCH} origin/${BRANCH}
cd lib
./configure
patch -p1 --force --directory=dev < mpich2/mpix_helperthreads.patch

If you have access to a BG/P FEN and want to build/run the code and/or test program, the following script should extract the code and build the libraries:

#!/bin/sh
set -e

DIR=${HOME}/mpix_helperthreads  # where you want to build
FLOOR=/bgsys/drivers/ppcfloor   # BG/P installation
DCMF_TAR=/bgsys/downloads/comm  # where external pkgs are downloaded

GITREPO=http://dcmf.anl-external.org/dcmf.git
BRANCH=MPI3-Forum-ticket217-bgp

mpich_url=http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.1/mpich2-1.1.tar.gz

export DCMF_SUBDIRS=mpich2
export DCMF_TAR
rm -rf ${DIR}
mkdir -p ${DIR}
if [ ! -d "${DCMF_TAR}" ]; then
        mkdir -p ${DCMF_TAR}
        wget --directory-prefix=${DCMF_TAR} ${mpich_url}
fi
cd ${DIR}
ln -s ${FLOOR}/arch .
ln -s ${FLOOR}/runtime .
git clone -n ${GITREPO} comm
cd comm
git checkout -b ${BRANCH} origin/${BRANCH}
ln -sf Make.rules.floor Make.rules
make autoconf
cd lib
./configure
patch -p1 --force --directory=dev < mpich2/mpix_helperthreads.patch
cd ..
make lib

Both of the above scripts require that Git is installed. The second (build) script requires a system that has the BG/P software environment installed.

To make the test program and run it (after building libraries), use the following commands:

cd build/mpich2/dcmf-8aint/test/mpid/dcmfd
make
cp mpix_helper_test_pthread /bgp/run/dir
cd /bgp/run/dir
mpirun -mode SMP ... -cwd ${PWD} -exe ${PWD}/mpix_helper_test_pthread

Add the command-line option "--nojoin" to disable use of MPIX_Helper_team_join/_leave. Attached is a graph of the results with, and without, JOIN/LEAVE.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-03-02 09:56:31 -0600


Attachment added: mpix_helper.png (5.5 KiB) Graph of results from test program

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-07-22 12:58:00 -0500


Attachment added: mpi3_ticket217_helper.pdf (68.0 KiB) Proposal for MPI Helper Threads

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-07-22 12:58:15 -0500


Updated proposed document to include motivation and a more-extensive example/tutorial

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-08-26 07:07:43 -0500


Updated doc to clean up wording and add another paragraph to the motivation.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-08-26 07:09:33 -0500


Attachment added: ei-2.tex (57.9 KiB) External Interfaces chapter LaTeX with ticket 217 changes

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-08-30 14:07:51 -0500


Updated doc with revised example for latest endpoints proposal details.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-09-09 08:26:35 -0500


I've done some more cleanup of the proposal, spelling corrections etc.

I also added an alternate proposal that allows MPI_TEAM_JOIN to specify the "working size" of the team while MPI_TEAM_CREATE specifies the "maximum team size". This works better with OpenMP, since OpenMP does not guarantee the number of threads that will participate until the parallel block is actually entered at run-time. This way the team can be created using some maximum size (omp_get_thread_limit()) and then each JOIN-LEAVE block selects the actual number of threads participating. However, it does require that all (matching) calls to JOIN have the same working team size, or it creates overhead to error check that.
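To make the maximum-size versus working-size distinction concrete, here is a minimal single-threaded model in plain C. It is only a sketch: the `sized_team` type, the `sized_team_*` functions and the `SIZE_*` codes are hypothetical names invented for illustration and are not the proposal's API, and a real implementation would have to make the bookkeeping thread-safe. It shows the error check that the alternate proposal would imply, namely that every matching JOIN must name the same working size and that size must not exceed the maximum given at creation.

```c
/* Hypothetical model of "maximum team size" (set at create) versus
 * "working size" (set at join). Not MPI code and not the proposal's API. */
enum { SIZE_OK = 0, SIZE_ERR_RANGE = 1, SIZE_ERR_MISMATCH = 2 };

typedef struct {
    int max_size;      /* fixed when the team is created */
    int working_size;  /* chosen by the first join of an epoch; 0 = no epoch */
    int joined;        /* number of joins not yet matched by a leave */
} sized_team;

int sized_team_create(int max_size, sized_team *t)
{
    if (max_size < 1)
        return SIZE_ERR_RANGE;
    t->max_size = max_size;
    t->working_size = 0;
    t->joined = 0;
    return SIZE_OK;
}

int sized_team_join(sized_team *t, int working_size)
{
    /* The working size may never exceed the size given at creation. */
    if (working_size < 1 || working_size > t->max_size)
        return SIZE_ERR_RANGE;
    if (t->joined == 0)
        t->working_size = working_size;       /* first join opens the epoch */
    else if (working_size != t->working_size)
        return SIZE_ERR_MISMATCH;             /* matching joins must agree */
    t->joined++;
    return SIZE_OK;
}

void sized_team_leave(sized_team *t)
{
    /* The last leave closes the epoch, so the next join may pick a new size. */
    if (t->joined > 0 && --t->joined == 0)
        t->working_size = 0;
}
```

In this model a team created with a maximum of 8 accepts two joins with working size 4, rejects a third join that asks for 3, and accepts a different working size again once both threads have left. The comparison in `sized_team_join` is the per-call overhead mentioned above.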

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-09-13 07:50:36 -0500


Since we have not been able to discuss this in a meeting, I've set up a Straw Vote to decide if we should carry forward this proposal.

Please review the latest attachment (mpi3-ticket217-ei-2.pdf) and vote accordingly. Especially if you vote "No", please post why you object or what it is you object to. Thanks.

Note, you will need to login to the wiki before you can see, or cast, votes.

Poll: Should the "Teams" (Helper Threads) proposal be carried forward to a Forum reading and vote? (Yes / No)

Also, please take a look at the "alternate" proposal that makes the JOIN team size dynamic, and give a quick Yes/No on that as well.

Poll: Is the dynamic team size proposal worth considering? (Yes / No)

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-09-16 15:32:41 -0500


I've started working with a prototype implementation for ticket #288 (MPI Endpoints) and this. The test program I'm currently using shows why it is highly desirable to have something like MPI_TEAM_SYNC, which is functionally equivalent to MPI_TEAM_LEAVE followed immediately by MPI_TEAM_JOIN. I'd like to add that back into the proposal, and have updated the docs accordingly.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-09-16 15:33:36 -0500


Attachment added: mpi3-ticket217-ei-2-alt.pdf (238.0 KiB) Alternate proposal for dynamic team size

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-09-16 15:55:52 -0500


The situation that led to the desire for MPI_TEAM_SYNC was some OpenMP code like this:

#pragma omp parallel
{
    MPI_Team_join(team);
    ...
    do_function(...);
    ...
    MPI_Team_leave(team);
}

void do_function(...)
{
    ...
#   pragma omp barrier
#   pragma omp master
    {
        MPI_Barrier(MPI_COMM_WORLD);
    }
#   pragma omp barrier
    ...
}

In this situation, especially with MPI_INFO balanced=true, the non-master threads need to participate in the MPI_Barrier, even though they are not officially part of it. This could be accomplished using:

...

void do_function(...)
{
    ...
#   pragma omp barrier
#   pragma omp master
    {
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Team_leave(team);
    MPI_Team_join(team);
    ...
}

But it would be cleaner looking, easier to follow, and more efficient if written as:

...

void do_function(...)
{
    ...
#   pragma omp barrier
#   pragma omp master
    {
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Team_sync(team);
    ...
}

Note, the second omp barrier becomes redundant with MPI_TEAM_LEAVE or MPI_TEAM_SYNC.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-06 07:59:03 -0500


Pavan and Jim gave a "yes" vote to proceed, but have reservations about the actual interface. We need to get some more detail on what about the interface is objectionable, and what a better interface would be like.

Please post to this ticket what you don't like about the interface, and/or what you think a better interface would be.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-18 15:50:54 -0500


I've added a newer version of the document. This includes the "alternate" proposal for MPI_TEAM_JOIN as well, separated and in blue.

This is the version we should review on Friday, to have a first reading at the Forum next week (Oct 24-26).

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-21 16:36:23 -0500


I've updated the document based on Pavan's feedback today. This is what I expect to read at the Forum on Wednesday. I've also attached some slides to be used as introduction and motivation.

mpiforumbot commented 8 years ago

Originally by moody20 on 2011-10-26 10:26:16 -0500


A bunch of Ticket 0 issues:

p16, 5 Does "hands-over" really need the '-'?

p16, 7 "outsome" --> "outcome"

p16, 12 "The Maximum" --> "The maximum"

p16, 24 "(ever)" --> "ever"

p17, 24 "(or MPI_TEAM_BREAK)" --> "or MPI_TEAM_BREAK"

p18, 24 MPI_TEAM_SYNC seems to be another way to LEAVE, so this call should be added to text that lists the ways out of a JOIN.

p18, 39 expand "PEs per process" to "hardware processing elements per MPI process"

p19, 28 add MPI_INFO_FREE call to not leak info object

p20, 20 "it's" --> "its"

be sure that any info keys are listed in A.1.5

mpiforumbot commented 8 years ago

Originally by jsquyres on 2011-10-26 11:11:44 -0500


From the reading of mpi3-ticket217-motivation.pdf on 26 Oct 2011:

mpiforumbot commented 8 years ago

Originally by moody20 on 2011-10-26 11:33:06 -0500


Since "balanced" really means "a promise not to call break", something like "no_break" is a better info key name. "balanced" does not really mean the work will be evenly distributed among the threads anyway, so the existing name is misleading and "no_break" is more descriptive for the user.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-27 11:32:02 -0500


I have uploaded a new version of the document.

I have temporarily demoted the ticket 217 text/section to "blue" so that the recent changes can be seen. I have tagged all recent changes as "ticket 0" even though we may need to review whether that is in fact the case.

I have picked up most of the changes discussed, although we still have a few points to settle which we can discuss at our next meeting.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-28 07:49:42 -0500


Since it was brought up in the 1st reading, here's an example of how TBB might be used with MPI_TEAM functions:

class ThreadedAllreduce {
        const MPI_Team _team;
        const int _size;
public:
        int result; 

        void operator() (const blocked_range<int> &r) {
                int t = r.begin();
                // assert(r.end() - r.begin() == 1) ? 
                MPI_Thread_attach(t, MPI_COMM_CLIQUE);
                MPI_Team_join(_size, _team); 
                if (t == 0) { 
                        MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, MPI_COMM_WORLD);
                }
                else {
                        // The remaining threads go directly to MPI_Team_leave
                }               
                MPI_Team_leave(_team);
                MPI_Thread_attach(0, MPI_COMM_CLIQUE);
                result = 0;
        }

        ThreadedAllreduce(ThreadedAllreduce &x, split) :
                _team(x._team),
                _size(x._size),
                result(1)
        {}              

        void join(ThreadedAllreduce &y) {
                result |= y.result;
        }               

        ThreadedAllreduce(MPI_Team team, int size) :
                _team(team),
                _size(size),
                result(1)
        {}              
};

int main(int argc, char **argv) {
        int provided, N;
        MPI_Team team;
        MPI_Info info;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_size(MPI_COMM_CLIQUE, &N);

        MPI_Info_create(&info);
        MPI_Info_set(info, "balanced", "true");
        MPI_Team_create(N, info, &team);
        MPI_Info_free(&info);

        ThreadedAllreduce thar(team,N);
        parallel_for(blocked_range<int>(0,N,1), thar,
                                        simple_partitioner());

        MPI_Team_free(&team);                   
        return thar.result;
}

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-28 07:53:47 -0500


Attachment added: mpi3-ticket217-motivation.pdf (59.8 KiB) Motivational Slides for Ticket 217

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-28 07:55:38 -0500


Updated motivation slides, in case they become relevant to discussions.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-10-28 08:16:26 -0500


Here are items from the 1st reading that need further discussion (potential agenda items for next meeting):

  1. rename "balanced" info key to "no_break" (or ?)
     a. Need to discuss the other implications of "balanced", w.r.t. what happens in non-TEAM MPI calls. For example, a call to MPI_Allreduce can assume that all threads will participate and so it will block until all threads arrive.
  2. It was suggested that MPI_TEAM_SYNC be listed along with "ways out of" MPI_TEAM_JOIN.
     a. Since SYNC leaves the thread joined, it isn't really a "way out".
  3. Question: "does the person freeing the team need to be a member?"
     a. I think this is covered, and adding more text just makes it confusing. Added a sentence to define when a thread is "a member of a team", which excludes CREATE and FREE.
  4. Question: "are teams refcounted?", some suggestion was made that they had to be.
     a. I would rather not specify, the text already states how FREE must work in both advice to users and implementers.
  5. Topic of blocking MPI operations, deadlock, keeping threads working - threads must yield?
     a. I don't think we should specify. I consider this proposal to be an extension of the underlying "progress engine" of an implementation and, as such, would behave according to the design of a given implementation.
  6. A long discussion ensued about whether a thread in MPI_TEAM_LEAVE might perform work for something outside of the team, and what - if anything - could be said about what might delay a thread from exiting LEAVE.
     a. Does this discussion need to continue? It was not clear just what needed to be resolved.
  7. Need some recommendation on how to "conditionally expose" the example using endpoints.
     a. Are there latex macros for this? Is there a precedent elsewhere?

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-11-03 09:43:04 -0500


I have committed this new section into a side subdir:

https://svn.mpi-forum.org/svn/mpi-forum-docs/trunk/working-groups/mpi-3/ex-intfc/ticket-217

If you check that out, "cd chap-ei" and type "make", you will get a PDF of the External Interfaces chapter that contains this proposal. This is the document version as posted on 10/27/2011, svn revision 794.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-11-03 09:48:01 -0500


Regarding the discussion on the synchronization details in the info arg 'balanced', I still don't see how it makes sense to move that paragraph above the info arg, since the function being documented there is MPI_TEAM_CREATE, which does not synchronize, and the JOIN and LEAVE functions have their own synchronization information. The paragraph does seem pertinent to the balanced key, but perhaps is not necessary. I think the more important information is that MPI_TEAM_BREAK is erroneous (not used) in that case, and there should be some indication that the user is converging all threads in each JOIN-LEAVE block and that the performance of a block depends on all threads arriving promptly at the JOIN. Effectively, the user should think of the JOIN as synchronizing.

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-12-12 10:42:58 -0600


I've attached the latest version of the External Interfaces chapter including this proposal for ticket #217. There are two versions; the "-ep" version includes an example that is only relevant if Ticket #288 (Endpoints) is also accepted.

mpiforumbot commented 8 years ago

Originally by jhammond on 2011-12-24 09:08:04 -0600


Hi Doug,

The script you provided to build on BGP fails at the MPICH2 build step on Surveyor.

I wouldn't mind testing this, especially for adverse side effects in threaded applications.

Jeff

mpiforumbot commented 8 years ago

Originally by dougmill on 2011-12-27 11:27:01 -0600


I'm not familiar with the build process on Surveyor, and don't have access to that machine. You should be able to use any "normal" MPICH2 build process, as the code should just be inserted/integrated into the normal MPICH2 makefiles.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-10 09:53:13 -0600


Some proposal that should be integrated for better readability, based on attachment:mpi3-ticket217-ei-2-ep.pdf

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-10 11:46:23 -0600


Text update to previous comment:

mpiforumbot commented 8 years ago

Originally by jsquyres on 2012-01-10 15:08:33 -0600


Doug says the implementation is completed.

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-01-11 08:39:42 -0600


I updated the proposal doc with the latest feedback. I renamed "balanced" to "nobreak". The proposal text is now BLUE with changes made today in purple/red for delete/add. I did not rename JOIN/LEAVE yet.

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-02-01 15:35:08 -0600


Attachment added: mpi3-ticket217-examples.pdf (37.0 KiB) Some diagrams to help discussions

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-02-01 15:36:51 -0600


Updated the proposal with a few small changes. Added some slides for possible discussions.

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-02-03 16:58:08 -0600


Attachment added: mpi3-ticket217-ei-2.2.pdf (227.2 KiB) Ticket 217 additions to External Interfaces chapter

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-02-03 16:59:26 -0600


Updated document to reflect "the two-function" design where MPI_TEAM_LEAVE takes an "option" argument that specifies the semantics of leaving the team epoch.

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-02-13 09:15:24 -0600


Uploaded the wrong file previously. This one has the latest changes for two-function interface.

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-02-13 14:46:38 -0600


Reviewed current text, recently changed to define "team epoch" and convert to the two-function interface.

  1. In the description for MPI_TEAM_DEFAULT: Change "The calling thread may be asked to do work..." to "The calling thread may be used by the MPI implementation to do work..." Similarly, in the description for MPI_TEAM_REJOIN.
  2. For MPI_TEAM_REJOIN, change "This option may avoid overhead releasing" to "This option may avoid the overhead of releasing"
  3. Also in REJOIN, add wording to say that it re-establishes the original teamsize, and remove that from the advice to users.
  4. In the example, declare omp shared() variables as appropriate and add a comment to the LEAVE call saying that that is where the threads contribute/perform work.

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-02-15 16:29:40 -0600


Attached new version of doc, with above changes.

mpiforumbot commented 8 years ago

Originally by moody20 on 2012-03-05 17:31:34 -0600


ticket 0:

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-03-05 18:34:21 -0600


As discussed during the March 2012 meeting, related to attachment:mpi3-ticket217-ei-2.pdf :

I have one additional question:

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-03-06 08:55:34 -0600


Attachment added: mpi3-ticket217-ei-2.pdf (228.1 KiB) Ticket 217 additions to External Interfaces chapter

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-03-06 08:56:42 -0600


Updated document with changes from Formal Reading. I think these are all ticket-0 changes.

mpiforumbot commented 8 years ago

Originally by dougmill on 2012-03-06 09:11:13 -0600


I still maintain that Helper Teams should not place any restrictions on use of (multiple) MPI processes. Multiple MPI processes have semantics that are independent of this; a team should be allowed to use that feature of their MPI implementation if desired.

Re: Rolf's question, MPI_TEAM_REJOIN provides synchronization between the team members while making progress on MPI calls. Consider this code:

if (thread == master) MPI_Barrier(MPI_COMM_WORLD);
pthread_barrier_wait(&sync_point1);

The above code may hang if the MPI_Barrier depends on help from other threads. Using the following instead avoids the hang.

if (thread == master) MPI_Barrier(MPI_COMM_WORLD);
MPI_Team_leave(MPI_TEAM_REJOIN);
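
The difference between the two snippets can be modelled without MPI. The following is a rough pthreads sketch, not the proposed interface: `helper_team`, `mock_collective`, `mock_leave_rejoin` and `run_team_epoch` are hypothetical names, and the assumption that the master's operation is split into chunks that only the other threads can execute is made up to stand in for "MPI_Barrier depends on help from other threads". The master posts the chunks and blocks until they are done, while every thread that reaches the leave call keeps claiming chunks until all team members have arrived.

```c
#include <pthread.h>

enum { MAX_THREADS = 16 };

/* Hypothetical stand-in for a team; not part of any MPI implementation. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int nthreads;    /* team size */
    int pending;     /* chunks posted by the master, not yet claimed */
    int done;        /* chunks completed by helpers */
    int arrived;     /* threads currently inside mock_leave_rejoin */
    int generation;  /* bumped once every thread has arrived */
} helper_team;

typedef struct { helper_team *team; int id; int nchunks; } thread_arg;

/* Master only: post work that helpers must execute, then wait for it. */
static void mock_collective(helper_team *t, int nchunks)
{
    pthread_mutex_lock(&t->lock);
    t->pending = nchunks;
    pthread_cond_broadcast(&t->changed);
    while (t->done < nchunks)
        pthread_cond_wait(&t->changed, &t->lock);
    pthread_mutex_unlock(&t->lock);
}

/* Barrier that keeps helping: waiting threads claim pending chunks. */
static void mock_leave_rejoin(helper_team *t)
{
    pthread_mutex_lock(&t->lock);
    int gen = t->generation;
    if (++t->arrived == t->nthreads) {
        t->arrived = 0;
        t->generation++;                 /* last arrival releases everyone */
        pthread_cond_broadcast(&t->changed);
    } else {
        while (t->generation == gen) {
            if (t->pending > 0) {
                t->pending--;            /* claim one chunk ... */
                t->done++;               /* ... and count it as finished */
                pthread_cond_broadcast(&t->changed);
            } else {
                pthread_cond_wait(&t->changed, &t->lock);
            }
        }
    }
    pthread_mutex_unlock(&t->lock);
}

static void *thread_main(void *p)
{
    thread_arg *a = p;
    if (a->id == 0)                      /* thread 0 plays the master */
        mock_collective(a->team, a->nchunks);
    mock_leave_rejoin(a->team);
    return NULL;
}

/* Runs one epoch; returns the chunks completed by helpers, or -1 on bad input. */
int run_team_epoch(int nthreads, int nchunks)
{
    if (nthreads < 2 || nthreads > MAX_THREADS || nchunks < 0)
        return -1;
    helper_team team = { .nthreads = nthreads };
    pthread_mutex_init(&team.lock, NULL);
    pthread_cond_init(&team.changed, NULL);
    pthread_t tid[MAX_THREADS];
    thread_arg arg[MAX_THREADS];
    for (int i = 0; i < nthreads; i++) {
        arg[i] = (thread_arg){ &team, i, nchunks };
        pthread_create(&tid[i], NULL, thread_main, &arg[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    pthread_cond_destroy(&team.changed);
    pthread_mutex_destroy(&team.lock);
    return team.done;
}
```

Calling `run_team_epoch(4, 8)` from a `main` completes with all 8 chunks done by the three helper threads. If the helpers in this model waited in a plain `pthread_barrier_wait` instead of `mock_leave_rejoin`, the posted chunks would never be claimed and the master would never return from `mock_collective`, which is the hang described above.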

Regarding adding more text to explain the capabilities, I don't believe this is necessary. The opening paragraph states that the team handles multiple MPI operations initiated by one or more threads in the team, so that should be sufficient. The idea is that a team epoch may involve any valid set of MPI operations. More examples could be added, if the forum decides that is helpful, but I don't think that should hold up this version of the proposal. It is always the case that additional, clarifying examples can be added later. But if adding them would delay voting on this proposal, then I resist.