Closed mpichbot closed 7 years ago
Originally by blocksom on 2013-04-12 09:26:43 -0500
Attachment added: nbc_collectives.patch
(6.4 KiB)
Originally by balaji on 2013-05-09 21:18:51 -0500
Perhaps I'm not fully understanding the problem here. Why can't your optimized implementation return a dummy schedule and ignore poking progress for that schedule?
Originally by blocksom on 2013-05-10 16:02:29 -0500
How does one "ignore poking progress for that schedule"?
When I last looked at the code the MPIDU_Sched_progress (&made_progress)
had to be called during the progress loop or else the nbc schedules would not advance and the non-blocking collectives would hang. I had to put this function in a PAMI "work function" and queue it up so that it would be called when the PAMI contexts were advanced (via PAMI_Context_advance()
.
Obviously, this adds code in the critical path for pamid. So I only enqueue this work function to advance the nbc schedules if an environment variable is set.
I don't understand what a dummy schedule for pami-optimized collectives would accomplish since this MPIDU_Sched_progress (&made_progress)
function would need to be invoked to cover the other mpich-defined non-blocking collectives.
BTW .. I pushed this commit to mpich-ibm/ticket-1808
.
Originally by balaji on 2013-05-10 20:23:29 -0500
My understanding was that you wanted to use collectives that are making progress within PAMI. Thus I suggested that pamid does not have to make special progress for them and could ignore the progress poke. But maybe I'm not understanding the problem here.
Are you saying that without NBC collectives, there is no PAMI work function that is already being executed? Or is the concern that one additional work function needs to be called now? Wouldn't this work function be queued up only when a nonblocking collective is posted?
Btw, mpich-ibm/ticket-1808
doesn't contain any IBM changes. It points to mpich-ibm/master
.
Originally by blocksom on 2013-05-13 10:29:03 -0500
Replying to balaji:
Are you saying that without NBC collectives, there is no PAMI work function that is already being executed? Or is the concern that one additional work function needs to be called now? Wouldn't this work function be queued up only when a nonblocking collective is posted?
Correct .. normally there is no "persistent" PAMI work function to handle MPICH progress. And, yes, we could add/remove the work function if there is an active MPICH NBC.
I don't really think this is a big deal - adding and removing a work function - but Dave wanted me to open this ticket to discuss the NBC design to see if it should be changed at all.
Btw,
mpich-ibm/ticket-1808
doesn't contain any IBM changes. It points tompich-ibm/master
.
Hmm .. I must have pushed the wrong branch. I'll look into this.
Originally by blocksom on 2013-05-13 10:43:32 -0500
So .. the reason the branch I pushed was the same as the mpich/master
branch is because this commit is already in the mpich/master
branch.
$ git checkout master
Switched to branch 'master'
$ git log --oneline | grep "Enable MPIR_"
d1cfa4d Enable MPIR_* non-blocking collectives implementation
I think this ticket is just to discuss the design .. not necessarily to fix a problem.
Originally by blocksom on 2013-05-13 10:44:26 -0500
I deleted the mpich-ibm/ticket-1808
branch.
Originally by balaji on 2013-05-13 11:56:47 -0500
Thanks for the clarification, Mike. Unless there is a specific action item to be done, I'm not sure this ticket should be open. I'm resolving it, but feel free to open it if there's something we need to fix-up for this.
Originally by blocksom on 2013-05-13 14:04:31 -0500
Well .. now I'm not so sure this ticket should be closed.
I've been cleaning up the differences between the mpich/master
branch and the mpich-ibm/build
branch and there is a large amount of nbc-related code that was not integrated into mpich/master
. So, unfortunately, I suspect we've been discussing the wrong commit this whole time.
I recall now that Dave wanted to discuss this because I touched several src/mpi/coll/*.c
files and the src/include/mpiimpl.h
file and he wanted to review this code to see if there was a better way to implement these nbc "hooks".
I pushed the code to the mpich-ibm/ticket-1808
branch, although we could open a new ticket instead of re-using this one and I can push the code to a different branch.
FWIW, if the commits for tickets #1808 (this one) and #1847 are git cherry-pick
'd on to mpich/master
, then there should be zero differences from the mpich-ibm/build
branch.
Originally by balaji on 2013-05-13 14:28:08 -0500
Apart from the fact that this patch introduces some (simple to fix) C99-isms, I'm also having a bit of trouble understanding what exactly it's doing.
If we just take an example of Iallgather, you are preferring the device implementation of Iallgather_optimized to that of Iallgather. With the _optimized function, there is no schedule created, so someone else is handling the progress? How does progress occur for Iallgather in this case?
Originally by blocksom on 2013-05-13 14:45:24 -0500
The way pami works, for collectives and point-to-point and whatever, is that the "work" is implicitly added to a pami context as a result of a pami function call. Sometimes there is no persistence for a pami operation if, for example, network resources are immediately available and the operation can be kicked off immediately without the need to later on come back and query status.
Progress is achieved when a thread (application or timer or interrupt or bgq-commthread or ..) advances a pami context via PAMI_Context_advance()
. This is essentially the same idea as the MPIDU_Sched_progress (&made_progress)
function. The difference is that in the code prior to this commit a schedule had to be created even though pami didn't need it. It is assumed that requiring a schedule will result in a performance penalty - although this has never been tested.
Perhaps this is what you meant by your suggestion of a "dummy" schedule? If such a dummy schedule could be constructed quickly so that it would not impact performance, then maybe that's the best way to hook in the pami optimized collectives with your existing nbc framework. And we could also add the prior suggestion of not posting the pami work function if the schedule is a "dummy" schedule.
Originally by balaji on 2013-05-13 22:25:56 -0500
It sounds like my initial understanding of the problem was correct. However, the solution I proposed might need some changes. There are three models we want to allow for NBC:
I think I might have essentially described the solution that you already provided in your patch, though I might implement it slightly differently. But the idea would be the same -- you'll have two function pointers (_regular and _optimized, but probably with different names).
Did I capture all of your concerns? Does this sound reasonable?
Originally by blocksom on 2013-05-14 09:46:10 -0500
Replying to balaji:
- The device handles the entire collective operation. No schedule is created or exposed to the MPI layer, but entirely handled by the device. Sounds like this is what you need for pamid.
I think the pamid layer needs to be able to invoke both the _optimized and _regular collectives. Sometimes the default mpich collectives perform better than any pami collective for certain topologies or datatypes or whatever - so we still need to be able to select the mpich collectives in addition to the pami collectives.
Did I capture all of your concerns? Does this sound reasonable?
Sounds good to me. Go ahead and fix up the commit as you like, but please put your changes in a separate commit on top of the existing commit so that, after both commits are pushed to mpich/master
, the eventual merge with mpich-ibm/build
goes better.
Thanks!
Originally by balaji on 2013-05-14 09:49:05 -0500
Replying to blocksom:
I think the pamid layer needs to be able to invoke both the _optimized and _regular collectives. Sometimes the default mpich collectives perform better than any pami collective for certain topologies or datatypes or whatever - so we still need to be able to select the mpich collectives in addition to the pami collectives.
The _optimized and regular functions that I mentioned above are just function pointers. You can use any of the MPIR functions as needed, just like regular blocking collectives.
Sounds good to me. Go ahead and fix up the commit as you like, but please put your changes in a separate commit on top of the existing commit so that, after both commits are pushed to
mpich/master
, the eventual merge withmpich-ibm/build
goes better.
Ok, I'll do that and push to an -alt branch for review.
Originally by balaji on 2013-05-14 20:47:00 -0500
After going back-and-forth on this for a while, I finally came around to the same solution as what you guys proposed. I've made a new branch mpich-ibm/ticket-1808-alt
, which incorporates your proposed patch plus two additional patches -- one to change the naming convention making the purpose of the function pointers clearer, and another to remove the C99-ism in the code.
Please review these patches and let me know. I'll push them into mpich/master
.
Originally by blocksom on 2013-05-15 08:53:20 -0500
Looks good to me. I compiled and tested cpi on bgq.
Originally by balaji on 2013-05-15 12:03:22 -0500
These have been pushed into mpich/master
. Relevant commits: [2eefd3a2], [57f5b7b1], [49595414].
Originally by blocksom on 2013-05-15 12:10:42 -0500
I deleted the mpich-ibm/ticket-1808
and mpich-ibm/ticket-1808-alt
branches.
Originally by blocksom on 2013-04-12 09:26:28 -0500
The existing NBC design forces the ADI layer to invoke the nbc schedule state machine if any MPIR_* nbc functions are used.
Attached is a patch that enables MPIR* nbc for the pamid ADI, however the MPIR* nbc are disabled by default. This is done because if the application chooses an optimized pami collective the advance of the nbc scheduler state is unnecessary and only results in decreased performance of the pami optimized collective.
From the commit message in the patch:
This ticket is for discussion of the current MPIR_* nbc design.