pamid adi MPIR_* non-blocking collectives implementation

mpichbot commented 7 years ago

Originally by blocksom on 2013-04-12 09:26:28 -0500

The existing NBC design forces the ADI layer to invoke the nbc schedule state machine if any MPIR_* nbc functions are used.

Attached is a patch that enables MPIR* nbc for the pamid ADI, however the MPIR* nbc are disabled by default. This is done because if the application chooses an optimized pami collective the advance of the nbc scheduler state is unnecessary and only results in decreased performance of the pami optimized collective.

From the commit message in the patch:

If the environment variable 'PAMID_MPIR_NBC' is set to non-zero then a
pami work function is posted to context 0 which will invoke the schedule
progress function.

By default, MPIR_* non-blocking collectives are disabled in order to
avoid impacting the performance of other MPI operations.

This ticket is for discussion of the current MPIR_* nbc design.

mpichbot commented 7 years ago

Originally by blocksom on 2013-04-12 09:26:43 -0500

Attachment added: nbc_collectives.patch (6.4 KiB)

mpichbot commented 7 years ago

Originally by balaji on 2013-05-09 21:18:51 -0500

Perhaps I'm not fully understanding the problem here. Why can't your optimized implementation return a dummy schedule and ignore poking progress for that schedule?

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-10 16:02:29 -0500

How does one "ignore poking progress for that schedule"?

When I last looked at the code the MPIDU_Sched_progress (&made_progress) had to be called during the progress loop or else the nbc schedules would not advance and the non-blocking collectives would hang. I had to put this function in a PAMI "work function" and queue it up so that it would be called when the PAMI contexts were advanced (via PAMI_Context_advance().

Obviously, this adds code in the critical path for pamid. So I only enqueue this work function to advance the nbc schedules if an environment variable is set.

I don't understand what a dummy schedule for pami-optimized collectives would accomplish since this MPIDU_Sched_progress (&made_progress) function would need to be invoked to cover the other mpich-defined non-blocking collectives.

BTW .. I pushed this commit to mpich-ibm/ticket-1808.

mpichbot commented 7 years ago

Originally by balaji on 2013-05-10 20:23:29 -0500

My understanding was that you wanted to use collectives that are making progress within PAMI. Thus I suggested that pamid does not have to make special progress for them and could ignore the progress poke. But maybe I'm not understanding the problem here.

Are you saying that without NBC collectives, there is no PAMI work function that is already being executed? Or is the concern that one additional work function needs to be called now? Wouldn't this work function be queued up only when a nonblocking collective is posted?

Btw, mpich-ibm/ticket-1808 doesn't contain any IBM changes. It points to mpich-ibm/master.

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-13 10:29:03 -0500

Replying to balaji:

Are you saying that without NBC collectives, there is no PAMI work function that is already being executed? Or is the concern that one additional work function needs to be called now? Wouldn't this work function be queued up only when a nonblocking collective is posted?

Correct .. normally there is no "persistent" PAMI work function to handle MPICH progress. And, yes, we could add/remove the work function if there is an active MPICH NBC.

I don't really think this is a big deal - adding and removing a work function - but Dave wanted me to open this ticket to discuss the NBC design to see if it should be changed at all.

Btw, mpich-ibm/ticket-1808 doesn't contain any IBM changes. It points to mpich-ibm/master.

Hmm .. I must have pushed the wrong branch. I'll look into this.

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-13 10:43:32 -0500

So .. the reason the branch I pushed was the same as the mpich/master branch is because this commit is already in the mpich/master branch.

$ git checkout master
Switched to branch 'master'

$ git log --oneline | grep "Enable MPIR_"
d1cfa4d Enable MPIR_* non-blocking collectives implementation

I think this ticket is just to discuss the design .. not necessarily to fix a problem.

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-13 10:44:26 -0500

I deleted the mpich-ibm/ticket-1808 branch.

mpichbot commented 7 years ago

Originally by balaji on 2013-05-13 11:56:47 -0500

Thanks for the clarification, Mike. Unless there is a specific action item to be done, I'm not sure this ticket should be open. I'm resolving it, but feel free to open it if there's something we need to fix-up for this.

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-13 14:04:31 -0500

Well .. now I'm not so sure this ticket should be closed.

I've been cleaning up the differences between the mpich/master branch and the mpich-ibm/build branch and there is a large amount of nbc-related code that was not integrated into mpich/master. So, unfortunately, I suspect we've been discussing the wrong commit this whole time.

I recall now that Dave wanted to discuss this because I touched several src/mpi/coll/*.c files and the src/include/mpiimpl.h file and he wanted to review this code to see if there was a better way to implement these nbc "hooks".

I pushed the code to the mpich-ibm/ticket-1808 branch, although we could open a new ticket instead of re-using this one and I can push the code to a different branch.

FWIW, if the commits for tickets #1808 (this one) and #1847 are git cherry-pick'd on to mpich/master, then there should be zero differences from the mpich-ibm/build branch.

mpichbot commented 7 years ago

Originally by balaji on 2013-05-13 14:28:08 -0500

Apart from the fact that this patch introduces some (simple to fix) C99-isms, I'm also having a bit of trouble understanding what exactly it's doing.

If we just take an example of Iallgather, you are preferring the device implementation of Iallgather_optimized to that of Iallgather. With the _optimized function, there is no schedule created, so someone else is handling the progress? How does progress occur for Iallgather in this case?

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-13 14:45:24 -0500

The way pami works, for collectives and point-to-point and whatever, is that the "work" is implicitly added to a pami context as a result of a pami function call. Sometimes there is no persistence for a pami operation if, for example, network resources are immediately available and the operation can be kicked off immediately without the need to later on come back and query status.

Progress is achieved when a thread (application or timer or interrupt or bgq-commthread or ..) advances a pami context via PAMI_Context_advance(). This is essentially the same idea as the MPIDU_Sched_progress (&made_progress) function. The difference is that in the code prior to this commit a schedule had to be created even though pami didn't need it. It is assumed that requiring a schedule will result in a performance penalty - although this has never been tested.

Perhaps this is what you meant by your suggestion of a "dummy" schedule? If such a dummy schedule could be constructed quickly so that it would not impact performance, then maybe that's the best way to hook in the pami optimized collectives with your existing nbc framework. And we could also add the prior suggestion of not posting the pami work function if the schedule is a "dummy" schedule.

mpichbot commented 7 years ago

Originally by balaji on 2013-05-13 22:25:56 -0500

It sounds like my initial understanding of the problem was correct. However, the solution I proposed might need some changes. There are three models we want to allow for NBC:

The MPI layer handles the entire collective schedule. It makes its own progress. Device ignores the collective. This model is somewhat setup in the current code, in that the different schedules are implemented. But the MPI layer expects the device to always override the function pointer, even though it might just point to the MPI-level implementation of the function. This probably needs to be cleaned up a bit.
The device creates the schedule, but the MPI layer handles the actual progress. This model also seems to be currently present. The device can choose to just use the "default" schedule provided by the MPI layer, which falls back to option 1.
The device handles the entire collective operation. No schedule is created or exposed to the MPI layer, but entirely handled by the device. Sounds like this is what you need for pamid.

I think I might have essentially described the solution that you already provided in your patch, though I might implement it slightly differently. But the idea would be the same -- you'll have two function pointers (_regular and _optimized, but probably with different names).

Did I capture all of your concerns? Does this sound reasonable?

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-14 09:46:10 -0500

Replying to balaji:

The device handles the entire collective operation. No schedule is created or exposed to the MPI layer, but entirely handled by the device. Sounds like this is what you need for pamid.

I think the pamid layer needs to be able to invoke both the _optimized and _regular collectives. Sometimes the default mpich collectives perform better than any pami collective for certain topologies or datatypes or whatever - so we still need to be able to select the mpich collectives in addition to the pami collectives.

Did I capture all of your concerns? Does this sound reasonable?

Sounds good to me. Go ahead and fix up the commit as you like, but please put your changes in a separate commit on top of the existing commit so that, after both commits are pushed to mpich/master, the eventual merge with mpich-ibm/build goes better.

Thanks!

mpichbot commented 7 years ago

Originally by balaji on 2013-05-14 09:49:05 -0500

Replying to blocksom:

I think the pamid layer needs to be able to invoke both the _optimized and _regular collectives. Sometimes the default mpich collectives perform better than any pami collective for certain topologies or datatypes or whatever - so we still need to be able to select the mpich collectives in addition to the pami collectives.

The _optimized and regular functions that I mentioned above are just function pointers. You can use any of the MPIR functions as needed, just like regular blocking collectives.

Sounds good to me. Go ahead and fix up the commit as you like, but please put your changes in a separate commit on top of the existing commit so that, after both commits are pushed to mpich/master, the eventual merge with mpich-ibm/build goes better.

Ok, I'll do that and push to an -alt branch for review.

mpichbot commented 7 years ago

Originally by balaji on 2013-05-14 20:47:00 -0500

After going back-and-forth on this for a while, I finally came around to the same solution as what you guys proposed. I've made a new branch mpich-ibm/ticket-1808-alt, which incorporates your proposed patch plus two additional patches -- one to change the naming convention making the purpose of the function pointers clearer, and another to remove the C99-ism in the code.

Please review these patches and let me know. I'll push them into mpich/master.

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-15 08:53:20 -0500

Looks good to me. I compiled and tested cpi on bgq.

mpichbot commented 7 years ago

Originally by balaji on 2013-05-15 12:03:22 -0500

These have been pushed into mpich/master. Relevant commits: [2eefd3a2], [57f5b7b1], [49595414].

mpichbot commented 7 years ago

Originally by blocksom on 2013-05-15 12:10:42 -0500

I deleted the mpich-ibm/ticket-1808 and mpich-ibm/ticket-1808-alt branches.

pmodels / mpich

pamid adi MPIR_* non-blocking collectives implementation #1808