nerscadmin / IPM

Integrated Performance Monitoring for High Performance Computing
http://ipm-hpc.org
GNU Lesser General Public License v2.1

Bug in IPM topology and subcommunicators #13

Closed: brianmaustin closed this issue 7 years ago

brianmaustin commented 8 years ago

When IPM_LOG=full, IPM will record the source and destination of each message. The source rank is relative to MPI_COMM_WORLD. The destination rank is relative to the communicator used to send the message. Any topology maps based on this data will be incorrect. For example, the spyplot for GTC should be a banded diagonal matrix, but instead, all of the P2P messages appear in the first block column.

Unfortunately, I don't know of a simple way to fix this. It would require mapping the dest rank on a subcommunicator to the rank of the same process on COMM_WORLD. I'm not aware of any user-level functions in MPI that would do this.

nerscadmin commented 8 years ago

Without diving too deeply into this, the overall (long-ago) philosophy about sub-communicators was to flatten them into COMM_WORLD, so that rank keeps a common meaning across all operations. Can you recover the destination rank through some translation? Calling MPI_Comm_rank is "low" on the performance overhead scale, but there may be a more direct way.

swfrench commented 8 years ago

I think what we're looking for in this case is MPI_Group_translate_ranks. This could be used to map the source or destination rank back to MPI_COMM_WORLD.

Not sure what the performance implications of calling MPI_Group_translate_ranks regularly might be, so you'd probably want a fast path around it (this requires a lightweight test for whether we're using MPI_COMM_WORLD; not sure offhand if MPI_Comm_compare fits the bill in that regard, but it might be all we have that's not implementation-specific). The additional overhead from calling MPI_Comm_group on the slow path should be cheap, I'd bet.
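
For concreteness, a minimal sketch of what such a translation helper could look like, assuming the fast-path check via MPI_Comm_compare suggested above (the function name translate_to_world is illustrative, not an existing IPM identifier):

    #include <mpi.h>

    /* Map `rank` on `comm` to the corresponding rank in MPI_COMM_WORLD. */
    static int translate_to_world(MPI_Comm comm, int rank)
    {
        int result, world_rank;
        MPI_Group group, world_group;

        /* Fast path: a communicator identical or congruent to
         * MPI_COMM_WORLD already uses world ranks. */
        MPI_Comm_compare(comm, MPI_COMM_WORLD, &result);
        if (result == MPI_IDENT || result == MPI_CONGRUENT)
            return rank;

        /* Slow path: translate between the two groups.  In practice the
         * world group would be obtained once at init and cached. */
        MPI_Comm_group(comm, &group);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Group_translate_ranks(group, 1, &rank, world_group, &world_rank);
        MPI_Group_free(&group);
        MPI_Group_free(&world_group);
        return world_rank;
    }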

deskinner commented 8 years ago

Hi Scott,

It'd be useful to benchmark group_translate. If it's lightweight then it could be a good solution.

-David

swfrench commented 8 years ago

Thanks, David - indeed, it could very well be quite cheap.

In the meantime, I thought about this a bit this afternoon, and I think there might be some issues for non-blocking communication...

Example: We can easily infer the rank of a peer in MPI_COMM_WORLD when intercepting a call to MPI_Isend or MPI_Irecv. However, the same is not true for the subsequent MPI_Wait. The peer rank is available through MPI_Status, but the communicator under which it is defined is not (without non-portable poking around in the MPI_Request object, or doing something fancy in the IPM core library to track communicators for outstanding request objects).

Of course, I might be missing something obvious... If anyone has any suggestions, I'm open to them.
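
To illustrate the last option, a very rough sketch of tracking communicators for outstanding requests might look like the following; the table and helper names are purely hypothetical, not existing IPM code:

    #include <mpi.h>

    /* Hypothetical fixed-size table mapping outstanding requests to the
     * communicator they were issued on.  A real implementation would want
     * a proper hash table plus thread safety. */
    #define MAX_PENDING 1024
    static MPI_Request pending_req[MAX_PENDING];
    static MPI_Comm    pending_comm[MAX_PENDING];
    static int         num_pending = 0;

    /* Called from the MPI_Isend / MPI_Irecv wrappers. */
    static void remember_request(MPI_Request req, MPI_Comm comm)
    {
        if (num_pending < MAX_PENDING) {
            pending_req[num_pending]  = req;
            pending_comm[num_pending] = comm;
            num_pending++;
        }
    }

    /* Called from the MPI_Wait wrapper: recover (and drop) the
     * communicator associated with a completed request. */
    static MPI_Comm recall_request(MPI_Request req)
    {
        for (int i = 0; i < num_pending; i++) {
            if (pending_req[i] == req) {
                MPI_Comm comm   = pending_comm[i];
                pending_req[i]  = pending_req[num_pending - 1];
                pending_comm[i] = pending_comm[num_pending - 1];
                num_pending--;
                return comm;
            }
        }
        return MPI_COMM_NULL; /* unknown request */
    }

Note that the MPI_Wait wrapper would have to capture the request handle before calling PMPI_Wait, since the latter resets it to MPI_REQUEST_NULL.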

nerscadmin commented 8 years ago

I should have been more clear about "lightweight". If the logic is local and translate_ranks does no off-node communication, that's one thing. If it does communication to resolve the answer, that's another. Hoping for the former.

swfrench commented 8 years ago

Thanks for the follow-up, David. Indeed, off-node communication is definitely something we would want to avoid. Much faster, but still less than ideal, would be some sort of O(P) local lookup.

One would hopefully implement MPI_Group_translate_ranks to stay completely on-node and use something O(1) to represent the mapping, but who knows... Your earlier suggestion of benchmarking stands as the best course of action =)

swfrench commented 8 years ago

Okay, some more thoughts:

It looks like MPICH does indeed fall back to an O(P) local lookup in MPI_Group_translate_ranks in the general case. That said, we can avoid calling MPI_Group_translate_ranks entirely in the (extremely common) case of MPI_COMM_WORLD.

Except in the case where the peer is MPI_ANY_SOURCE, we get correct peer ranks (as we have the communicator available to us) and the tx / rx byte count from the original invocation of the non-blocking operation. Thus, we could for the most part ignore the matching MPI_Wait w.r.t. computing adjacency. Indeed, it looks like we already do this for MPI_Waitany and co. We still, of course, log the number of calls to, elapsed time in, etc. for MPI_Wait.

The tradeoff in this case is that for applications that make extensive use of MPI_ANY_SOURCE and non-blocking calls, we won't record those edges in the adjacency graph. That said, the matching call on the peer rank we will likely catch (i.e. same edge, opposite direction).

@brianmaustin - Do you think this would be alright for your use case? If so, could you please give the HEAD commit of subcomm-ranks-issue branch a try?

brianmaustin commented 8 years ago

There has been a parallel private thread about this among Nick, Glenn, Doug, and me at NERSC. I'll summarize it here, although Scott's fix may have already addressed this.

In that thread, Nick noted that

IPM v0.983 (the last version before the rewrite) does this correctly. If you pull that from SourceForge (http://ipm-hpc.sourceforge.net/), you should be able to see the logic.

    #define IPM_MPI_GETGLOBALRANK_C { \
      PMPI_Comm_compare(comm, MPI_COMM_WORLD, &ntypes); \
      if(ntypes == MPI_IDENT || ntypes == MPI_CONGRUENT) { \
        iirank = irank; \
      } else { \
        if(irank >= 0) { \
          iirank = irank; \
          PMPI_Comm_group(comm, &igroup); \
          PMPI_Group_translate_ranks(igroup, 1, &iirank, ipm_world_group, &irank); \
          /* printf("global %d %d\n", irank, iirank); */ \
        } \
      } \
    }

It's in ipm_mpi.h.

Nick does not remember noticing any performance costs for doing the translation this way, but agrees that it would be good to verify. He also pointed out the call to Comm_compare, which may help reduce the overhead.

I also suggested that we could do the translations ourselves by calling MPI_Comm_rank( MPI_COMM_WORLD, &world_rank ); MPI_Allgather( &world_rank, 1, MPI_INT, world_rank_list, 1, MPI_INT, new_comm ); whenever a new comm is created. After that, translating by world_rank_list would be essentially free. This would also allow us to see which processes were involved in subcommunicator collectives.
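
A rough sketch of that idea (the helper and variable names here are illustrative, not existing IPM code) could run in the communicator-creation wrappers:

    #include <mpi.h>
    #include <stdlib.h>

    /* Build the sub-rank -> world-rank map for a newly created
     * communicator: world_rank_list[r] is the MPI_COMM_WORLD rank of
     * rank r in new_comm.  A real implementation would key this table
     * on the communicator handle. */
    static int *build_world_rank_list(MPI_Comm new_comm)
    {
        int world_rank, comm_size;
        int *world_rank_list;

        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(new_comm, &comm_size);
        world_rank_list = malloc(comm_size * sizeof(int));

        /* Every member of new_comm contributes its world rank, so after
         * the Allgather each rank holds the full mapping. */
        MPI_Allgather(&world_rank, 1, MPI_INT,
                      world_rank_list, 1, MPI_INT, new_comm);
        return world_rank_list;
    }

Translating a destination rank d on new_comm then reduces to the lookup world_rank_list[d], at the cost of one collective per communicator creation.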

@swfrench - I think that what you proposed is sufficient for my needs... at the moment, I'm only looking at p2p data volumes, which I can get from the send data alone. I'll test your fix now.

-Brian

swfrench commented 8 years ago

Thanks for looping me in, Brian. Good to see that something similar to the rough version in subcomm-ranks-issue was in IPM prior to the rewrite.

Let me know if that seems to do the trick, and I'll clean it up for "real" use.

Good idea also re: "eagerly" computing and storing the mapping between communicators at creation, and then using the latter to translate in subsequent communication calls (though, I think one would need to invert the mapping created in your example, and figure out a way to key it on the relevant communicator).

This could be handy if MPI_Group_translate_ranks turns out to be unexpectedly expensive.

brianmaustin commented 8 years ago

Hi Scott,

The spyplots that I generate from map_calls.txt look more reasonable now. They aren't quite what I expected for GTC, and I'm looking into how to reconcile them. The one clear problem I've identified so far is that the map_adjacency.txt file has a header, but is otherwise empty.

-Brian

brianmaustin commented 8 years ago

I figured out that my previous experiment was with the master branch. When I use the subcomm-ranks-issue branch, the spyplot (which I generate from the Send-like calls in map_calls.txt) is correct.

However, the map_adjacency.txt file is still empty. Is that a bug, or am I missing an option that would enable it? I'm running with IPM_LOG=full IPM_REPORT=full, and parsing with ipm_parse -full -html -force-topology

-Brian

swfrench commented 8 years ago

I think this will happen when ipm_parse cannot find the corresponding keys file. Try setting IPM_KEYFILE to point to ${install_prefix}/etc/ipm_key_mpi.

brianmaustin commented 8 years ago

Sure enough. Thanks Scott! -Brian

swfrench commented 8 years ago

Glad to hear that did it, @brianmaustin!

From your GTC runs instrumented with both master and subcomm-ranks-issue, do you happen to have wallclock times or some other suitable performance measure to compare the two? I'm interested in seeing whether you can detect any significant difference from the additional rank-remapping overhead.

swfrench commented 8 years ago

Gentle ping @brianmaustin - Did you happen to have a chance to look at the relative performance impact of using the subcomm-ranks-issue branch?

swfrench commented 8 years ago

I had a chance to run some very simple benchmarks on this branch (halo exchange with blocking point-to-point operations). This was done with two different communicators: MPI_COMM_WORLD and a re-mapped version thereof with a modular shift in rank (so communicating peers remain the same, but the communicator is sufficiently "different" to exercise the slow path in IPM_MPI_MAP_RANK).
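
For reference, a communicator with this kind of modular rank shift can be built with MPI_Comm_split; the sketch below is illustrative and not the actual benchmark code:

    #include <mpi.h>

    /* Build a communicator in which world rank w becomes rank
     * (w + shift) mod P: same processes and same halo-exchange
     * neighbors, but a communicator "different" enough from
     * MPI_COMM_WORLD to exercise the slow path in IPM_MPI_MAP_RANK. */
    static MPI_Comm make_shifted_comm(int shift)
    {
        int world_rank, world_size;
        MPI_Comm shifted;

        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Single color; the key determines the new rank ordering. */
        MPI_Comm_split(MPI_COMM_WORLD, 0,
                       (world_rank + shift) % world_size, &shifted);
        return shifted;
    }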

  1. MPI_COMM_WORLD case: IPM-induced overhead relative to an uninstrumented binary does not appear to vary significantly between the subcomm-ranks-issue branch and master.
  2. Re-mapped communicator case: instrumentation overhead for the subcomm-ranks-issue branch is larger than master, but not by much. For example, a binary instrumented with master might experience an average slowdown per MPI call on the order of ~15%, while instrumenting with subcomm-ranks-issue (again, we're triggering the slow path in IPM_MPI_MAP_RANK) might incur an additional 2-3%.

Of course, the relative slowdown might be larger in the case of a fully random remapping (i.e. if an MPI implementation chooses to implement PMPI_Group_translate_ranks without a constant-time lookup). Whether we should worry about this case, I am not sure.

Since (1) represents the common case, but appears not to be impacted performance-wise, and (2) is impacted only slightly but fixes the issue with incorrect topology, I think it would be worth adopting this change.

Unless anyone objects, I'll plan to merge this into master soon. A change like this might also warrant a minor version bump.

swfrench commented 7 years ago

Just realized I never actually merged this fix. Will do so now and will also tag that 2.0.6.

swfrench commented 7 years ago

Done. Closing.