wenduwan opened this issue 1 year ago
Currently I'm working on a solution for OFI.
@rhc54 I'm curious if there is a utility in pmix that provides the following information:
Given a GPU (or, for that matter, any PCI device), find out which ranks are using it as an accelerator.
This is an important piece in my thought experiment - if there are multiple NICs equally distant from the GPU, how do we make a fair selection, e.g. round-robin?
If you want to associate the GPU with the NIC, the way we've been doing it is by writing a wrapper script which does the following.
Let's take the following example on a Frontier-like architecture: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html
Say you tell mpirun (or SLURM) to bind by L3 cache. A process will be bound to core 01. The wrapper script should then pick GPU4 and restrict HIP_VISIBLE_DEVICES to GPU4. When the NIC distance-selection algorithm runs at MPI process startup, it will then pick hsn2.
This should resolve the issue you're describing in this ticket, assuming your setup is the same.
I've thought about adding logic to do all this in Open MPI, but I don't think this will be the right place. Basically the issue is, you want to restrict the GPU devices before the MPI process starts so that the application itself will adhere to using the selected GPU as well.
I guess you can push the logic up to the launcher itself, but I think this might be overkill, and I'm not sure we can write logic which is generic for all configurations.
Would this solution work for you?
@amirshehataornl Thank you! I have read through the references, and I believe your proposal should work. In fact, that is how we work around the problem right now, e.g. we need to embed CUDA_VISIBLE_DEVICES in a launcher script for each rank to force a desired NIC->GPU pairing.
I just posted a patch to introduce this logic to opal and I would really appreciate if you could take a look.
The idea is to take advantage of the new accelerator framework in opal, which exposes the underlying device PCI attributes. With this information it is possible to make Open MPI select a nearby NIC.
The reason we want to make this selection happen in Open MPI is program correctness by default (as described in this issue w.r.t. GDR). I imagine that allowing Open MPI to choose the right NIC (most of the time) should provide a better user experience.
That said, I also agree with your main idea - we should allow the user to force a different behavior if desired. For that I'm planning to introduce a companion flag (:sigh:) to completely turn off this new logic.
Did I miss anything?
@rhc54 I'm curious if there is a utility in pmix that provides the following information:
Given a GPU (or, for that matter, any PCI device), find out which ranks are using it as an accelerator.
This is an important piece in my thought experiment - if there are multiple NICs equally distant from the GPU, how do we make a fair selection, e.g. round-robin?
Actually, I think the method used on Frontier is probably not a good idea, at least for the general case. Might happen to work in their environment, but not necessarily a generic solution.
In general, it is a bad idea to introduce wrapper scripts between the launcher and the app. You'll get a particularly negative reaction from the debugger community, but it also applies to generic situations. If your approach relies on a wrapper script, it is usually due to an architectural issue.
What @wenduwan is really asking strikes me as much simpler. Given that the MPI process has been assigned a specific GPU by some entity, then what you want is for that process to use a NIC that is on the same PCIe root as that GPU. PMIx can do this for you, but that only works if the launcher is aware of the GPU assignments. This may or may not be the case depending on the method used for that assignment.
The other option is for the MPI process to look at its GPU assignment, determine the PCIe root for that GPU, and then find the NIC(s) co-resident on that root. It takes a little hwloc code, but nothing too bad.
Note that this relies on a process being assigned only one GPU - otherwise, the logic goes out the window.
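For what it's worth, the hwloc part is small. A minimal sketch of the idea (illustrative only - it assumes the process already knows its assigned GPU's PCI bus ID, and that I/O objects were kept in the topology):

#include <stdio.h>
#include <hwloc.h>

/* Given the PCI bus ID of the assigned GPU (e.g. "0000:c1:00.0"),
 * print the network OS devices that share its PCIe host bridge. */
static void list_nics_near_gpu(const char *gpu_busid)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load(topo);

    hwloc_obj_t gpu = hwloc_get_pcidev_by_busidstring(topo, gpu_busid);
    if (NULL == gpu) {
        fprintf(stderr, "GPU %s not found in topology\n", gpu_busid);
        hwloc_topology_destroy(topo);
        return;
    }

    /* walk up to the host bridge (the PCIe root) above the GPU */
    hwloc_obj_t gpu_root = gpu;
    while (gpu_root->parent && HWLOC_OBJ_BRIDGE == gpu_root->parent->type) {
        gpu_root = gpu_root->parent;
    }

    /* compare every network/openfabrics OS device against that root */
    hwloc_obj_t osdev = NULL;
    while ((osdev = hwloc_get_next_osdev(topo, osdev)) != NULL) {
        if (HWLOC_OBJ_OSDEV_NETWORK != osdev->attr->osdev.type &&
            HWLOC_OBJ_OSDEV_OPENFABRICS != osdev->attr->osdev.type) {
            continue;
        }
        hwloc_obj_t nic_root = osdev;
        while (nic_root->parent &&
               (HWLOC_OBJ_BRIDGE == nic_root->parent->type ||
                HWLOC_OBJ_PCI_DEVICE == nic_root->parent->type)) {
            nic_root = nic_root->parent;
        }
        if (nic_root == gpu_root) {
            printf("NIC %s shares a PCIe root with GPU %s\n", osdev->name, gpu_busid);
        }
    }
    hwloc_topology_destroy(topo);
}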
We generally have used an alternative approach. The GPU-to-NIC relationship only makes sense if the proc is bound - otherwise, there is no clear reason to pick one NIC over another. So if we assume the proc is bound, then we can check the device distances of the assigned GPU vs the available NICs - pick the NIC with the corresponding distance and you should be good to go.
We use the same method for the multi-GPU scenario. If the GPU device distances aren't the same, then they reside on different PCIe roots. In that case, you want to select multiple NICs, one corresponding to each GPU. Message routing then must be done on a per GPU/NIC pair basis.
@rhc54 Thanks for your insights!
PMIx can do this for you, but that only works if the launcher is aware of the GPU assignments. This may or may not be the case depending on the method used for that assignment.
We had an internal discussion and the consensus is that this won't be guaranteed. In the case of CUDA, there are a couple of ways to select the device, e.g. an environment variable, the set-device CUDA API, etc. Instead we leaned on the accelerator framework to retrieve this information.
Note that this relies on a process being assigned only one GPU - otherwise, the logic goes out the window.
You are right on point. I only had a single GPU in mind, which is the most common use case (at least) from what I have seen. Multi-GPU scenario is more complicated, and I don't have a solid idea to ensure a reasonable NIC selection, e.g. what if the user selected 2 GPUs far apart on PCI - I'm not sure OFI can handle this case gracefully right now.
The GPU-to-NIC relationship only makes sense if the proc is bound - otherwise, there is no clear reason to pick one NIC over another.
I actually disagree. Process binding is a secondary concern in this case. Here we are making an explicit tradeoff - we only make sure the NIC is close to the GPU, not necessarily to the cpuset. This is for correctness (see the GDR requirement above).
PMIx can do this for you, but that only works if the launcher is aware of the GPU assignments. This may or may not be the case depending on the method used for that assignment.
We had an internal discussion and the consensus is that this won't be guaranteed. In the case of CUDA, there are a couple of ways to select the device, e.g. an environment variable, the set-device CUDA API, etc. Instead we leaned on the accelerator framework to retrieve this information.
I agree in principle. However, there is a move to make GPUs be scheduled resources - i.e., for the scheduler to assign the GPU to be used by each process. This actually makes sense as we currently see a growing number of conflicts in production (where multiple processes select the same device, and/or multiple applications do it - e.g., a user executing multiple mpirun commands in parallel within the same allocation). In this emerging environment, the launcher is communicating the scheduler's directives to the processes - and so the launcher is aware of the assignments.
You'll be seeing more of that in the near future.
Multi-GPU scenario is more complicated, and I don't have a solid idea to ensure a reasonable NIC selection, e.g. what if the user selected 2 GPUs far apart on PCI - I'm not sure OFI can handle this case gracefully right now.
Agreed - however, in the scheduler-based environment, that won't happen. I agree that OFI doesn't handle this case today, though we are seeing multi-GPU applications become more commonplace. I suspect this is something that will need to be addressed in the not-too-distant future.
I actually disagree. Process binding is a secondary concern in this case. Here we are making an explicit tradeoff
I agree with your point. My point really was that your tradeoff may actually hurt overall performance. I agree about the correctness concern - but an alternative approach might be to simply disallow GDR when procs are unbound. Or maybe warn the user that their performance may suffer because of this behavior. I'm unsure of the right answer here - just noting that your tradeoff has consequences that are hidden by this change.
@rhc54, @wenduwan The issue I see with selecting the NIC based on the GPU independently is that you're inevitably going to make the CPU path less efficient.
This solution can either select the NIC based on the GPU or it can select it based on the cores the process is bound to. I think it makes more sense to base the selection on a common factor.
For example, if the launcher selects the cores the process is bound to, then both the GPU and the NIC should be selected based on those cores. This is what the wrapper script does. It's not merely a workaround; it is essential to properly distributing the MPI processes.
This approach will ensure GPU, CPU cores and NIC all have good affinity to each other. It will also avoid having to optimize one path at the expense of the other.
I would assume that some applications can do both GPU and system workloads. If the NIC is selected independently, it's entirely possible to have one path optimal and one path less than optimal.
On the other hand if we have a common factor we base our selection on then the selection of all pieces can be made optimal. This would be best done in the launcher.
Independently selecting the NIC based on the GPU would only work if the application specifically selects the GPU it needs and then calls setDevice() on it.
In the general case I'm aware of, the application doesn't care to select the GPU; it simply relies on the *_VISIBLE_DEVICES environment variable being set and then uses the GPU(s) set there.
For these applications the *_VISIBLE_DEVICES will need to be set by something other than the application. In our case it’ll be the wrapper script.
The question is this: in the absence of the wrapper script, who tells the MPI process which GPU to use? If the MPI process decides on its own, then there is no way to properly distribute the MPI processes among the available GPUs. So there has to be some layer which makes that assignment; if it's not the wrapper script, shouldn't it be the launcher?
The proper solution in this case would seem to be to make the launcher decide which cores to bind the process to, and then, based on that, select the nearest GPU. It then starts the MPI process. The OFI component can then select the NIC nearest the cores the process is bound to. In this way cores, GPU and NIC are all affinitized.
This solution has the advantage of letting the user decide how to map the processes, and everything selected is based on the user's input. It's possible in the future to map by GPU.
Thoughts?
As I said in a prior comment, the wrapper should be avoided as it causes problems for debuggers and generally reflects poor architectural choices in the code itself. Frankly, the user generally doesn't know which GPU or NIC each process should use, and you really don't want to force users to get that technical in these matters.
I also noted the increasing trend towards making GPUs a scheduled resource. Providing an envar listing visible devices breaks when running multiple mpirun executions in parallel, which is growing in popularity due to the move towards workflow-based computing. Thus, the schedulers are adding code to assign specific GPUs to processes as part of the scheduling algorithm to resolve the conflicts. This is not being done via envars, but rather the GPUs are scheduled and "locked" for use by the assigned procs, and then the overall assignment communicated via PMIx.
You really don't want the launcher making that decision - in the case of mpirun, it lacks global knowledge of what other mpirun invocations have done. You need something that has that global view of what all currently running jobs are doing, and (even better yet) what pending jobs want to do....which is why people are tasking the scheduler with making the assignments.
The launcher is tasked, however, with "locking" the GPU for use by the specific process(es) to which it has been assigned. This precludes someone calling setDevice on a GPU they have not been scheduled to use. The launcher can also bind the process appropriately to the region containing its assigned GPUs, where appropriate - obviously, if the process has been assigned multiple GPUs, it may not be possible to bind it to a location that is local to all of them. Appropriate mapping options are under development to help with these scenarios. And of course the launcher provides the PMIx info to the client procs.
Mapping by GPU is something already under development, as is mapping by NIC in the case where multiple NICs are present and connected to different fabric partitions. Resolving GDR issues is something we have been looking at, but it is tricky due to the problems you have identified.
The folks I've been working with are focused on ensuring that the overall procedure results in a situation where the cores, GPU and NIC are all coordinated for each process - and that the resulting combination takes into account both current and pending resource utilization to avoid conflicts. I have no issue with doing something here as an interim solution, though I would encourage you to avoid locking yourselves into a corner so you can take advantage of what's coming - especially as it won't be that long before it starts to arrive.
FWIW, I've been tasked with starting to roll it out in PRRTE v4/PMIx v5 in the next few months (using a simulated scheduler in PRRTE until the production schedulers catch up). Anyone interested in playing with it and/or participating in the design or implementation is welcome to do so.
As I said in a prior comment, the wrapper should be avoided as it causes problems for debuggers and generally reflects poor architectural choices in the code itself. Frankly, the user generally doesn't know which GPU or NIC each process should use, and you really don't want to force users to get that technical in these matters.
I'm not disputing that the wrapper is not the best solution and that it reflects a problem in the architecture. The point I was trying to make is that you need a higher-level layer which allocates the resources based on a common factor, rather than independently binding the CPU cores to the proc, independently assigning the GPU to the proc, and independently assigning the NIC to the proc. If you go with the independent approach, then you're not able to optimally select both NIC and GPU.
What I'm advocating is the idea of using a common factor to select both NIC and GPU. For example, you can use the cores the process is bound to, to select both the GPU and the NIC. Which is in effect what the wrapper does. (again, I'm not saying the wrapper is the correct solution. I'm just saying it has the desired effect)
You really don't want the launcher making that decision - in the case of mpirun, it lacks global knowledge of what other mpirun invocations have done. You need something that has that global view of what all currently running jobs are doing, and (even better yet) what pending jobs want to do....which is why people are tasking the scheduler with making the assignments.
I used the wrong terminology when I was suggesting the layer which should do that. I said launcher, but what I really meant is the scheduler. As you have noted, the scheduler is the layer best equipped to make all these decisions.
The folks I've been working with are focused on ensuring that the overall procedure results in a situation where the cores, GPU and NIC are all coordinated for each process - and that the resulting combination takes into account both current and pending resource utilization to avoid conflicts. I have no issue with doing something here as an interim solution, though I would encourage you to avoid locking yourselves into a corner so you can take advantage of what's coming - especially as it won't be that long before it starts to arrive.
If we're going to stick with an interim solution, then I would say the wrapper solution (although not perfect) is sufficient until the scheduler is able to do the same job. I don't see the point of adding code in MPI which complicates the selection process and opens up the issue of making one path less optimal than the other.
Furthermore, the wrapper will be needed either way (at least in our case, and I would imagine for other cases as well) to correctly assign a GPU to a process. Otherwise, how would a process select the GPU?
With the wrapper, as I have already mentioned, it would use the CPU cores assigned by the scheduler to select the GPU, and then the distance-selection code in the MPI OFI component would select the NIC based on the distance from the same set of CPU cores assigned by the scheduler, leading to a coherent system, and one that avoids the GDR issue mentioned.
Until the scheduler solution arrives, this looks like the most sensible approach.
FWIW, I've been tasked with starting to roll it out in PRRTE v4/PMIx v5 in the next few months (using a simulated scheduler in PRRTE until the production schedulers catch up). Anyone interested in playing with it and/or participating in the design or implementation is welcome to do so.
I'd be interested. Any documentation to look at? Any meetings to attend?
@amirshehataornl @rhc54 I agree with most of your points. What I sense here is uncertainty about how GPU usage (or the technology itself) will evolve in MPI. I have the same questions as well.
I also want to reiterate that my proposed solution will be accompanied by an opt-out flag for users to turn off the GPU-based NIC selection (a.k.a. default ON), so that when Ralph's changes arrive the application can easily switch over. With my limited knowledge, I imagine those changes need to overcome many challenges, i.e. accelerator vendor-specific quirks, corner-case user behaviors, etc.
As I mentioned earlier, we know our customers are using the same wrapper approach as Amir favors, which does work. But that is not a good sell for newcomers to Open MPI 5. Since the new release claims support for accelerators & GDR, we probably want it to do the "right" thing by default for the basic single-GPU use case. Regarding the performance tradeoff, the user can take advantage of utilities, e.g. numactl, to force CPU/NIC/GPU bindings for maximum throughput. Again, we will offer the flag to turn off this behavior if not desired.
Maybe I'm missing something.
On our system, the scheduler binds the process to a core (if so mapped), and then both the GPU and NIC are selected based on closeness to that core. Most applications, as far as I understand, don't explicitly choose the GPU (or the NIC for that matter). In our case we rely on HIP_VISIBLE_DEVICES (which would be set per process) to set the GPU using the setDevice() API.
If HIP_VISIBLE_DEVICES is not set, then each process (in the OSU tests at least) would select a GPU based on the rank. This would not be optimal, and it is buggy for GDR, as you pointed out.
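For reference, that rank-based fallback is usually just a few lines in the application. A sketch with the CUDA runtime - HIP is analogous - assuming Open MPI's OMPI_COMM_WORLD_LOCAL_RANK variable is available in the environment:

#include <stdlib.h>
#include <cuda_runtime.h>

/* Round-robin a GPU per local rank when *_VISIBLE_DEVICES is not set.
 * Returns the selected device index, or -1 if no usable GPU is found. */
int select_gpu_by_local_rank(void)
{
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lr ? atoi(lr) : 0;

    int ndev = 0;
    if (cudaSuccess != cudaGetDeviceCount(&ndev) || 0 == ndev) {
        return -1;
    }

    int dev = local_rank % ndev;   /* simple round-robin over visible devices */
    if (cudaSuccess != cudaSetDevice(dev)) {
        return -1;
    }
    return dev;
}

Nothing in that selection knows about PCIe locality, which is exactly why the NIC chosen later may end up on a different root.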
So is the goal here to have a patch which handles the case when *_VISIBLE_DEVICES is not set? The application would then select a GPU based on the rank (in essence, it round-robins), and you want to be able to optimize that path?
If so, my question is this: do most applications use either GPU or system memory, or do most of them use both? The reason I'm asking is, if the latter, wouldn't this approach make those applications less efficient?
I'll discuss it internally here and see what others think.
(by the way I don't favor the wrapper approach, I'm just saying it seems to give us the benefit we're looking for, as a stop gap :) )
@amirshehataornl From what I know, either *_VISIBLE_DEVICES or setDevice() should work in the same way (assuming a single GPU). We have an example for the p4 platform, which has 2 sockets, 4 EFA NICs and 8 A100 GPUs. The simplified topology is:
Imagine an application that utilizes 8 GPUs with 8 ranks. Inside its launch script, we can set CUDA_VISIBLE_DEVICES (or call the equivalent cudaSetDevice) based on the rank:
#!/bin/bash
# app.sh
# Spread out the GPUs across local ranks
case ${OMPI_COMM_WORLD_LOCAL_RANK} in
0)
    export CUDA_VISIBLE_DEVICES=0
    ;;
1)
    export CUDA_VISIBLE_DEVICES=1
    ;;
2)
    export CUDA_VISIBLE_DEVICES=2
    ;;
# ... ranks 3-5 follow the same pattern ...
6)
    export CUDA_VISIBLE_DEVICES=6
    ;;
7)
    export CUDA_VISIBLE_DEVICES=7
    ;;
esac
# Run the actual app
exec ./the_actual_app "$@"   # placeholder name for the real application binary
Then we can launch the app:
mpirun --map-by ppr:4:package ... app.sh
With the proposed patch, we get the desired (and, for p4, optimal) pairing, i.e. each rank's selected NIC is the one closest to its GPU.
This sounds similar to what you proposed.
If so, my question is this: do most applications use either GPU or system memory, or do most of them use both? The reason I'm asking is, if the latter, wouldn't this approach make those applications less efficient?
I had the same question. It obviously depends on the actual application, but we are foremost concerned with the uniform case, i.e. all MPI processes require exactly one GPU. And in this case, the above mapping does provide the optimal performance.
The example is just to demonstrate the NIC/GPU mapping; the application can handle that internally as well, similar to what OMB does, e.g.
mpirun --map-by ppr:4:package ... osu_mbw_mr # It produces the same bindings as above.
Sounds like you are making some significant assumptions about applications - perhaps representative of AWS workloads, but far from the typical case seen elsewhere. I'm not saying it is an invalid scenario, just that you are hardcoding a behavior that only works well in a very specific application type...and there are lots and lots of applications out there.
I believe what @amirshehataornl and I are trying to point out is that this isn't necessary nor desirable. If AWS has a specific use-case of concern, then the wrapper solution is the one you should pursue as you can customize it to fit your specific situation without negatively impacting others.
Down the road, we hopefully will see a more generalized solution become available that may make these wrappers (which we all agree are undesirable) unnecessary. Meantime, no reason to be coding things into OMPI that will cause broader problems.
Just my $0.02 - you folks can/should do what you want here. I'll continue working with folks on the generalized approach.
I'd be interested. Any documentation to look at? Any meetings to attend?
Let me check around - it's all being done rather quietly right now, and some proprietary-like concerns (in terms of competitiveness) need to be addressed. Stuff is being regularly exposed in the PMIx/PRRTE master branches as progress is made. Hopefully that will accelerate a bit over the summer.
Reading through everyone's opinions above, my impression is that none of us could predict how real workloads will use GPU, leading to the fear that a short-term fix will prove categorically wrong in the long term. I totally understand that.
That said, what we do know is that applications will hit issues with GDR when Open MPI 5 is released as-is. I wonder if and how we should advertise the mitigation to potential users? Should we document how to write the wrapper? I doubt that will look good either.
Instead, maybe we can make my proposed solution an "opt-in" feature, i.e. expose an mca parameter to turn on the NIC selection logic, and advertise that to the public if the application wants to use e.g. GDR? That way once we have a more complete long-term solution we can internally override the logic and not worry about compatibility.
What do you think, @rhc54 @amirshehataornl?
Reading through everyone's opinions above, my impression is that none of us could predict how real workloads will use GPU, leading to the fear that a short-term fix will prove categorically wrong in the long term. I totally understand that.
Yeah, I think that's an accurate picture
Instead, maybe we can make my proposed solution an "opt-in" feature, i.e. expose an mca parameter to turn on the NIC selection logic, and advertise that to the public if the application wants to use e.g. GDR? That way once we have a more complete long-term solution we can internally override the logic and not worry about compatibility.
I honestly don't have an opinion on that option. My only concern was that the proposed solution appeared to be "opt-out", which meant that it would break people who don't fit that particular usage pattern. I don't think anyone really knows how many people fit into either category, but it just felt odd.
The counter is that someone might get the "wrong" NIC, compute for a long time (doing non-GDR stuff), and then run into trouble after having spent a bunch of time/money. Hence my suggestion of possibly disabling GDR if they don't opt in, as it feels like the safest alternative - and warning folks that GDR has been disabled, how to enable it if they care, how to turn off the warning if they don't care, etc.
I suppose you could follow that warning pattern either way - it's just a mechanism that we've used before.
opt-in sounds good to me.
And I don't mean to beat a dead horse, but to be consistent, can we use PMIx to implement this functionality? I was discussing how we could do this with @rhc54, and it seems like we can do the following:
PMIx already calculates the distances from proc->gpu and proc->nic, so we can retrieve this information and:
If a GPU is set && GPU selection is enabled, find the proc->nic distance closest to the proc->gpu distance - this gives you the NIC nearest the GPU; otherwise, fall back to find_nearest_nic_to_proc().
There could be multiple NICs with the same distance to the GPU; in that case you can round-robin over them. The current logic should handle that.
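To make that concrete, the selection step could look roughly like the sketch below (plain C; the dev_dist_t records and distance values are placeholders for whatever the PMIx/hwloc distance query actually returns):

#include <limits.h>

/* Placeholder per-NIC record: index into the provider list plus the
 * proc->device distance obtained from the distance query. */
typedef struct {
    int index;
    unsigned dist;
} dev_dist_t;

/* Pick the NIC whose proc->nic distance is closest to the proc->gpu
 * distance; break ties round-robin on the local rank. */
static int select_nic(const dev_dist_t *nics, int nnics,
                      unsigned gpu_dist, int local_rank)
{
    unsigned best = UINT_MAX;
    int nties = 0;

    for (int i = 0; i < nnics; i++) {
        unsigned diff = nics[i].dist > gpu_dist ?
                        nics[i].dist - gpu_dist : gpu_dist - nics[i].dist;
        if (diff < best) {
            best = diff;
            nties = 0;
        }
        if (diff == best) {
            nties++;
        }
    }
    if (0 == nties) {
        return -1;      /* no NICs available at all */
    }

    /* round-robin among the NICs tied at the best distance */
    int pick = local_rank % nties, seen = 0;
    for (int i = 0; i < nnics; i++) {
        unsigned diff = nics[i].dist > gpu_dist ?
                        nics[i].dist - gpu_dist : gpu_dist - nics[i].dist;
        if (diff == best && seen++ == pick) {
            return nics[i].index;
        }
    }
    return -1;
}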
The advantage of doing it this way is you can warn when selecting the NIC nearest the GPU will result in a less-than-optimal proc->NIC selection. That might be good info for the user to see, instead of having to chase their tail trying to understand why the system path is not performing as well as they think it should.
thoughts?
PMIx already calculates the distances from proc->gpu and proc->nic. So we can retrieve this information and If GPU is set && GPU selection enabled find the closest proc->nic distance to the proc->gpu distance. This will give you the nearest NIC to the gpu else find_nearest_nic_to_proc()
@amirshehataornl I'm not sure that alone will make GDR happy - the GPU and NIC have to (more/less) be behind the same PCIe switch. We need a direct distance between GPU and NIC instead. That's why I was probing if we already have something in PMIx - or if PMIx is the right component to do that.
Also regarding...
If so, my question is this, do most applications either use GPU or System memory? Or do most of them use both? reason I'm asking is if the latter, wouldn't this approach make those applications less efficient? I'll discuss it internally here and see what others think.
I'm curious if you have something to share?
I looked around the internet, and the public examples all seem to follow what I described earlier, e.g. Open MPI's own FAQ, GROMACS, chlora, etc.
This leads me to think that the proposed approach (either opt-in or opt-out) is a good default/starting point.
@amirshehataornl I'm not sure that alone will make GDR happy - the GPU and NIC have to (more/less) be behind the same PCIe switch. We need a direct distance between GPU and NIC instead.
I don't believe that will do what you say you want either - you can have minimum distance between two objects and not be behind the same switch. Distance simply measures the number of steps to get to the root of the object, then the number of steps to get to the root of the other object, and then the number of steps down that bus to get to the other object. Nothing in that calculation says that the two objects must be on the same root.
Note that I can easily construct a system where the min distance between a particular GPU and NIC will be across two PCIe root switches. If you want to know which pairs share a switch, I don't see how you can rely solely on distance, whether direct or not.
What you seem to be saying is that you want something that gives you a list of the device IDs of all devices on a given PCIe switch so you can then search it and select a GPU/NIC pair that is on the same list. This is easily done by adding another attribute definition and having the PMIx server simply construct the lists while it is computing the device distances.
On the other hand, you kind of made that requirement a little squishy with your "more/less" comment. If the pair don't have to be behind the same switch, then the generalized distance calculation outlined by @amirshehataornl should be fine. If they do have to be behind the same switch, then the new list I described is the only real way to do it as distance won't really work in general.
I looked around the internet, and the public examples all seem to follow what I described earlier, e.g. Open MPI's own FAQ, gromacs, chlora, etc.
I cannot address what one might find with a simple Google search - the terms of the search, the specific search engine, etc all tend to make the results somewhat questionable. Reasonable indicator, yes - definitive answer, no (and I'm not saying you presented it as definitive).
There are quite a few research papers published on the use of multiple GPUs per MPI process. In fact, it is the basis of the design underlying the exascale computers, and forming the backbone of designs for the next generation systems. It is particularly common in the AI world - which obviously isn't the target you are describing, but nonetheless of increasing interest to OMPI consumers.
This leads me to think that the proposed approach(either opt-in or opt-out) is a good default/starting point.
I don't have an issue with that starting point, though I would personally recommend "opt-in" as this is something new and potentially problematic for some users.
I cannot address what one might find with a simple Google search - the terms of the search, the specific search engine, etc all tend to make the results somewhat questionable. Reasonable indicator, yes - definitive answer, no (and I'm not saying you presented it as definitive).
I have hardly done any research in the broader area, so I will take your word for it 🙂
If they do have to be behind the same switch, then the new list I described is the only real way to do it as distance won't really work in general.
You are right about my solution not being technically correct. In fact I made an assumption about the hardware topology, i.e. all NICs and GPUs are symmetric. This works for AWS, and you are 100% correct that this isn't universally true.
It's not hard to implement the correct checks, but it is likely going to be verbose. So far we are really concerned with a particular corner case, i.e. one vendor (Nvidia) and a handful of device models (those with GDR capabilities). If we want to do it optimally, the solution might be (overly) pedantic and exhaustive (and take too long to materialize). I wonder if we can make some simplifying assumptions and start from there. In that spirit, I want to be lazy up front and make this an opt-in feature to iterate on later.
I will incorporate your feedback and update the PR.
You are right about my solution not being technically correct. In fact I made an assumption about the hardware topology, i.e. all NICs and GPUs are symmetric. This works for AWS, and you are 100% correct that this isn't universally true.
Do we need to make the assumption? We can certainly have PMIx provide the list, but one can also construct it from the hwloc topology directly in the case where PMIx doesn't provide it. Constructing the list is straightforward and perhaps simpler than the patch you provided - no distances are involved, you just look for a PCIe root and then traverse its children, and then move to the next PCIe root. You can count steps as you go down the bus if you want distance along the bus as well.
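Something along these lines, as a sketch (it just prints the GPU and network OS devices grouped under each PCIe host bridge; a real implementation would store the lists rather than print them):

#include <stdio.h>
#include <hwloc.h>

/* Recursively print GPU/NIC OS devices in the I/O subtree under obj. */
static void print_io_children(hwloc_obj_t obj, int root_idx)
{
    for (hwloc_obj_t child = obj->io_first_child; child; child = child->next_sibling) {
        if (HWLOC_OBJ_OS_DEVICE == child->type &&
            (HWLOC_OBJ_OSDEV_GPU == child->attr->osdev.type ||
             HWLOC_OBJ_OSDEV_NETWORK == child->attr->osdev.type ||
             HWLOC_OBJ_OSDEV_OPENFABRICS == child->attr->osdev.type)) {
            printf("  PCIe root #%d: %s\n", root_idx, child->name);
        }
        print_io_children(child, root_idx);
    }
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load(topo);

    /* every host bridge is one PCIe root; walk the devices below each */
    hwloc_obj_t br = NULL;
    int root_idx = 0;
    while ((br = hwloc_get_next_bridge(topo, br)) != NULL) {
        if (HWLOC_OBJ_BRIDGE_HOST != br->attr->bridge.upstream_type) {
            continue;   /* skip PCI-to-PCI bridges; start only from host bridges */
        }
        print_io_children(br, root_idx++);
    }
    hwloc_topology_destroy(topo);
    return 0;
}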
The issue might be: what do you do in the case where there is no GPU/NIC pair on the same PCIe root? However, I don't remember if your patch deals with that situation either - does it?
I would still recommend "opt-in" because it is a new feature and going to surprise some folks. How we compute the pairing is somewhat irrelevant to that decision.
Is your feature request related to a problem? Please describe. The main/v5.0.x branch does not have a built-in mechanism to pair an accelerator (let's say a GPU) with a "nearby" NIC. This has 2 implications for Nvidia GPUDirect RDMA:
Undefined behavior due to data ordering. According to the Nvidia GPUDirect RDMA documentation, GDR requires that "...the two devices must share the same upstream PCI Express root complex. Some of the limitations depend on the platform used and could be lifted in current/future products". Naively choosing a NIC on a different PCIe root complex from the GPU could in theory cause a data-ordering issue - imagine a workflow where the NIC DMA-writes the payload into GPU memory, then writes a completion into host memory, and the CPU then consumes that completion.
This results in undefined behavior - the CPU could process the completion before the data arrives at the GPU across PCIe root complexes.
Describe the solution you'd like At the minimum, we should correctly select the NIC with the shortest distance (measured in some way) from the user-selected GPU. In Open MPI 5, we can take advantage of the accelerator framework (we just exposed the PCI attributes via the get_device_pci_attr API).
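To illustrate the idea (a sketch only - the GPU-side struct and its field names are stand-ins for whatever get_device_pci_attr actually returns, and matching on domain/bus numbers is just a heuristic; the PCIe hierarchy, not bus numbering, is what ultimately matters for GDR):

#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <rdma/fabric.h>   /* struct fi_info, fid_nic, fi_pci_attr */

/* Stand-in for the accelerator framework's PCI attributes of the GPU. */
struct gpu_pci_attr {
    uint16_t domain;
    uint8_t  bus;
};

/* Among the fi_info entries returned by the provider, prefer the NIC whose
 * PCI address is closest (same domain, nearest bus number) to the GPU. */
static struct fi_info *pick_nic_near_gpu(struct fi_info *providers,
                                         const struct gpu_pci_attr *gpu)
{
    struct fi_info *best = providers;
    int best_gap = INT_MAX;

    for (struct fi_info *p = providers; p; p = p->next) {
        if (!p->nic || !p->nic->bus_attr ||
            FI_BUS_PCI != p->nic->bus_attr->bus_type) {
            continue;   /* no PCI information for this NIC */
        }
        const struct fi_pci_attr *pci = &p->nic->bus_attr->attr.pci;
        if (pci->domain_id != gpu->domain) {
            continue;   /* different PCI domain - not comparable */
        }
        int gap = abs((int)pci->bus_id - (int)gpu->bus);
        if (gap < best_gap) {
            best_gap = gap;
            best = p;
        }
    }
    return best;   /* falls back to the first provider if nothing matched */
}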
Optionally, for GDR we need to double-check that the selected NIC and GPU comply with the above requirement, and throw an error/warning otherwise.
Describe alternatives you've considered Currently the application must pin the GPU, e.g. set the visible CUDA device, for each rank according to the PCIe configuration.
Additional context Related: https://github.com/open-mpi/ompi/pull/11687