open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

orte_ras_base_node_insert(): loss of slots on HNP #167

Closed ompiteam closed 7 years ago

ompiteam commented 10 years ago

For RAS modules that produce multiple orte_node_t records matching the HNP, the HNP record is overwritten with ONLY the slot count of the final matching orte_node_t record in the "nodes" list produced by the RAS module. For example, this came up when Grid Engine produced the following PE_HOSTFILE:

  node01-53 6 all.q@node01-53 <NULL>
  node01-53 10 distrib.q@node01-53 <NULL>

where node01-53 is the HNP for the job, and Grid Engine's allocation across multiple queues on node01-53 produces two lines instead of one. This issue has existed in orte_ras_base_node_insert() at least as far back as Open MPI 1.4.2; in the 1.6.1 source, see orte/mca/ras/base/ras_base_node.c:99.

Since all RAS modules can effectively produce multiple orte_node_t records with the same node name, it seems logical to fix this in orte_ras_base_node_insert().
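
A minimal sketch of the pattern being described, where node_t and insert_nodes() are simplified stand-ins for orte_node_t and orte_ras_base_node_insert() rather than the actual source:

    #include <string.h>

    /* simplified stand-in for orte_node_t */
    typedef struct {
        const char *name;
        int slots;
    } node_t;

    /* Every RAS record that matches the HNP overwrites the slot count,
     * so "node01-53 6" is lost and only the final "node01-53 10" survives. */
    static void insert_nodes(node_t *hnp, node_t *records, int n)
    {
        for (int i = 0; i < n; i++) {
            if (0 == strcmp(records[i].name, hnp->name)) {
                hnp->slots = records[i].slots;   /* last matching record wins */
            }
            /* non-HNP records are appended to the global pool unmerged */
        }
    }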

ompiteam commented 10 years ago

Imported from trac issue 3429. Created by Freyguy19713 on 2012-12-14T10:05:35, last modified: 2012-12-14T12:44:16

ompiteam commented 10 years ago

Trac comment by rhc on 2012-12-14 10:13:57:

We have had this issue of multiple queues arise in the past with Gridengine - the problem is: which entry do we believe/use? IIRC, we cannot launch a single job across multiple queues, so we can't use both allocations.

We haven't come up with a solution yet. Perhaps Rayson has some suggestions since this keeps being raised?

ompiteam commented 10 years ago

Trac comment by Freyguy19713 on 2012-12-14 10:25:42:

Since the exact queue configuration is site- and condition-dependent, it is best to drop the assumption that an allocation spanning multiple queues is invalid. The very fact that multiple Grid Engine sites have reported this issue indicates it is more commonplace than you may expect.

Consider the following PE_HOSTFILE:

  node01-53 6 all.q@node01-53 <NULL>
  node01-53 10 distrib.q@node01-53 <NULL>
  node01-55 10 distrib.q@node01-55 <NULL>
  node01-55 6 all.q@node01-55 <NULL>

If node01-53 is the HNP, then BOTH node01-55 records get added to the global node list (unique names do not appear to be enforced in the non-HNP section of the function), while the first node01-53 record is effectively discarded. So the behavior is inconsistent: only the assignment of slots to the HNP is affected.

One solution is to modify this behavior in the Grid Engine RAS code and add an MCA flag that allows it to be turned off. But this is presumably a more general problem across all RAS modules, which argues for applying the fix in ras_base_node.c instead, again with an MCA flag to disable it.
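
A rough sketch of the proposed toggle; the flag name is hypothetical, and a real patch would register an MCA parameter through Open MPI's parameter machinery rather than read an environment variable as done here:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical switch: merge duplicate node entries unless the user
     * explicitly disables it.  Stand-in for a (made-up) MCA parameter
     * such as ras_base_merge_duplicate_nodes. */
    static bool merge_duplicate_nodes(void)
    {
        const char *v = getenv("OMPI_RAS_MERGE_DUPLICATES");  /* illustrative name */
        return NULL == v || 0 != strcmp(v, "0");              /* merging on by default */
    }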

ompiteam commented 10 years ago

Trac comment by rhc on 2012-12-14 10:33:57:

I understand the situation - I should have been clearer in my response. We know this is a general issue with GE. In the past, the solution was that you had to request your GE allocation from a single queue. There is a flag in the allocation cmd for that purpose. Not specifying the flag allows GE to spread the allocation across multiple queues, thus causing the problem.

Rayson is a GE developer, so perhaps he has some idea if there is a better way we can solve this integration problem, or if the past solution is the only feasible one.

ompiteam commented 10 years ago

Trac comment by Freyguy19713 on 2012-12-14 10:46:54:

IMHO this is an issue of consistency in Open MPI, not a problem with Grid Engine and its queue assignments. As an extension to my PE_HOSTFILE example above and its inconsistent treatment of the HNP versus other nodes listed, consider the following Open MPI hostfile:

node01-28 slots=6
node01-28 slots=10

When I use this hostfile on node01-28 itself with the following command, e.g.

mpirun --display-allocation --display-map --hostfile $HOSTFILE ./tst

the resulting allocation and process map look like this:

======================   ALLOCATED NODES   ======================

 Data for node: Name: node01-28 Num slots: 16   Max slots: 0

=================================================================

 ========================   JOB MAP   ========================

 Data for node: Name: node01-28 Num procs: 16
        Process OMPI jobid: [48052,1] Process rank: 0
        Process OMPI jobid: [48052,1] Process rank: 1
        Process OMPI jobid: [48052,1] Process rank: 2
        Process OMPI jobid: [48052,1] Process rank: 3
        Process OMPI jobid: [48052,1] Process rank: 4
        Process OMPI jobid: [48052,1] Process rank: 5
        Process OMPI jobid: [48052,1] Process rank: 6
        Process OMPI jobid: [48052,1] Process rank: 7
        Process OMPI jobid: [48052,1] Process rank: 8
        Process OMPI jobid: [48052,1] Process rank: 9
        Process OMPI jobid: [48052,1] Process rank: 10
        Process OMPI jobid: [48052,1] Process rank: 11
        Process OMPI jobid: [48052,1] Process rank: 12
        Process OMPI jobid: [48052,1] Process rank: 13
        Process OMPI jobid: [48052,1] Process rank: 14
        Process OMPI jobid: [48052,1] Process rank: 15

 =============================================================

The hostfile I constructed here is really no different from Grid Engine handing Open MPI a PE_HOSTFILE containing multiple queues on the HNP, yet the two produce completely different runtime environments.

ompiteam commented 10 years ago

Trac comment by rhc on 2012-12-14 11:31:46:

I guess there are two issues here: summing HNP slots and what to do with multiple queues in GE. The former is indeed an error and I can correct it. What to do about the latter remains unclear to me.

ompiteam commented 10 years ago

Trac comment by rhc on 2012-12-14 12:00:45:

(In [27673]) Refs https://svn.open-mpi.org/trac/ompi/ticket/3429

Fix bug reported by FreyGuy19713: in cases where HNP node has multiple entries in a hostfile or other allocation, we need to track the total slots allocated to that node.
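
An illustrative sketch of the behavior this change describes (reusing the simplified node_t from the earlier sketch, not the literal r27673 diff): when the HNP appears more than once in the allocation, the slot counts are accumulated instead of letting the last record win.

    #include <string.h>

    typedef struct {
        const char *name;
        int slots;
    } node_t;

    static void insert_nodes_fixed(node_t *hnp, node_t *records, int n)
    {
        for (int i = 0; i < n; i++) {
            if (0 == strcmp(records[i].name, hnp->name)) {
                hnp->slots += records[i].slots;   /* 6 + 10 = 16 for node01-53 */
            }
        }
    }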

ompiteam commented 10 years ago

Trac comment by Freyguy19713 on 2012-12-14 12:44:16:

Replying to [comment:6 rhc]:

  (In [27673]) Refs https://svn.open-mpi.org/trac/ompi/ticket/3429

  Fix bug reported by FreyGuy19713: in cases where HNP node has multiple entries in a hostfile or other allocation, we need to track the total slots allocated to that node.

I'm not so sure that's the best fix; it ignores any possible difference in the slots_max or launch_id fields and still overwrites them. Some of the other RAS modules (e.g. ccp, loadleveler, tm) include their own checks for duplicate nodes. Fixing at the level of the individual modules prevents a fix in ras_base_node.c from altering the overall RAS behavior. That's why I ended up uploading a patched ras_gridengine_module.c on this ticket rather than patching ras_base_node.c.
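
A sketch of the per-module approach mentioned here, with simplified types standing in for orte_node_t and the list a module builds before handing it to the base; the slots_max merge policy is an assumption for illustration, not taken from any actual module:

    #include <string.h>

    typedef struct {
        const char *name;
        int slots;
        int slots_max;
    } node_t;

    /* Look for an already-parsed node with the same name. */
    static node_t *find_node(node_t *list, int count, const char *name)
    {
        for (int i = 0; i < count; i++) {
            if (0 == strcmp(list[i].name, name)) {
                return &list[i];
            }
        }
        return NULL;
    }

    /* Merge duplicates inside the module, so the base code only ever
     * sees one record per node name.  Returns the new list length. */
    static int add_record(node_t *list, int count, node_t rec)
    {
        node_t *existing = find_node(list, count, rec.name);
        if (NULL != existing) {
            existing->slots += rec.slots;               /* sum repeated queue entries */
            if (rec.slots_max > existing->slots_max) {
                existing->slots_max = rec.slots_max;    /* keep the larger cap */
            }
            return count;
        }
        list[count] = rec;                              /* append a new node */
        return count + 1;
    }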

rhc54 commented 7 years ago

Fixed long ago