openpmix / openpmix

OpenPMIx Project Repository
https://openpmix.org
Other
222 stars 115 forks source link

Problems with PMIx_Resolve_peers and spawned processes #3359

Open sonjahapp opened 4 months ago

sonjahapp commented 4 months ago

Background information

Hello again!

I found a few problems with the use of PMIx_Resolve_peers and spawned processes for different versions of OpenPMIx. I'm not sure if all these "problems" are actually bugs. Perhaps I am doing something wrong or something that is not supported by the PMIx standard. Any insights would be highly appreciated!

What version of the PMIx Reference Library are you using?

Describe how PMIx was installed

from a git clone

Please describe the system on which you are running


Details of the problem

General observations (independent of OpenPMIx version):

Observations for different OpenPMIx versions:

Case/Version 4.2.9 5.0.2 and master (85a7626d)
1 works works with PMIx process group over all procs

without PMIx process group: PMIX_ERR_NOT_FOUND returned for spawned nspace in original procs

Works for spawned procs
2 PMIX_ERR_NOT_FOUND for spawned nspace on original nodes

PRTE ERROR: Not found in file <...>/src/prted/pmix/pmix_server_fence.c at line 248 for spawned procs

might be partly a PRRTE issue?
PMIX_ERR_NOT_FOUND for
- spawned nspace in original procs on node with spawned procs and on own node
- original nspace in spawned procs on node with original procs and on own node

In essence: nspaces can't resolve each others peers

PMIx Process group makes no difference
3 hangs in original procs when resolving for spawned nspace (PMIX_ERR_TIMEOUT)

PRTE ERROR: Not found in file <...>/src/prted/pmix/pmix_server_fence.c at line 248 for spawned procs
See 5.0.2, case 2

I will add a reproducer for this asap (need to strip down my test program first to make it focus on this aspect ;-) )

Thanks a lot for the help!

rhc54 commented 4 months ago

Hmmm...we may have a problem with the standard here - you cannot pass a NULL for the nspace argument as it is a pmix_nspace_t (at least, in the current library), which is a fixed-size array. The compiler should have complained about it. We should probably change it to a const char* so we can support a NULL argument.

I'll take a peek at this shortly.

sonjahapp commented 4 months ago

Here comes a reproducer: node_map_test.zip

rhc54 commented 4 months ago

Hmmm...something isn't quite right with your list code - I'm getting strange behavior when accessing the length of the list. Looks like there is some memory corruption going on when running your case 1.

When I add a simple print to get_node_list:

    utarray_new(list, &ut_str_icd);
    list->i = 0;
    printf("[%u] LEN: %ld %ld\n", own_proc.rank, list->i, utarray_len(list));

I get this output:

[1] LEN: 105553116266496 0
[prterun-Ralphs-iMac-38140@1:0]: Init (round 1, pid = 38141)
[0] LEN: 105553116266496 0

I added it because I keep seeing output like this from your print statement:

[prterun-Ralphs-iMac-38140@2:0]: nodes: 30064771073, node list: Ralphs-iMac

which looks like i is not being initialized or is being overwritten. What I really don't understand is why I explicitly set i to 0, and yet when I immediately print it I get garbage. 🤷‍♂️ Very puzzling.

Still, I can see the behavior of not finding an nspace that shares the node, so I'll look into it.

sonjahapp commented 4 months ago

That's strange. The UT_array code is a slightly adapted version from the MPICH code base. I tried on different systems and I was not able to reproduce the memory issue anywhere... :thinking:

Even more strange is that you access the value list->i twice in the same printf, the first time it gives you garbage and the second time via utarray_len it works? Is this the same when you flip first and second, so put utarray_len at first and then access list->i directly?

sonjahapp commented 4 months ago

Have you tried with %u as output formatter for the list length instead of %ld to exclude the possibility that this is an output formatting issue? The datatype of list->i is unsigned after all. An in your example above, both ranks give the exact same garbage...

rhc54 commented 4 months ago

I can look later - it isn't really impacting me right now. I have an example in the PRRTE repo (dynamic.c) that executes the same logic, so I can just use it for now. It appears that the problem (at least for me) is that the hostname aliases for the spawned procs aren't being passed down to the parent procs, and so we aren't matching the node names (one is fully qualified and the other isn't). Will take a bit to track that down.

rhc54 commented 4 months ago

Just in the FWIW category: Have you looked at the PMIx_Argv_xxx functions? They were provided specifically for dealing with these comma-delimited lists (such as the "resolve" functions return) and are very simple to use. Would significantly reduce the complexity of the code you provided.

We also have the PMIx_Info_list_xxx functions for assembling arrays of pmix_info_t to pass into PMIx APIs - might be worth a look.

sonjahapp commented 4 months ago

It appears that the problem (at least for me) is that the hostname aliases for the spawned procs aren't being passed down to the parent procs, and so we aren't matching the node names (one is fully qualified and the other isn't). Will take a bit to track that down.

Let me know if there is something that I can do to help solving this issue.

Have you looked at the PMIx_Argv_xxx functions?

Thanks for the hint, I will have a look!

rhc54 commented 4 months ago

Let me know if there is something that I can do to help solving this issue.

I appreciate the offer, but I hesitate to point you at anything just yet. The hostname alias issue might just be something local to me and not the core problem. Just something I have to resolve (pardon the pun) before I can move forward, and I'm somewhat swamped at the moment.

rhc54 commented 2 days ago

@sonjahapp Sorry for the very long delay. Just in case you are still interested, figured I should let you know that I've started working on this again and am making a little progress. I found an error in your provided reproducer that I figured I should mention. In your "do_spawn" function, the process doing the spawn publishes the namespace of the spawned job and calls "commit". However, you never did a "fence" to share that value with the other procs in the parent job.

So when any of those other procs attempts to "get" the namespace of the child job (so they can attempt to "resolve" things about them), the "get" operation fails. I added a "fence" to fix the problem:

        /* spawn new procs */
        rc = PMIx_Spawn(NULL, 0, apps, 1, child_nspace);
        CHECK_PMIX_ERR(rc, "PMIx_Spawn", own_proc);
        PMIX_APP_FREE(apps, 1);

        /* parent puts new child nspace into KVS */
        pmix_value_t val;
        PMIX_VALUE_LOAD(&val, child_nspace, PMIX_STRING);
        rc = PMIx_Put(PMIX_GLOBAL, SPAWN_NSPACE, &val);
        CHECK_PMIX_ERR(rc, "PMIx_Put child namespace (parent)", own_proc);
        rc = PMIx_Commit();
        CHECK_PMIX_ERR(rc, "PMIx_Commit", own_proc);
    }

    /**********   NEW CODE   *************/
    // circulate the name of the child nspace
    PMIX_PROC_LOAD(&jobproc, own_proc.nspace, PMIX_RANK_WILDCARD);
    err = PMIx_Fence(&jobproc, 1, NULL, 0);
    CHECK_ERR(err);
   /*****************************************/

Doesn't resolve the issues mentioned above - just necessary for fixing one source of error.

I also noticed you didn't call PMIx_Connect between the spawning proc and the child processes, instead opting for a "fence" op between them. I added the option to use "connect" so we can see if there is a difference in behavior.

Let me know if you are still interested. I'll be working on it anyway, but won't bother you further if you are not.