Hmmm...we may have a problem with the standard here - you cannot pass a `NULL` for the `nspace` argument as it is a `pmix_nspace_t` (at least, in the current library), which is a fixed-size array. The compiler should have complained about it. We should probably change it to a `const char*` so we can support a `NULL` argument.
I'll take a peek at this shortly.
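For clarity, here's roughly what that change would look like - the first prototype is my reading of the current pmix.h, the second is just a sketch of the proposal, not something that exists yet:

```c
/* current prototype (as I read pmix.h): nspace is a pmix_nspace_t,
 * i.e., a fixed-size char array type */
pmix_status_t PMIx_Resolve_peers(const char *nodename,
                                 const pmix_nspace_t nspace,
                                 pmix_proc_t **procs, size_t *nprocs);

/* sketch of the proposed change (NOT the current API), so a NULL
 * nspace can be passed to mean "all namespaces on this node" */
pmix_status_t PMIx_Resolve_peers(const char *nodename,
                                 const char *nspace,
                                 pmix_proc_t **procs, size_t *nprocs);
```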
Here comes a reproducer: node_map_test.zip
Hmmm...something isn't quite right with your list code - I'm getting strange behavior when accessing the length of the list. Looks like there is some memory corruption going on when running your case 1.
When I add a simple print to `get_node_list`:
```c
utarray_new(list, &ut_str_icd);
list->i = 0;
printf("[%u] LEN: %ld %ld\n", own_proc.rank, list->i, utarray_len(list));
```
I get this output:
```
[1] LEN: 105553116266496 0
[prterun-Ralphs-iMac-38140@1:0]: Init (round 1, pid = 38141)
[0] LEN: 105553116266496 0
```
I added it because I keep seeing output like this from your print statement:
```
[prterun-Ralphs-iMac-38140@2:0]: nodes: 30064771073, node list: Ralphs-iMac
```
which looks like `i` is not being initialized or is being overwritten. What I really don't understand is why I explicitly set `i` to 0, and yet when I immediately print it I get garbage. 🤷‍♂️ Very puzzling.
Still, I can see the behavior of not finding an nspace that shares the node, so I'll look into it.
That's strange. The `UT_array` code is a slightly adapted version from the MPICH code base. I tried on different systems and I was not able to reproduce the memory issue anywhere... :thinking:
Even more strange is that you access the value `list->i` twice in the same `printf`; the first time it gives you garbage, and the second time, via `utarray_len`, it works? Is this the same when you flip first and second, i.e. put `utarray_len` first and then access `list->i` directly?

Have you tried with `%u` as the output format specifier for the list length instead of `%ld`, to exclude the possibility that this is an output formatting issue? The datatype of `list->i` is `unsigned` after all. And in your example above, both ranks give the exact same garbage...
I can look later - it isn't really impacting me right now. I have an example in the PRRTE repo (dynamic.c) that executes the same logic, so I can just use it for now. It appears that the problem (at least for me) is that the hostname aliases for the spawned procs aren't being passed down to the parent procs, and so we aren't matching the node names (one is fully qualified and the other isn't). Will take a bit to track that down.
Just in the FWIW category: Have you looked at the `PMIx_Argv_xxx` functions? They were provided specifically for dealing with these comma-delimited lists (such as the "resolve" functions return) and are very simple to use. Would significantly reduce the complexity of the code you provided.

We also have the `PMIx_Info_list_xxx` functions for assembling arrays of `pmix_info_t` to pass into PMIx APIs - might be worth a look.
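Just to give a rough idea (a sketch, not tested here - names and signatures taken from my reading of a recent pmix.h, so double-check against your install):

```c
#include <stdio.h>
#include <stdlib.h>
#include <pmix.h>

static void show_helpers(const char *nspace)
{
    /* split the comma-delimited node list returned by PMIx_Resolve_nodes */
    char *nodelist = NULL;
    if (PMIX_SUCCESS == PMIx_Resolve_nodes(nspace, &nodelist) && NULL != nodelist) {
        char **nodes = PMIx_Argv_split(nodelist, ',');
        for (int i = 0; NULL != nodes[i]; i++) {
            printf("node[%d] = %s\n", i, nodes[i]);
        }
        PMIx_Argv_free(nodes);
        free(nodelist);
    }

    /* assemble a pmix_info_t array without manual bookkeeping */
    void *ilist = PMIx_Info_list_start();
    int timeout = 5;
    PMIx_Info_list_add(ilist, PMIX_TIMEOUT, &timeout, PMIX_INT);
    pmix_data_array_t darray;
    PMIx_Info_list_convert(ilist, &darray);
    PMIx_Info_list_release(ilist);
    /* darray.array now points to the pmix_info_t's, darray.size is the count */
    PMIX_DATA_ARRAY_DESTRUCT(&darray);
}
```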
> It appears that the problem (at least for me) is that the hostname aliases for the spawned procs aren't being passed down to the parent procs, and so we aren't matching the node names (one is fully qualified and the other isn't). Will take a bit to track that down.
Let me know if there is something I can do to help solve this issue.
> Have you looked at the `PMIx_Argv_xxx` functions?
Thanks for the hint, I will have a look!
> Let me know if there is something I can do to help solve this issue.
I appreciate the offer, but I hesitate to point you at anything just yet. The hostname alias issue might just be something local to me and not the core problem. Just something I have to resolve (pardon the pun) before I can move forward, and I'm somewhat swamped at the moment.
@sonjahapp Sorry for the very long delay. Just in case you are still interested, figured I should let you know that I've started working on this again and am making a little progress. I found an error in your provided reproducer that I figured I should mention. In your "do_spawn" function, the process doing the spawn publishes the namespace of the spawned job and calls "commit". However, you never did a "fence" to share that value with the other procs in the parent job.
So when any of those other procs attempts to "get" the namespace of the child job (so they can attempt to "resolve" things about them), the "get" operation fails. I added a "fence" to fix the problem:
```c
    /* spawn new procs */
    rc = PMIx_Spawn(NULL, 0, apps, 1, child_nspace);
    CHECK_PMIX_ERR(rc, "PMIx_Spawn", own_proc);
    PMIX_APP_FREE(apps, 1);

    /* parent puts new child nspace into KVS */
    pmix_value_t val;
    PMIX_VALUE_LOAD(&val, child_nspace, PMIX_STRING);
    rc = PMIx_Put(PMIX_GLOBAL, SPAWN_NSPACE, &val);
    CHECK_PMIX_ERR(rc, "PMIx_Put child namespace (parent)", own_proc);
    rc = PMIx_Commit();
    CHECK_PMIX_ERR(rc, "PMIx_Commit", own_proc);
}

/********** NEW CODE *************/
// circulate the name of the child nspace
PMIX_PROC_LOAD(&jobproc, own_proc.nspace, PMIX_RANK_WILDCARD);
err = PMIx_Fence(&jobproc, 1, NULL, 0);
CHECK_ERR(err);
/*****************************************/
```
Doesn't resolve the issues mentioned above - just necessary for fixing one source of error.
I also noticed you didn't call `PMIx_Connect` between the spawning proc and the child processes, instead opting for a "fence" op between them. I added the option to use "connect" so we can see if there is a difference in behavior.
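Roughly, the "connect" variant looks like this (a sketch using the `own_proc`/`child_nspace` names from your reproducer, not a verbatim copy of what I added):

```c
/* connect the parent job with the spawned child job; the child procs
 * must make a matching PMIx_Connect call with the same participants */
pmix_proc_t peers[2];
PMIX_PROC_LOAD(&peers[0], own_proc.nspace, PMIX_RANK_WILDCARD);
PMIX_PROC_LOAD(&peers[1], child_nspace, PMIX_RANK_WILDCARD);

pmix_status_t rc = PMIx_Connect(peers, 2, NULL, 0);
if (PMIX_SUCCESS != rc) {
    fprintf(stderr, "PMIx_Connect failed: %s\n", PMIx_Error_string(rc));
}
```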
Let me know if you are still interested. I'll be working on it anyway, but won't bother you further if you are not.
Background information
Hello again!
I found a few problems with the use of `PMIx_Resolve_peers` and spawned processes for different versions of OpenPMIx. I'm not sure if all these "problems" are actually bugs. Perhaps I am doing something wrong or something that is not supported by the PMIx standard. Any insights would be highly appreciated!

What version of the PMIx Reference Library are you using?
Describe how PMIx was installed
from a git clone
Please describe the system on which you are running
Details of the problem
General observations (independent of OpenPMIx version):

- `PMIx_Resolve_peers` should (as I understand it) return `NULL` and `nprocs == 0` if there are zero processes of a namespace running on a given node. However, currently `PMIX_ERR_NOT_FOUND` is returned in this case, which is somewhat confusing because this indicates (at least to me) that something went wrong.
- Setting the `nspace` argument of this function to `NULL` should return all processes running on a node independent of their namespace; this does not work for cases where multiple nspaces are running on the same node. (A call sketch follows this list.)
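To illustrate the call pattern I mean (a sketch, not taken verbatim from my test program):

```c
#include <stdio.h>
#include <pmix.h>

static void resolve_on_node(const char *nodename, const char *nspace)
{
    pmix_proc_t *procs = NULL;
    size_t nprocs = 0;

    /* nspace may be NULL here to request procs from all namespaces on the
     * node; per the observations above, current releases instead return
     * PMIX_ERR_NOT_FOUND in several of these situations */
    pmix_status_t rc = PMIx_Resolve_peers(nodename, nspace, &procs, &nprocs);
    if (PMIX_ERR_NOT_FOUND == rc) {
        printf("no peers reported on %s (returned as an error)\n", nodename);
        return;
    }
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Resolve_peers failed: %s\n", PMIx_Error_string(rc));
        return;
    }
    for (size_t n = 0; n < nprocs; n++) {
        printf("peer %s:%u on %s\n", procs[n].nspace, procs[n].rank, nodename);
    }
    PMIX_PROC_FREE(procs, nprocs);
}
```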
Observations for different OpenPMIx versions:

Test setup: `testapp` spawns 2 processes. Launch commands tested:

- `prterun -np 2 --host=n01:4 ./testapp`
- `prterun -np 2 --host=n01:2,n02:2 ./testapp`
- `prterun -np 2 --host=n01:1,n02:3 ./testapp`
- Without PMIx process group: `PMIX_ERR_NOT_FOUND` returned for the spawned nspace in the original procs; works for the spawned procs.
- `PMIX_ERR_NOT_FOUND` for the spawned nspace on the original nodes; `PRTE ERROR: Not found in file <...>/src/prted/pmix/pmix_server_fence.c at line 248` for the spawned procs (might be partly a PRRTE issue?).
- `PMIX_ERR_NOT_FOUND` for
  - the spawned nspace in the original procs, on the node with spawned procs and on their own node
  - the original nspace in the spawned procs, on the node with original procs and on their own node

  In essence: nspaces can't resolve each other's peers. A PMIx process group makes no difference.
- `PMIX_ERR_TIMEOUT`; `PRTE ERROR: Not found in file <...>/src/prted/pmix/pmix_server_fence.c at line 248` for the spawned procs. I will add a reproducer for this asap (need to strip down my test program first to make it focus on this aspect ;-) )
Thanks a lot for the help!