Just to further clarify: this problem will occur regardless of the plm launcher being used (e.g., when using mpirun under slurm) as we always try to pass the regex to the backend daemons on launch, and there is no guarantee that the length restriction will save us like it did in the referenced case.
@rhc54 I am a bit puzzled by this ... I renamed my virtual cluster nodes `n-0`, `n-1`, and so on, but was unable to reproduce the hang (this is not a large cluster, so the regex length is small enough).

There is a more obvious error when the regex is too long, and this one does not involve `-` in the regex. It can be simply reproduced with the inline patch below, and I can reproduce the issue when running

```
mpirun -npernode 1 -hostfile m -mca plm rsh -mca oob tcp -mca routed_radix 1 --mca plm_base_verbose 10 hostname
```

from `n0`, where my hostfile `m` contains

```
n1
n2
```

At first glance, I did not see `n1` perform the phone home operation, but if you can point me to the right place, I will be happy to debug it.
```diff
diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c
index 118813e..54930db 100644
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ -1573,6 +1573,7 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
         ORTE_ERROR_LOG(rc);
         return rc;
     }
+#if 0
     /* if this is too long, then we'll have to do it with
      * a phone home operation instead */
     if (strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
@@ -1582,6 +1583,7 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
         /* mark that the nidmap has been communicated */
         orte_nidmap_communicated = true;
     }
+#endif
     free(param);
     if (!orte_static_ports && !orte_fwd_mpirun_port) {
```
If you only have two nodes in the hostfile, such as the Mellanox folks had, then the regex cannot be more than 1024 characters in length (which is the cmd_length limit). We run over 20k nodes with it and have never hit the limit.
I would look at the regex generator and see what it is doing wrong in that scenario.
One place to start might be to simply print out the regex and see why it is so long...
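As an illustration, a debug patch along those lines might look like the sketch below. This is only a sketch: it assumes the regex still lands in the local variable `param` inside `orte_plm_base_orted_append_basic_args()` (as in the patch quoted above), the hunk location is approximate, and `opal_output()` is OPAL's generic logging call:

```diff
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ orte_plm_base_orted_append_basic_args @@
+    /* debug: dump the generated regex and its length before the size check */
+    opal_output(0, "node regex (len=%d): %s", (int)strlen(param), param);
     /* if this is too long, then we'll have to do it with
      * a phone home operation instead */
     if (strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
```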
OK, let me put it this way then: I created a cluster whose nodes are called `frontend`, `node-001`, and `node-002`, and I am unable to reproduce the issue @karasevb is facing.

The regex (which can be seen with `--mca plm_base_verbose 10`) is a simple expansion of the host list; in this case it is `frontend,node-001,node-002@0(3)`.

Should we fix the real issue (e.g., that the phone home does not work) before optimizing the regex generation?
I honestly don't think that is the *real* issue here. The regex isn't long enough to be getting rejected, so the question is: why is it failing for Mellanox when it doesn't fail for you?

Once we understand and fix that, then sure - we can look into why the alternative code path isn't working. I honestly don't remember the code path for that case, so it would take a little looking to see how it is supposed to work.
@rhc54 I am fine with that.

@karasevb is Open MPI installed on NFS? If not, can you please double check that the very same Open MPI is deployed on all the nodes? I am having a really hard time trying to reproduce the issue, and all I can think of (as @rhc54 previously suggested) is that some node is running an older version of `v3.0.x` without the fix.

Another thing we could do is provide a patch for `orte/util/regex.c` that adds some debug output and have @karasevb run it to see what is going on over there.
@ggouaillardet Actually, I just realized that your output indeed shows the problem. The regex should have compressed `node-001` and `node-002` into a regex like `node-[3:1-2]`. The hyphen is confusing it into thinking every node name is unique and cannot be compressed.

This is why Mellanox is failing. They have a lot of nodes (we don't filter them when making the original regex), and the regex generator thinks they are all unique names. Otherwise, it would have output something like `node-[3:1-999]`.
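To make the `prefix[width:lo-hi]` notation concrete, here is a toy standalone sketch of the compression idea; it is not the actual ORTE generator (that lives in `orte/util/regex.c`) and assumes a single shared prefix with a fixed-width numeric suffix:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const char *hosts[] = { "node-001", "node-002" };
    const int n = (int)(sizeof(hosts) / sizeof(hosts[0]));

    /* split the first name into a prefix and its trailing digits */
    const char *first = hosts[0];
    int len = (int)strlen(first);
    int d = len;
    while (d > 0 && isdigit((unsigned char)first[d - 1])) {
        d--;
    }
    int width = len - d;              /* digits in the suffix, e.g. 3 for "001" */

    if (0 == width) {
        printf("%s\n", first);        /* no numeric suffix: nothing to compress */
        return 0;
    }

    /* find the min/max suffix values across the list */
    long lo = atol(first + d), hi = lo;
    for (int i = 1; i < n; i++) {
        long v = atol(hosts[i] + d);
        if (v < lo) lo = v;
        if (v > hi) hi = v;
    }

    printf("%.*s[%d:%ld-%ld]\n", d, first, width, lo, hi);   /* node-[3:1-2] */
    return 0;
}
```

When the parser instead mistakes the hyphen for range syntax, this compression step is defeated and every name is emitted verbatim, which is exactly the oversized regex described above.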
@rhc54 minor detail: the regex should be `node-00[3:1-2]`.

About the "lot of nodes" thing: are you saying that even with a two-line hostfile, and forcing `plm/rsh`, the regex includes all the (100, IIRC) nodes from the slurm allocation?
Here is a patch that fixes the issue that occurs when the regex is too long. Note that it forces the string to be handled as too long, for debugging purposes.

@rhc54 can you please have a look at it? @karasevb can you please give it a try?
```diff
diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c
index 118813e..8e941e7 100644
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ -1575,12 +1575,20 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
     }
     /* if this is too long, then we'll have to do it with
      * a phone home operation instead */
-    if (strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
+    if (false && strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
         opal_argv_append(argc, argv, "-"OPAL_MCA_CMD_LINE_ID);
         opal_argv_append(argc, argv, "orte_node_regex");
         opal_argv_append(argc, argv, param);
         /* mark that the nidmap has been communicated */
         orte_nidmap_communicated = true;
+    } else if (NULL != orte_tree_launch_cmd) {
+        assert (!orte_nidmap_communicated);
+        if (ORTE_SUCCESS != (rc = opal_dss.pack(orte_tree_launch_cmd, &param, 1, OPAL_STRING))) {
+            ORTE_ERROR_LOG(rc);
+            return rc;
+        }
+        /* mark that the nidmap has been communicated */
+        orte_nidmap_communicated = true;
     }
     free(param);
diff --git a/orte/mca/plm/rsh/plm_rsh_module.c b/orte/mca/plm/rsh/plm_rsh_module.c
index 7b5ae93..3968de7 100644
--- a/orte/mca/plm/rsh/plm_rsh_module.c
+++ b/orte/mca/plm/rsh/plm_rsh_module.c
@@ -802,6 +802,25 @@ static int remote_spawn(opal_buffer_t *launch)
         goto cleanup;
     }
 
+    /* extract the node regex if needed, and update the routing tree */
+    if (NULL == orte_node_regex) {
+        n = 1;
+        if (ORTE_SUCCESS != (rc = opal_dss.unpack(launch, &orte_node_regex, &n, OPAL_STRING))) {
+            ORTE_ERROR_LOG(rc);
+            goto cleanup;
+        }
+
+        if (NULL != orte_node_regex) {
+            if (ORTE_SUCCESS != (rc = orte_util_nidmap_parse(orte_node_regex))) {
+                ORTE_ERROR_LOG(rc);
+                goto cleanup;
+            }
+            /* be sure to update the routing tree so any tree spawn operation
+             * properly gets the number of children underneath us */
+            orte_routed.update_routing_plan(NULL);
+        }
+    }
+
     /* get the updated routing list */
     rtmod = orte_rml.get_routed(orte_coll_conduit);
     OBJ_CONSTRUCT(&coll, opal_list_t);
```
@ggouaillardet I confirm, this patch fixes the hang.
"about the "lot of nodes" thing, are you saying that even with a two line hostfile, and forcing the plm/rsh, the regex includes all the (100 iirc) nodes from the slurm allocation ?"
I think it might, yes. Even though we are only launching daemons on the initial limited set of nodes, the orte_node_pool still includes all the nodes in the slurm allocation, and that is what is used to build the regex. We filter the allocation prior to defining the daemons for the initial launch.
IIRC, the rationale was that a subsequent comm_spawn might utilize the rest of that allocation, and we need the node pool on the daemons to match that in mpirun for the distributed mapping to work. However, we could probably get that to update correctly since we now include a revised regex in the launch command.
Might be worth a shot, if you have a few cycles to try it. Alternatively, it also might be useful to get the regex to handle the hyphen correctly. As I said, we have never seen someone do that (until now), but it isn't forbidden, and so it is always possible that someone else might someday do it.
I will take a crack at it tomorrow.
About the hyphen: a cheap trick might be to convert `-` to `#` (an invalid hostname character), then create and expand the regex, and finally convert `#` back to `-`.
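A minimal standalone sketch of that swap; the helper name `swap_char()` is made up for illustration, and the real fix would do this inside the regex create/parse paths in `orte/util/regex.c`:

```c
#include <stdio.h>

/* replace every occurrence of 'from' with 'to', in place */
static void swap_char(char *s, char from, char to) {
    for (; *s != '\0'; s++) {
        if (*s == from) {
            *s = to;
        }
    }
}

int main(void) {
    char host[] = "node-001";

    swap_char(host, '-', '#');   /* "node#001": hyphen hidden from the parser */
    /* ... create and later expand the regex on the '#' form here ... */
    swap_char(host, '#', '-');   /* convert back to the real hostname */

    printf("%s\n", host);        /* prints "node-001" */
    return 0;
}
```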
True - thanks!
@rhc54 when running from `n0` with `n1` and `n2` in my hostfile, and with `--mca routed_radix 1`, I noted that `orted` is ssh'ed by `n1` on `n2` with `-mca orte_hnp_uri ...`, and hence `n2` first "phones home" to `n0`.

Is this the intended behavior? What I really mean is that I would have expected `n2` to only contact `n1`.
Hmmm... it would be nice if n2 were given the port for n1 so it could call back there. That is certainly true for static ports, but it may be an optimization for the non-static case that we didn't write yet.
thanks, i'll add this to my todo list !
Just to follow-up on the "phone home" question. There is an MCA param that is supposed to be set on the cmd line by the daemon on n1 (in your example) when tree-spawning n2. This directs the n2 daemon to send its "phone home" message to the daemon on n1 and provides the necessary contact info.
What actually happens today is that n2 will send the message directly to the HNP on n0, but it will route the message thru n1. So the message is indeed received at n0 - it just went thru the routing tree.
If you want the message to actually land on the daemon on n1, then you would need to modify the code in `orte/orted/orted_main.c`:

On line 674:

```c
ret = orte_rml_base_parse_uris(orte_parent_uri, &parent, NULL);
```

We should be putting the process name into `ORTE_PROC_MY_PARENT`, not a local `parent` location. It is possible that we update `ORTE_PROC_MY_PARENT` elsewhere, so you might check.

On line 701:

```c
if (ORTE_SUCCESS != (ret = orte_routed.update_route(NULL, ORTE_PROC_MY_HNP, &parent))) {
```

This is actually okay - we just need to add a call to update the route to `ORTE_PROC_MY_PARENT`.

On line 722:

```c
if (orte_fwd_mpirun_port || orte_static_ports) {
```

Include a check of `ORTE_PROC_MY_PARENT` - if it is set, then we send to the parent instead of the HNP.

Note that the rollup code already exists in orted_main, so this should all flow correctly from there. It would probably be a good idea to use the rollup, as it dramatically reduces the number of messages handled by the HNP.
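Pulling those three edits together, a hedged sketch of what they might look like in `orted_main.c`; this is an illustration built only from the identifiers quoted above, not the actual commit, and the `jobid` validity test on `ORTE_PROC_MY_PARENT` is an assumption:

```c
/* around line 674: record the parent's name globally, not just in a local */
ret = orte_rml_base_parse_uris(orte_parent_uri, &parent, NULL);
if (ORTE_SUCCESS == ret) {
    *ORTE_PROC_MY_PARENT = parent;    /* remember who tree-spawned us */
}

/* around line 701: keep the HNP route, and add one for the parent as well */
if (ORTE_SUCCESS != (ret = orte_routed.update_route(NULL, ORTE_PROC_MY_HNP, &parent))) {
    ORTE_ERROR_LOG(ret);
}
if (ORTE_SUCCESS != (ret = orte_routed.update_route(NULL, ORTE_PROC_MY_PARENT, &parent))) {
    ORTE_ERROR_LOG(ret);
}

/* around line 722: phone home to the parent when one is known, else the HNP */
orte_process_name_t *target = ORTE_PROC_MY_HNP;
if (ORTE_JOBID_INVALID != ORTE_PROC_MY_PARENT->jobid) {
    target = ORTE_PROC_MY_PARENT;     /* send the callback to n1 instead of n0 */
}
```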
Thanks for the lengthy explanation. Unfortunately, it does not match what I remember (and I will keep investigating on Monday).

IIRC, `orted` on `n2` is invoked with `--orte_hnp_uri` and the URI of `n0` (and not `n1`), and `lsof` showed that `n2` establishes the `oob/tcp` connection to `n0` (and not `n1`), and the messages go directly to `n0` without being routed via `n1`.

I naively tried to have `orted` on `n2` invoked with `--orte_parent_uri` and the URI of `n1`, but then the phone home message cannot be sent to `n0` because there is no route available at that stage.

I will resume this on Monday.
I fixed it (https://github.com/open-mpi/ompi/pull/4628) - found that treespawn wasn't actually doing what it should anyway. Every daemon had to phone directly home and then get a "remote spawn" message to launch its children. Root cause was the missing parent URI that prevented rollup.
This was fixed by @ggouaillardet
If a cluster is configured with `-` in the names of its nodes, then the ORTE regex parser gets confused and thinks the hyphen is part of a range designation. This leads to unusually large regex results and can cause odd behavior (e.g., being unable to tree spawn). Fortunately, we haven't seen this in 14 years, but Mellanox now has a cluster that hits the problem. There are no rules against such a configuration, so this should probably be addressed in case someone else does it someday.