open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

ORTE regex fails when nodenames include hyphens #4621

Closed: rhc54 closed this issue 6 years ago

rhc54 commented 6 years ago

If a cluster is configured with - in the names of its nodes, then the ORTE regex parser gets confused and thinks the hyphen is part of a range designation. This leads to unusually large regex results and can cause odd behavior (e.g., failure to tree spawn).

Fortunately, we haven't seen this in 14 years, but Mellanox now has a cluster that hits the problem. There is no rule against such a configuration, so this should probably be addressed in case someone else does it.
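For illustration, using the host names that come up later in this thread: the generator is expected to collapse numerically suffixed names into a range, e.g. node-[3:1-2], but the hyphen makes it emit every name verbatim, e.g. frontend,node-001,node-002@0(3). On a large allocation, that verbatim list is what overruns the command-line length limit.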

rhc54 commented 6 years ago

Just to further clarify: this problem will occur regardless of which PLM launcher is used (e.g., when using mpirun under Slurm), since we always try to pass the regex to the backend daemons on launch, and there is no guarantee that the length restriction will save us like it did in the referenced case.

ggouaillardet commented 6 years ago

@rhc54 I am a bit puzzled by this...

I renamed the nodes of my virtual cluster n-0, n-1, and so on, but was unable to reproduce the hang (this is not a large cluster, so the regex stays short enough).

There is a more obvious error when the regex is too long, and that error does not involve a - in the regex at all.

That can be reproduced simply with the inline patch below; I hit the issue when running

mpirun -npernode 1 -hostfile m -mca plm rsh -mca oob tcp -mca routed_radix 1 --mca plm_base_verbose 10 hostname

from n0, where my hostfile m contains:

n1
n2

At first glance, I did not see n1 perform the phone home operation, but if you can point me at the right place, I will be happy to debug it.

diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c
index 118813e..54930db 100644
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ -1573,6 +1573,7 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
         ORTE_ERROR_LOG(rc);
         return rc;
     }
+#if 0
     /* if this is too long, then we'll have to do it with
      * a phone home operation instead */
     if (strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
@@ -1582,6 +1583,7 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
         /* mark that the nidmap has been communicated */
         orte_nidmap_communicated = true;
     }
+#endif
     free(param);

     if (!orte_static_ports && !orte_fwd_mpirun_port) {
rhc54 commented 6 years ago

If you only have two nodes in the hostfile, as the Mellanox folks had, then the regex cannot possibly exceed 1024 characters in length (which is the cmd_length limit). We run over 20k nodes with it and have never hit the limit.

I would look at the regex generator and see what it is doing wrong in that scenario.

rhc54 commented 6 years ago

One place to start might be to simply print out the regex and see why it is so long...

ggouaillardet commented 6 years ago

OK, let me put it this way then:

I created a cluster whose nodes are called frontend, node-001, and node-002, and I am unable to reproduce the issue @karasevb is facing.

The regex (which can be seen with --mca plm_base_verbose 10) is a simple expansion of the host list; in this case it is frontend,node-001,node-002@0(3).

Should we fix the real issue (i.e., the phone home does not work) before optimizing the regex generation?

rhc54 commented 6 years ago

I honestly don't think that is the real issue here. The regex isn't long enough to be getting rejected, so the question is: why is it failing for Mellanox when it doesn't fail for you?

Once we understand and fix that, then sure - we can look into why the alternative code path isn't working. I honestly don't remember the code path for that case, so it would take a little looking to see how it is supposed to work.

ggouaillardet commented 6 years ago

@rhc54 I am fine with that.

@karasevb is Open MPI installed on NFS? If not, can you please double check that the very same Open MPI is deployed on all the nodes? I am having a really hard time trying to reproduce the issue, and all I can think of is (as @rhc54 previously suggested) that some node is running an older v3.0.x build without the fix.
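For example, something along these lines (assuming passwordless ssh, which the rsh launcher already needs; host names taken from the earlier example) would reveal a version mismatch quickly:

for h in n0 n1 n2; do echo -n "$h: "; ssh $h mpirun --version | head -1; done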

rhc54 commented 6 years ago

Another thing we could do is provide a patch for orte/util/regex.c that adds some debug and have @karasevb run it to see what is going on over there.
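A minimal sketch of the kind of debug that could be dropped into the generator; the variable names (nodelist, regexp) are assumptions for illustration, not the actual regex.c identifiers:

/* temporary debug: dump the input host list and the generated regex */
opal_output(0, "regex input : %s", nodelist);
opal_output(0, "regex output: %s (len %d)", regexp, (int)strlen(regexp));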

rhc54 commented 6 years ago

@ggouaillardet Actually, I just realized that your output indeed shows the problem. The regex should have compressed node-001 and node-002 into a regex like node-[3:1-2]. The hyphen is confusing it into thinking every node name is unique and cannot be compressed.

This is why Mellanox is failing. They have a lot of nodes (we don't filter them when making the original regex), and the regex generator thinks they are all unique names. Otherwise, it would have output something like "node-[3:1-999]".

ggouaillardet commented 6 years ago

@rhc54 minor detail: the regex should be node-00[3:1-2].

About the "lot of nodes" thing: are you saying that even with a two-line hostfile, and forcing plm/rsh, the regex includes all the (100 IIRC) nodes from the Slurm allocation?

ggouaillardet commented 6 years ago

Here is a patch that fixes the issue that occurs when the regex is too long. Note that it forces the string to be treated as too long, for debugging purposes.

@rhc54 can you please have a look at it? @karasevb can you please give it a try?

diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c
index 118813e..8e941e7 100644
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ -1575,12 +1575,20 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
     }
     /* if this is too long, then we'll have to do it with
      * a phone home operation instead */
-    if (strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
+    if (false && strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
         opal_argv_append(argc, argv, "-"OPAL_MCA_CMD_LINE_ID);
         opal_argv_append(argc, argv, "orte_node_regex");
         opal_argv_append(argc, argv, param);
         /* mark that the nidmap has been communicated */
         orte_nidmap_communicated = true;
+    } else if (NULL != orte_tree_launch_cmd) {
+        assert (!orte_nidmap_communicated);
+        if (ORTE_SUCCESS != (rc = opal_dss.pack(orte_tree_launch_cmd, &param, 1, OPAL_STRING))) {
+            ORTE_ERROR_LOG(rc);
+            return rc;
+        }
+        /* mark that the nidmap has been communicated */
+        orte_nidmap_communicated = true;
     }
     free(param);

diff --git a/orte/mca/plm/rsh/plm_rsh_module.c b/orte/mca/plm/rsh/plm_rsh_module.c
index 7b5ae93..3968de7 100644
--- a/orte/mca/plm/rsh/plm_rsh_module.c
+++ b/orte/mca/plm/rsh/plm_rsh_module.c
@@ -802,6 +802,25 @@ static int remote_spawn(opal_buffer_t *launch)
         goto cleanup;
     }

+    /* extract the node regex if needed, and update the routing tree */
+    if (NULL == orte_node_regex) {
+        n = 1;
+        if (ORTE_SUCCESS != (rc = opal_dss.unpack(launch, &orte_node_regex, &n, OPAL_STRING))) {
+            ORTE_ERROR_LOG(rc);
+            goto cleanup;
+        }
+
+        if (NULL != orte_node_regex) {
+            if (ORTE_SUCCESS != (rc = orte_util_nidmap_parse(orte_node_regex))) {
+                ORTE_ERROR_LOG(rc);
+                goto cleanup;
+            }
+            /* be sure to update the routing tree so any tree spawn operation
+             * properly gets the number of children underneath us */
+            orte_routed.update_routing_plan(NULL);
+        }
+    }
+
     /* get the updated routing list */
     rtmod = orte_rml.get_routed(orte_coll_conduit);
     OBJ_CONSTRUCT(&coll, opal_list_t);
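In short: when the regex is too long for the command line, the patch packs it into the tree-launch buffer instead; each intermediate daemon then unpacks it in remote_spawn(), parses the nidmap, and updates its routing plan before spawning its own children.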
karasevb commented 6 years ago

@ggouaillardet I confirm, this patch fixes the hang.

rhc54 commented 6 years ago

"about the "lot of nodes" thing, are you saying that even with a two line hostfile, and forcing the plm/rsh, the regex includes all the (100 iirc) nodes from the slurm allocation ?"

I think it might, yes. Even though we are only launching daemons on the initial limited set of nodes, the orte_node_pool still includes all the nodes in the slurm allocation, and that is what is used to build the regex. We filter the allocation prior to defining the daemons for the initial launch.

IIRC, the rationale was that a subsequent comm_spawn might utilize the rest of that allocation, and we need the node pool on the daemons to match that in mpirun for the distributed mapping to work. However, we could probably get that to update correctly since we now include a revised regex in the launch command.

Might be worth a shot, if you have a few cycles to try it. Alternatively, it also might be useful to get the regex to handle the hyphen correctly. As I said, we have never seen someone do that (until now), but it isn't forbidden, and so it is always possible that someone else might someday do it.

ggouaillardet commented 6 years ago

I will take a crack at it tomorrow.

About the hyphen: a cheap trick might be to convert - to # (i.e., a character that is not valid in hostnames), create and expand the regex, and then convert # back to -.
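A minimal sketch of that trick in plain C, assuming the host names arrive as one comma-separated string and that # never occurs in a legal hostname; this is only an illustration of the idea, not the actual regex.c change:

#include <stdio.h>

/* Swap every occurrence of 'from' with 'to', in place. The idea is to call
 * swap_char(hosts, '-', '#') before building the regex (so '-' can only mean
 * a range separator) and swap_char(result, '#', '-') on the generated regex
 * before it is expanded or printed. */
static void swap_char(char *s, char from, char to)
{
    for (; NULL != s && '\0' != *s; s++) {
        if (from == *s) {
            *s = to;
        }
    }
}

int main(void)
{
    char hosts[] = "frontend,node-001,node-002";

    swap_char(hosts, '-', '#');   /* "frontend,node#001,node#002" */
    printf("masked  : %s\n", hosts);
    /* ... build and compress the regex here ... */
    swap_char(hosts, '#', '-');   /* restore the original names */
    printf("restored: %s\n", hosts);
    return 0;
}

Since # is not a legal hostname character, the reverse swap cannot clash with anything in the original names.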

rhc54 commented 6 years ago

True - thanks!

ggouaillardet commented 6 years ago

@rhc54 when running from n0 with n1 and n2 in my hostfile, and with --mca routed_radix 1, I noted that orted is ssh'ed on n2 by n1 with -mca orte_hnp_uri ..., and hence n2 first "phones home" to n0.

Is this the intended behavior? What I really mean is that I would have expected n2 to contact only n1.

rhc54 commented 6 years ago

Hmmm... it would be nice if n2 were given the port for n1 so it can call back there. That certainly is true for static ports, but it may be an optimization for the non-static case that we haven't written yet.

ggouaillardet commented 6 years ago

Thanks, I'll add this to my todo list!

rhc54 commented 6 years ago

Just to follow-up on the "phone home" question. There is an MCA param that is supposed to be set on the cmd line by the daemon on n1 (in your example) when tree-spawning n2. This directs the n2 daemon to send its "phone home" message to the daemon on n1 and provides the necessary contact info.

What actually happens today is that n2 will send the message directly to the HNP on n0, but it will route the message thru n1. So the message is indeed received at n0 - it just went thru the routing tree.
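Schematically, for the three-node example above, the behavior described here is:

addressed to : the HNP on n0
actual path  : n2 -> n1 -> n0   (relayed through the routing tree)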

If you want the message to actually land on the daemon on n1, then you would need to modify the code in orte/orted/orted_main.c.

Note that the rollup code already exists in orted_main, so this should all flow correctly from there. It would probably be a good idea to use the rollup as it dramatically reduces the number of messages handled by the HNP.

ggouaillardet commented 6 years ago

Thanks for the lengthy explanation. Unfortunately, it does not match what I remember (and I will keep investigating on Monday).

IIRC, orted on n2 is invoked with --orte_hnp_uri and the URI of n0 (not n1), lsof showed that n2 establishes its oob/tcp connection to n0 (not n1), and the messages go directly to n0 without being routed via n1.

I naively tried to have orted on n2 invoked with --orte_parent_uri and the URI of n1, but then the phone home message cannot be sent to n0 because there is no route available at that stage.

I will resume this on Monday.

rhc54 commented 6 years ago

I fixed it (https://github.com/open-mpi/ompi/pull/4628) - found that treespawn wasn't actually doing what it should anyway. Every daemon had to phone directly home and then get a "remote spawn" message to launch its children. Root cause was the missing parent URI that prevented rollup.

rhc54 commented 6 years ago

This was fixed by @ggouaillardet