open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.16k stars 859 forks source link

prterun tries to parse the option of the application itself, and fail #10044

Closed zhngaj closed 2 years ago

zhngaj commented 2 years ago

Background information


Details of the problem

prterun tries to parse the option of the application itself, and fail.

[ec2-user@compute-dy-c5n18xlarge-1 ompi]$ mpirun -n 2 --map-by ppr:1:node --rank-by slot --hostfile /fsx/hosts /fsx/osu_barrier -x 1 -i 1
--------------------------------------------------------------------------
An unrecognized option was included on the prterun command line:

  Option: -i

Please use the "prterun --help" command to obtain a list of all
supported options.
--------------------------------------------------------------------------

This should be because Updating PMIx and PRRTE submodule pointers.. If I revert this change, the test can pass.

awlauria commented 2 years ago

Seems ties to -x specifically (at least in this case)

./exports/bin/mpirun --np 2 /bin/hostname -x 42 -i

fails the same way, but removing -x 42 and it works.

jsquyres commented 2 years ago

Good analysis; thanks for identifying the issue.

Given that the command line parsing is now "owned" by Open MPI (remember: the code resides in PRTE, but it's an OMPI-specific plugin), I think it's probably our responsibility to fix it.

Has anyone looked into the OMPI schitzo component in PRTE to fix this?

awlauria commented 2 years ago

It doesn't specifically seem to be tied to the ompi plugin, using prterun will fail in the same way:

$ ./exports/bin/prterun --np 1 /bin/hostname -x 42 -i
--------------------------------------------------------------------------
An unrecognized option was included on the prterun command line:

  Option: -i

Please use the "prterun --help" command to obtain a list of all
supported options.
--------------------------------------------------------------------------
zhngaj commented 2 years ago

Failed in this case (without -x in application's option) as well. Note that the complained 576 is the one after -npmin

mpirun --prefix /fsx/ompi/install --wdir results/imb -n 576 --map-by ppr:36:node  --rank-by slot --hostfile /fsx/hosts.file -x PATH -x LD_LIBRARY_PATH -x FI_LOG_LEVEL /fsx/IMB-MPI1 -npmin 576 -iter 200 -exclude Reduce_local
--------------------------------------------------------------------------
An unrecognized option was included on the prterun command line:

  Option: 576

Please use the "prterun --help" command to obtain a list of all
supported options.
--------------------------------------------------------------------------
jsquyres commented 2 years ago

The solution is going to have to be effected over in the PRTE code base and then the submodule here in OMPI will need to be updated to get the fix. It is highly likely that someone from the OMPI community will need to be involved in making the PRTE fix (read: it might not get fixed if we don't contribute the fix).

wzamazon commented 2 years ago

Is it possible to revert the recent change to ompi https://github.com/open-mpi/ompi/commit/6ea18b79a5ec6188d56f0bf5ceb57d4f47dcd262 that updated pmix?

jsquyres commented 2 years ago

Is the problem in PMIx or PRTE? I had assumed it was PRTE.

It would probably be best to "fail forward", and get a fix in PRTE and/or PMIx and get that fix in an updated submodule here. I say this because -- presumably -- that submodule update fixed other issues, and we probably want to keep those fixes.

Is this issue causing pain for you? Can you use -- to delineate the mpirun/prterun args from the application+args as a workaround?

awlauria commented 2 years ago

It is a bug in prrte's general command line parsing - it doesn't seem to stop at the users executable. This naive patch seems to do the trick:

diff --git a/src/util/cmd_line.c b/src/util/cmd_line.c
index 99716bf8e1..ecfe269f52 100644
--- a/src/util/cmd_line.c
+++ b/src/util/cmd_line.c
@@ -111,7 +111,7 @@ int prte_cmd_line_parse(char **pargv, char *shorts,

     /* reset the parser - must be done each time we use it
      * to avoid hysteresis */
-    optind = 0;
+    optind = 1;
     opterr = 0;
     optopt = 0;
     optarg = NULL;
@@ -119,6 +119,11 @@ int prte_cmd_line_parse(char **pargv, char *shorts,
     // run the parser
     while (1) {
         argind = optind;
+        // This is the executable - or should be. Don't process
+        // any further.
+        if ('-' != argv[optind][0]) {
+          break;
+        }
         opt = getopt_long(argc, argv, shorts, myoptions, &option_index);
         if (-1 == opt) {
             break;

it seems to work in everything I have thrown at it, but..

@zhngaj could you try that?

jdinan commented 2 years ago

Does prterun support a -- argument that tells it to stop parsing?

zhngaj commented 2 years ago

Thanks, @awlauria I got some weird issue when verifying your fix earlier, I will try it again.

awlauria commented 2 years ago

@jdinan @jsquyres - it appears that -- can be used as a work-around from my testing.

./exports/bin/prterun --np 1 -- ./tst -x 2 -i

works - while:

./exports/bin/prterun --np 1 ./tst -x 2 -i

does not.

awlauria commented 2 years ago

@zhngaj the fix has been merged into pmix master - and will go into the prrte v2.1 branch for v5.0.x to pick up at this time. Let us know if you find any additional issues with the patch.

zhngaj commented 2 years ago

Pulled prrte v2.1 branch for submodule prrte, option parsing works correctly for me. Thanks @awlauria

[ec2-user@compute-dy-c5n18xlarge-2 ompi]$ git submodule status
 b98b7e83519963f7ba484ea34449668ea6834a0e 3rd-party/openpmix (v1.1.3-3376-gb98b7e83)
+3af4e8ea59f76a16ee3308c3f2ed3abc8f8c2cc2 3rd-party/prrte (v2.0.2-21-g3af4e8ea59)
wzamazon commented 2 years ago

@awlauria OMPI master branch also need to update PMIX submodule pointer to ingest this change?

awlauria commented 2 years ago

@zhngaj @wzamazon pr's posted to ompi master and v5.0.x to pick up these changes.

awlauria commented 2 years ago

Submodule pointers updated to bring the fixes in to master/v5.0.x. Closing.