Closed zhngaj closed 2 years ago
Seems ties to -x
specifically (at least in this case)
./exports/bin/mpirun --np 2 /bin/hostname -x 42 -i
fails the same way, but removing -x 42
and it works.
Good analysis; thanks for identifying the issue.
Given that the command line parsing is now "owned" by Open MPI (remember: the code resides in PRTE, but it's an OMPI-specific plugin), I think it's probably our responsibility to fix it.
Has anyone looked into the OMPI schitzo component in PRTE to fix this?
It doesn't specifically seem to be tied to the ompi plugin, using prterun will fail in the same way:
$ ./exports/bin/prterun --np 1 /bin/hostname -x 42 -i
--------------------------------------------------------------------------
An unrecognized option was included on the prterun command line:
Option: -i
Please use the "prterun --help" command to obtain a list of all
supported options.
--------------------------------------------------------------------------
Failed in this case (without -x in application's option) as well. Note that the complained 576
is the one after -npmin
mpirun --prefix /fsx/ompi/install --wdir results/imb -n 576 --map-by ppr:36:node --rank-by slot --hostfile /fsx/hosts.file -x PATH -x LD_LIBRARY_PATH -x FI_LOG_LEVEL /fsx/IMB-MPI1 -npmin 576 -iter 200 -exclude Reduce_local
--------------------------------------------------------------------------
An unrecognized option was included on the prterun command line:
Option: 576
Please use the "prterun --help" command to obtain a list of all
supported options.
--------------------------------------------------------------------------
The solution is going to have to be effected over in the PRTE code base and then the submodule here in OMPI will need to be updated to get the fix. It is highly likely that someone from the OMPI community will need to be involved in making the PRTE fix (read: it might not get fixed if we don't contribute the fix).
Is it possible to revert the recent change to ompi https://github.com/open-mpi/ompi/commit/6ea18b79a5ec6188d56f0bf5ceb57d4f47dcd262 that updated pmix?
Is the problem in PMIx or PRTE? I had assumed it was PRTE.
It would probably be best to "fail forward", and get a fix in PRTE and/or PMIx and get that fix in an updated submodule here. I say this because -- presumably -- that submodule update fixed other issues, and we probably want to keep those fixes.
Is this issue causing pain for you? Can you use --
to delineate the mpirun/prterun args from the application+args as a workaround?
It is a bug in prrte's general command line parsing - it doesn't seem to stop at the users executable. This naive patch seems to do the trick:
diff --git a/src/util/cmd_line.c b/src/util/cmd_line.c
index 99716bf8e1..ecfe269f52 100644
--- a/src/util/cmd_line.c
+++ b/src/util/cmd_line.c
@@ -111,7 +111,7 @@ int prte_cmd_line_parse(char **pargv, char *shorts,
/* reset the parser - must be done each time we use it
* to avoid hysteresis */
- optind = 0;
+ optind = 1;
opterr = 0;
optopt = 0;
optarg = NULL;
@@ -119,6 +119,11 @@ int prte_cmd_line_parse(char **pargv, char *shorts,
// run the parser
while (1) {
argind = optind;
+ // This is the executable - or should be. Don't process
+ // any further.
+ if ('-' != argv[optind][0]) {
+ break;
+ }
opt = getopt_long(argc, argv, shorts, myoptions, &option_index);
if (-1 == opt) {
break;
it seems to work in everything I have thrown at it, but..
@zhngaj could you try that?
Does prterun
support a --
argument that tells it to stop parsing?
Thanks, @awlauria I got some weird issue when verifying your fix earlier, I will try it again.
@jdinan @jsquyres - it appears that --
can be used as a work-around from my testing.
./exports/bin/prterun --np 1 -- ./tst -x 2 -i
works - while:
./exports/bin/prterun --np 1 ./tst -x 2 -i
does not.
@zhngaj the fix has been merged into pmix master - and will go into the prrte v2.1 branch for v5.0.x to pick up at this time. Let us know if you find any additional issues with the patch.
Pulled prrte v2.1 branch for submodule prrte, option parsing works correctly for me. Thanks @awlauria
[ec2-user@compute-dy-c5n18xlarge-2 ompi]$ git submodule status
b98b7e83519963f7ba484ea34449668ea6834a0e 3rd-party/openpmix (v1.1.3-3376-gb98b7e83)
+3af4e8ea59f76a16ee3308c3f2ed3abc8f8c2cc2 3rd-party/prrte (v2.0.2-21-g3af4e8ea59)
@awlauria OMPI master branch also need to update PMIX submodule pointer to ingest this change?
@zhngaj @wzamazon pr's posted to ompi master and v5.0.x to pick up these changes.
Submodule pointers updated to bring the fixes in to master/v5.0.x. Closing.
Background information
Use git clone https://github.com/open-mpi/ompi.git
master branch c06bb8adc3348f54534d6d466cc10064fcd92374
build with
./configure --prefix=/fsx/ompi/install --with-ofi=/opt/amazon/efa/ --with-hwloc=external --disable-man-pages
git submodule status
outputsDetails of the problem
prterun tries to parse the option of the application itself, and fail.
This should be because Updating PMIx and PRRTE submodule pointers.. If I revert this change, the test can pass.