open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

master/v5.0.x: `mpirun --with-ft ulfm` produces a Segmentation Fault #10285

Closed: klaa97 closed this issue 2 years ago

klaa97 commented 2 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

5.0.0rc4, 5.0.0rc5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Distribution tarball for both versions

Please describe the system on which you are running


Details of the problem

After building the tarball with

shell$ ./configure --with-ft=mpi
shell$ make all install

I get the following output with any MPI executable.

shell$ mpirun --with-ft ulfm ./hello_world # Segmentation fault (core dumped)
shell$ mpirun --with-ft mpi ./hello_world # Segmentation fault (core dumped)
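For reference, the hello_world here is nothing special; a minimal sketch of such a program (its contents are immaterial, since the crash happens even with a nonexistent executable, as noted below):

/* hello_world.c -- build with: mpicc hello_world.c -o hello_world */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}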

Note that I get this output even when specifying a nonexistent file as the executable; this leads me to believe that the problem is in schizo's parsing of the MPI CLI options. I did a little digging and I suspect the problem might be somewhere here: https://github.com/openpmix/prrte/blob/9ae73d4d97f843fac994103f2232f6570baaba26/src/mca/schizo/ompi/schizo_ompi.c#L394

Note also that if I manually specify, directly on the command line, the MCA options that are pushed in that code, ULFM support seems to work.
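For the record, that manual workaround (using the environment-variable forms of those parameters, which a later comment in this thread confirms) looks something like:

shell$ OMPI_MCA_mpi_ft_enable=true PRTE_MCA_prte_enable_ft=1 mpirun ./hello_world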

Thank you!

abouteiller commented 2 years ago

Hey @klaa97, there was a short period of time during which the mpiexec option was broken in PRTE.

Can you replicate this with v5.0.0rc6, configured with --with-prte=internal?
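That is, rebuilding along the lines of (a sketch combining the original configure line with the suggested flag):

shell$ ./configure --with-ft=mpi --with-prte=internal
shell$ make all install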

abouteiller commented 2 years ago

This is fixed in PRTE master, but has not yet been imported into either ompi/main or v5.0.x.

The following changes in PRTE need to be imported: https://github.com/openpmix/prrte/pull/1302

awlauria commented 2 years ago

PR'd to the v2.1 branch here: https://github.com/openpmix/prrte/pull/1351

Will link the submodule update to v5 when we open that.

Loay-Tabbara commented 2 years ago

> [Quotes the original report above; same build steps and segfault, network type: tcp.]
>
> Note also that if I manually specify, directly on the command line, the MCA options that are pushed in that code, ULFM support seems to work.

Your comment saved my b***!

shell$ export OMPI_MCA_mpi_ft_enable=true
shell$ export PRTE_MCA_prte_enable_ft=1

That did the trick for me! I am not sure why, but I am not able to set the "np" flag to a low number: when I set it to a low number, the program does not work well. Is there a way to set it as an environment variable? Thanks in advance :)

Update: I did set it with PRTE_MCA_prte_set_default_slots to the wanted number instead of using the np flag, but I still get a crash at low process counts, as if the fault tolerance does not kick in.
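One general ULFM point worth ruling out here (not specific to this bug): building with --with-ft=mpi only enables the machinery; a program survives failures only if it opts in by making errors returnable and handling them itself. A minimal sketch, assuming Open MPI's MPIX_ ULFM extensions from <mpi-ext.h>:

/* ulfm_sketch.c -- illustrative only, not the poster's code. */
#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI ULFM extensions (MPIX_...) */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Without this, any process failure still aborts the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED) {
            /* Drop the failed processes and continue on a smaller communicator. */
            MPI_Comm shrunk;
            MPIX_Comm_shrink(MPI_COMM_WORLD, &shrunk);
            printf("rank %d survived a peer failure\n", rank);
            MPI_Comm_free(&shrunk);
        }
    }

    MPI_Finalize();
    return 0;
}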

jsquyres commented 2 years ago

@awlauria @gpaulsen Did the recent v5.0.x submodule updates fix this issue?

abouteiller commented 2 years ago

I did not experience this issue with the latest v5.0.x (fa738c5c).