open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Not enough slots available #5798

Closed harrysonj97 closed 6 years ago

harrysonj97 commented 6 years ago

I've written a simple Hello World program in C, and I can't seem to run it with Open MPI using more than 5 processes.

I'm using the latest version of Open MPI, 3.1.2, and I've installed it on my Mac by following this tutorial: https://intothewave.wordpress.com/2011/12/27/install-open-mpi-on-mac-os-x/

The problem is that even with --oversubscribe I get an error message at the end.

Here's my C code:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank; //rank of the process
    int size; //number of processes

    MPI_Init(&argc,&argv); //inititate MPI environment
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);

    printf("Hello world from process %d of %d\n",rank,size);
    MPI_Finalize();
    return 0;
}

and I run it from my terminal:

mpicc -o hello helloworld.c
mpirun --oversubscribe -np 10 hello

Output:

Hello world from process 0 of 10
Hello world from process 2 of 10
Hello world from process 3 of 10
Hello world from process 9 of 10
Hello world from process 7 of 10
Hello world from process 1 of 10
Hello world from process 6 of 10
Hello world from process 5 of 10
Hello world from process 4 of 10
Hello world from process 8 of 10
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

 Local host:  Harrys-MacBook-Pro.local
 System call: unlink(2) /var/folders/1t/zwstt6ds6n38qxxmytsb15mm0000gn/T//ompi.Harrys-MacBook-Pro.501/pid.1347/1/vader_segment.Harrys-MacBook-Pro.50e10001.9
 Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
[Harry-MacBook-Pro.local:01347] 3 more processes have sent help message help-opal-shmem-mmap.txt / sys call fail
[Harry-MacBook-Pro.local:01347] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Would really appreciate some help on this

Update: If I run the command:

mpirun --mca shmem posix --oversubscribe -np 10 hello

it works without the error, but I'm still wondering if there's a fix that lets the usual command run without any errors.
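
Side note: the aggregated warnings in the log above can be expanded via the MCA parameter the message itself mentions, e.g. with the same binary as above:

mpirun --mca orte_base_help_aggregate 0 --oversubscribe -np 10 hello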

jsquyres commented 6 years ago

Looks like there are two problems here:

  1. This is a well-known issue with max filename lengths on macOS. You can export TMPDIR=/tmp to avoid this problem (example commands are shown after the log below).
  2. Even when doing this, however, I sometimes get a different seg fault. @rhc54 have you seen this before? It looks like something is running out of memory (notice the show help complaints) and/or is trying to malloc something that is way too big (see the mach_vm_map error), which is... weird.
[JSQUYRES-M-26UT:87388] [[3244,0],0] ORTE_ERROR_LOG: Data unpack had inadequate space in file util/show_help.c at line 507
[JSQUYRES-M-26UT:87388] [[3244,0],0] ORTE_ERROR_LOG: Data unpack had inadequate space in file util/show_help.c at line 507
mpirun(87388,0x7fff9df4a380) malloc: *** mach_vm_map(size=18446744073392484352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
[JSQUYRES-M-26UT:87388] [[3244,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 507
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  JSQUYRES-M-26UT
  System call: unlink(2) /tmp/ompi.JSQUYRES-M-26UT.504/pid.87388/1/vader_segment.JSQUYRES-M-26UT.cac0001.5
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
mpirun(87388,0x70000226f000) malloc: *** mach_vm_map(size=1125899906846720) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
[JSQUYRES-M-26UT:87388] *** Process received signal ***
[JSQUYRES-M-26UT:87388] Signal: Segmentation fault: 11 (11)
[JSQUYRES-M-26UT:87388] Signal code: Address not mapped (1)
[JSQUYRES-M-26UT:87388] Failing at address: 0x0
[JSQUYRES-M-26UT:87388] [ 0] 0   libsystem_platform.dylib            0x00007fff65744f5a _sigtramp + 26
[JSQUYRES-M-26UT:87388] [ 1] 0   ???                                 0x0000000005d608a8 0x0 + 97913000
[JSQUYRES-M-26UT:87388] [ 2] 0   mca_rml_oob.so                      0x000000010949edac orte_rml_oob_send_buffer_nb + 988
[JSQUYRES-M-26UT:87388] [ 3] 0   libopen-rte.40.dylib                0x000000010911ef08 pmix_server_log_fn + 472
[JSQUYRES-M-26UT:87388] [ 4] 0   mca_pmix_pmix2x.so                  0x00000001092eb75d server_log + 925
[JSQUYRES-M-26UT:87388] [ 5] 0   mca_pmix_pmix2x.so                  0x00000001093266c6 pmix_server_log + 1302
[JSQUYRES-M-26UT:87388] [ 6] 0   mca_pmix_pmix2x.so                  0x0000000109315aff server_message_handler + 4959
[JSQUYRES-M-26UT:87388] [ 7] 0   mca_pmix_pmix2x.so                  0x0000000109355066 pmix_ptl_base_process_msg + 774
[JSQUYRES-M-26UT:87388] [ 8] 0   libopen-pal.40.dylib                0x00000001091ef89a opal_libevent2022_event_base_loop + 1706
[JSQUYRES-M-26UT:87388] [ 9] 0   mca_pmix_pmix2x.so                  0x000000010932ce6e progress_engine + 30
[JSQUYRES-M-26UT:87388] [10] 0   libsystem_pthread.dylib             0x00007fff6574e661 _pthread_body + 340
[JSQUYRES-M-26UT:87388] [11] 0   libsystem_pthread.dylib             0x00007fff6574e50d _pthread_body + 0
[JSQUYRES-M-26UT:87388] [12] 0   libsystem_pthread.dylib             0x00007fff6574dbf9 thread_start + 13
[JSQUYRES-M-26UT:87388] *** End of error message ***
[1]    87388 segmentation fault (core dumped)  mpirun --oversubscribe -np 16 hello_c
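
For reference, the TMPDIR workaround from item 1 above amounts to the following (using the same hello binary as in the original report):

export TMPDIR=/tmp
mpirun --oversubscribe -np 10 hello
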
rhc54 commented 6 years ago

No, I haven't seen that anywhere before - do you know at what point in the program this happens?

harrysonj97 commented 6 years ago

Strange indeed. export TMPDIR=/tmp allowed me to run mpirun --oversubscribe -np 10 hello, but if I increase it to 20 I get the same error.

jsquyres commented 6 years ago

@ggouaillardet's post may be relevant here: https://www.mail-archive.com/devel@lists.open-mpi.org/msg20760.html

From my analysis, here is what happens:

  • each rank is supposed to have its own vader_segment unlinked by btl/vader in vader_finalize().
  • but this file might have already been destroyed by another task in orte_ess_base_app_finalize():

    if (NULL == opal_pmix.register_cleanup) {
        orte_session_dir_finalize(ORTE_PROC_MY_NAME);
    }

so all the tasks end up removing the session directory via opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")

I am not really sure about the best way to fix this.

  • one option is to perform an intra-node barrier in vader_finalize()
  • another option would be to implement an opal_pmix.register_cleanup

Any thoughts?

rhc54 commented 6 years ago

I thought we had fixed this by implementing the register_cleanup option, but maybe it didn't get to the v3.x release branches?

ggouaillardet commented 6 years ago

I reproduced the issue with the latest master and embedded PMIx.

A workaround could be to mpirun --mca btl_vader_backing_directory /tmp ...
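
With the hello program from earlier in this thread, that would look something like:

mpirun --mca btl_vader_backing_directory /tmp --oversubscribe -np 10 hello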

rhc54 commented 6 years ago

Hmmm... let me check master and ensure that the OPAL wrapper function didn't get lost somewhere. There should be no way that the orte_session_dir_finalize call got executed on master.

rhc54 commented 6 years ago

Confirmed - that function pointer is definitely not NULL, so that function is never called.

ggouaillardet commented 6 years ago

I am pretty sure it was NULL for me. I will double-check that tomorrow.

You checked that on the MPI app side (i.e., not mpirun nor orted), right?

rhc54 commented 6 years ago

I just looked at the opal/pmix code and confirmed that (a) there is a function entry in the opal_pmix module and (b) the required "glue" code is present. Thus, if you are using the internal PMIx code, that function pointer cannot be NULL.

You might check to verify you didn't configure for an external (older) version of PMIx, just to be safe?

ggouaillardet commented 6 years ago

@rhc54 well, we are both right, kind of ...

From orte_ess_base_app_finalize():

    if (NULL != opal_pmix.finalize) {
        opal_pmix.finalize();
        (void) mca_base_framework_close(&opal_pmix_base_framework);
    }
    (void) mca_base_framework_close(&orte_oob_base_framework);
    (void) mca_base_framework_close(&orte_state_base_framework);

    if (NULL == opal_pmix.register_cleanup) {
        orte_session_dir_finalize(ORTE_PROC_MY_NAME);
    }

Closing the PMIx framework (re)sets opal_pmix.register_cleanup to NULL (it used to be pmix4x_register_cleanup); that is why orte_session_dir_finalize() is always invoked.

The inline patch below fixes this; can you please review it?

diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c
index a02711f..52eaee0 100644
--- a/orte/mca/ess/base/ess_base_std_app.c
+++ b/orte/mca/ess/base/ess_base_std_app.c
@@ -13,7 +13,7 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.  All rights
  *                         reserved.
  * Copyright (c) 2013-2018 Intel, Inc. All rights reserved.
- * Copyright (c) 2014-2016 Research Organization for Information Science
+ * Copyright (c) 2014-2018 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * Copyright (c) 2015      Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2018      Mellanox Technologies, Inc.
@@ -320,6 +320,7 @@ int orte_ess_base_app_setup(bool db_restrict_local)

 int orte_ess_base_app_finalize(void)
 {
+    bool orte_cleanup = (NULL == opal_pmix.register_cleanup);
     /* release the conduits */
     orte_rml.close_conduit(orte_mgmt_conduit);
     orte_rml.close_conduit(orte_coll_conduit);
@@ -341,7 +342,7 @@ int orte_ess_base_app_finalize(void)
     (void) mca_base_framework_close(&orte_oob_base_framework);
     (void) mca_base_framework_close(&orte_state_base_framework);

-    if (NULL == opal_pmix.register_cleanup) {
+    if (orte_cleanup) {
         orte_session_dir_finalize(ORTE_PROC_MY_NAME);
     }
     /* cleanup the process info */
rhc54 commented 6 years ago

Why not just close the pmix framework a little later? It shouldn't be closed until after all of ORTE has finalized.

ggouaillardet commented 6 years ago

I think you know better than me :-)

My concern is that if there is no register_cleanup(), orte_session_dir_finalize() might delete some files used by PMIx. If you tell me no such thing can ever occur, then yes, simply close the PMIx framework after all of ORTE has finalized.

I'll be happy to issue a PR based on your directions.

rhc54 commented 6 years ago

Not sure what it is that I should "know better", but I think this is pretty simple to resolve. I'll ponder it a little after I finish the current work. I'm not wild about this proposed fix as I think the issue might well persist.

ggouaillardet commented 6 years ago

I apologize; I chose my words poorly.

I did not mean anything malicious; I only wanted to say that you know better than me where to close the PMIx framework, so I leave it up to you.

Thanks

rhc54 commented 6 years ago

I didn't interpret it as anything mean - I just didn't understand it, that's all. Let me try to capture the scenarios here so perhaps you can move forward before I have time to address it. PMIx and HWLOC both have shared memory files in the session directory, but they are at the daemon's level and shouldn't be impacted by the apps. Cleanup in general has two major use-cases to consider:

Dealing with the session directory itself in the direct launch case where the RM doesn't provide cleanup requires that the app procs call orte_session_dir_finalize. This is the only time the apps should do so. The function already checks for RM cleanup and so it is safe to call in either case. Thus, the correct fix here is to (a) check for direct launch (ORTE_SCHIZO_DIRECT_LAUNCHED == orte_schizo.check_launch_environment()) and if true, then (b) call orte_session_dir_finalize. You can finalize PMIx first or not - shouldn't matter.

This still leaves the issue of the Vader files placed outside the session dir. Unfortunately, checking to see if opal_pmix.register_cleanup is NULL isn't sufficient in itself - the PMIx client library simply relays any cleanup registration to the local PMIx server, which may or may not support that feature. For example, the current Slurm PMIx plugin does not support it, so even though the OMPI function may be non-NULL, cleanup registration will fail. This will return an error code (in opal/mca/pmix/pmix4x/pmix4x.c), but we don't currently save it.

The reason we don't bother to save it is, quite simply, that the app can't do anything about it. Vader will already try to remove its files - knowing that registration failed doesn't tell the app anything new. Registration only provides a bit of backup for those cases where the app fails to remove the file due to some internal issue.

The only solution I can think of would be to have opal/pmix return the registration status code. If Vader sees that registration fails, then perhaps it should fall back to placing the backing file in the session directory to ensure it gets cleaned up by other local app procs when they call orte_session_dir_finalize.

HTH

alarcher commented 5 years ago

@rhc54 The issue still exists in 4.0.1 on illumos even if I apply this patchset. Any idea how I can diagnose and provide relevant feedback to you?

rhc54 commented 5 years ago

I'm afraid you'll have to tell me more - precisely what issue are you talking about? What patch did you apply?

alarcher commented 5 years ago

Sorry for the lack of details. I was referring to issues such as:

narval> mpirun -n 10 ./a.out 
Hello world from processor rank 3 of 10
Hello world from processor rank 2 of 10
Hello world from processor rank 8 of 10
Hello world from processor rank 0 of 10
Hello world from processor rank 9 of 10
Hello world from processor rank 7 of 10
Hello world from processor rank 4 of 10
Hello world from processor rank 5 of 10
Hello world from processor rank 6 of 10
Hello world from processor rank 1 of 10
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  narval
  System call: unlink(2) /tmp/ompi.narval.101/pid.28219/1/vader_segment.narval.84960001.1
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
[narval:28219] 2 more processes have sent help message help-opal-shmem-mmap.txt / sys call fail
[narval:28219] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

which is at least similar to the one reported above.

I applied the patch at https://github.com/open-mpi/ompi/commit/c076be52afc19b1a8c1884ff7e66b04122c7ab23

but I also tried the workaround suggested by @ggouaillardet.

Let me know what kind of output would be useful (truss, dtrace, ...).

Kind regards,

Aurélien

rhc54 commented 5 years ago

I'd suggest trying the latest nightly tarball of the 4.0.x branch as the fix may have already been committed there:

https://www.open-mpi.org/nightly/v4.0.x/
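
A typical way to try a snapshot from that page is to download a tarball, then build and install it into its own prefix (the tarball name below is only illustrative; use whatever the current nightly is called):

tar xf openmpi-v4.0.x-201908090241-6d62fb0.tar.gz
cd openmpi-v4.0.x-201908090241-6d62fb0
./configure --prefix=$HOME/ompi-nightly
make -j 4 all && make install
export PATH=$HOME/ompi-nightly/bin:$PATH
mpicc -o hello helloworld.c && mpirun --oversubscribe -np 10 hello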

alarcher commented 5 years ago

The issue is still present in openmpi-v4.0.x-201908090241-6d62fb0; should I follow up in another ticket?

rhc54 commented 5 years ago

Probably best to do so - I'm out of ideas.

leofang commented 4 years ago

Hit this issue in v4.0.3 (https://github.com/conda-forge/openmpi-feedstock/pull/58).