tomMoral / dicodile

Experiments for "Distributed Convolutional Dictionary Learning (DiCoDiLe): Pattern Discovery in Large Images and Signals"
https://tommoral.github.io/dicodile/
BSD 3-Clause "New" or "Revised" License

Problem spawning processes on ubuntu-18.04 with openmpi 2.1.1 #12

Open hndgzkn opened 3 years ago

hndgzkn commented 3 years ago

Unit tests fail on ubuntu 18.04 with openmpi 2.1.1 after renaming dicodile.py to _dicodile.py and exposing the dicodile function in __init__.py as:

from ._dicodile import dicodile

__all__ = ['dicodile']

While running the test:

dicodile/update_z/tests/test_dicod.py::test_stopping_criterion[6-signal_support0-atom_support0]

the run fails with:

0 Exception
[hande-VirtualBox:04908] [[59073,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
1 Exception
[hande-VirtualBox:04908] [[59073,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 6 slots
that were requested by the application:
  /home/hande/dev/dicodile/env/bin/python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
2 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
[hande-VirtualBox:04908] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
3 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
4 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
5 Exception
[hande-VirtualBox:04932] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Each exception occurs at line https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/reusable_workers.py#L127

while trying to spawn workers at line https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/reusable_workers.py#L120

The code spawns the specified number of processes (6 in this case). The processes start executing the main_worker.py script, but execution stops at https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/main_worker.py#L6 where they try to import from the dicodile package.
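For reference, the spawn step is roughly the following mpi4py pattern (a simplified sketch, not the exact reusable_workers.py code; the worker count and hostfile name are taken from this report, while the args and the "hostfile" info key are assumptions about how the spawn is configured):

```python
import sys
from mpi4py import MPI

n_workers = 6

# Pass the hostfile to Open MPI through the spawn Info object
# ("hostfile" is an Open MPI-recognized info key for MPI_Comm_spawn).
info = MPI.Info.Create()
info.Set("hostfile", "hostfile_test")

# Spawn n_workers Python interpreters running the worker script.
# Each child calls MPI_Init_thread on startup; in this report the
# children die while importing the dicodile package.
comm = MPI.COMM_SELF.Spawn(
    sys.executable,
    args=["dicodile/workers/main_worker.py"],
    maxprocs=n_workers,
    info=info,
)
```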

I've tried adding lines before the import; they all run up to the import line, but then it fails silently.
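Concretely, the debugging attempt looks roughly like this (a sketch only; the print statements are mine and the import is a stand-in for the real import on line 6 of main_worker.py):

```python
# main_worker.py -- sketch of the debugging attempt, not the actual file
import sys

print("worker: interpreter started", flush=True)        # this is printed by every worker
sys.stderr.write("worker: about to import dicodile\n")   # this is printed as well

import dicodile  # stand-in for the real import on line 6; the worker dies silently here
```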

For nb_workers = [1, 2], the code runs without problems.

For nb_workers = 6, it raises an exception while spawning the processes.

I initially thought the code was not able to access hostfile_test; however, I realized that the loop starting at line https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/reusable_workers.py#L110 keeps running and spawns the specified number of processes in each iteration. It only complains about insufficient slots at the iteration where the number of slots in hostfile_test would be exceeded.

For instance, in the example above, hostfile_test specifies 16 slots. In the 1st iteration it spawns 6 processes and then raises an exception, but the processes continue to run. In the 2nd iteration it starts 6 more processes, 12 in total. In the 3rd iteration, as only 3 slots are left, it complains that there are not enough slots.

I tried the same with 20 slots and it complained in the 4th iteration, after initializing 18 processes in the first three.
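The slot accounting is easier to see with a small illustration (not dicodile code; the assumption that the parent Python process also holds one slot is mine, but it reproduces both the 16-slot and the 20-slot observations):

```python
# Illustrative slot accounting for the spawn loop.
total_slots = 16      # slots declared in hostfile_test (use 20 for the second experiment)
n_workers = 6         # processes spawned in each loop iteration
used = 1              # assume the parent Python process occupies one slot

iteration = 0
while True:
    iteration += 1
    remaining = total_slots - used
    if remaining < n_workers:
        print(f"iteration {iteration}: {remaining} slots left, "
              f"{n_workers} requested -> 'not enough slots' error")
        break
    used += n_workers  # earlier workers never exit, so their slots stay occupied
    print(f"iteration {iteration}: spawned {n_workers} workers, {used} slots in use")
```

With 16 slots this prints the error in the 3rd iteration (3 slots left), and with 20 slots in the 4th iteration after 18 workers, matching the observations above.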

A similar problem occurs while running the plot_mandrill.py example with 16 slots in the hostfile, using the command mpirun -np 1 --hostfile hostfile python -m mpi4py examples/plot_mandrill.py:

Replace is False and data exists, so doing nothing. Use replace=True to re-download the data.
[DEBUG:DICODILE] Lambda_max = 11.274413430904202
0 Exception
[hande-VirtualBox:05655] [[58362,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 9 slots
that were requested by the application:
  /home/hande/dev/dicodile/env/bin/python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
1 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
[hande-VirtualBox:05655] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
3 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
4 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
5 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
6 Exception
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[hande-VirtualBox:05664] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195