0 Exception
[hande-VirtualBox:04908] [[59073,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
1 Exception
[hande-VirtualBox:04908] [[59073,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 6 slots
that were requested by the application:
/home/hande/dev/dicodile/env/bin/python
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
2 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
[hande-VirtualBox:04908] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
3 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
4 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
5 Exception
[hande-VirtualBox:04932] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Replace is False and data exists, so doing nothing. Use replace=True to re-download the data.
[DEBUG:DICODILE] Lambda_max = 11.274413430904202
0 Exception
[hande-VirtualBox:05655] [[58362,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 9 slots
that were requested by the application:
/home/hande/dev/dicodile/env/bin/python
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
1 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
[hande-VirtualBox:05655] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
3 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
4 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
5 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
6 Exception
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[hande-VirtualBox:05664] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195
Unit tests fail on Ubuntu 18.04 with Open MPI 2.1.1 after renaming dicodile.py to _dicodile.py and exposing the dicodile function in __init__.py. While running the test:
dicodile/update_z/tests/test_dicod.py::test_stopping_criterion[6-signal_support0-atom_support0]
it returns the exceptions shown above.
Each exception occurs at line https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/reusable_workers.py#L127
while trying to spawn workers at line https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/reusable_workers.py#L120
The code spawns the specified number of processes (6 in this case). The processes start executing the specified main_worker.py script. However, each one stops at https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/main_worker.py#L6, where it tries to import from the dicodile package. I have tried adding lines before the import line; all of them run, but the import itself fails silently.
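To surface that silent failure, one generic debugging trick (a sketch, not dicodile code; guarded_import is a hypothetical helper) is to catch the failure in the spawned script and flush the traceback to stderr before the process dies:

```python
import sys
import traceback


def guarded_import(module_name):
    """Import a module by name; on failure, print the traceback to stderr
    and re-raise, so a spawned worker that dies during import at least
    leaves a message instead of failing silently."""
    try:
        return __import__(module_name)
    except BaseException:
        traceback.print_exc(file=sys.stderr)
        sys.stderr.flush()  # spawned MPI processes may lose buffered output
        raise
```

Replacing the bare import in the spawned script with guarded_import("dicodile") (or an equivalent try/except around the import) would show whether the import actually raises or simply hangs.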
For nb_workers = 1 or 2, the code runs without problems.
For nb_workers = 6, it raises an exception while spawning the processes.
At first I thought the code was not able to access hostfile_test. However, I realized that the loop starting at line https://github.com/tomMoral/dicodile/blob/1b54bacbc5c60389324608efe3462cd6e2514870/dicodile/workers/reusable_workers.py#L110 keeps running and spawns the specified number of processes in each iteration. It complains about an insufficient number of slots at the iteration where the total would exceed the number of slots in hostfile_test.
For example, in the run above, hostfile_test specifies 16 slots. In the 1st iteration the code spawns 6 processes and then raises an exception; however, those processes keep running. In the 2nd iteration it starts 6 more, 12 in total. In the 3rd iteration, as only 3 slots are left (the initial mpirun process itself occupies one slot), it complains that there are not enough slots.
I tried the same with 20 slots, and it complained in the 4th iteration, after initializing 18 processes in the first 3.
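The slot arithmetic above can be checked with a toy model (plain Python, not dicodile code; it assumes every failed spawn attempt leaks its workers and that the initial mpirun process holds one of the slots):

```python
def first_failing_attempt(total_slots, workers_per_attempt, master_slots=1):
    """Return (attempt, leaked_workers): the 1-based spawn attempt that is
    refused for lack of slots, and how many leaked workers from earlier
    attempts are already occupying slots at that point."""
    free = total_slots - master_slots  # the initial process holds one slot
    used = 0
    attempt = 1
    while used + workers_per_attempt <= free:
        used += workers_per_attempt  # a failed attempt leaks its workers
        attempt += 1
    return attempt, used


# 16 slots, 6 workers per attempt: refused on the 3rd attempt, 12 leaked
print(first_failing_attempt(16, 6))  # (3, 12)
# 20 slots: refused on the 4th attempt, after 18 workers in the first 3
print(first_failing_attempt(20, 6))  # (4, 18)
```

Both cases reproduce the observed behavior, which supports the reading that the spawned processes from failed iterations are never released.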
A similar problem occurs while running the plot_mandrill.py example with 16 slots in the hostfile, using the command:
mpirun -np 1 --hostfile hostfile python -m mpi4py examples/plot_mandrill.py
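As a diagnostic only (it masks the leak rather than fixing it), Open MPI can be allowed to oversubscribe the available slots; check both spellings below against the mpirun man page of the installed Open MPI version:

```shell
# Allow more ranks than the hostfile slots permit
mpirun -np 1 --hostfile hostfile --oversubscribe \
    python -m mpi4py examples/plot_mandrill.py

# Equivalent MCA parameter, set in the environment
export OMPI_MCA_rmaps_base_oversubscribe=1
```

If the example then survives more iterations before failing, that confirms slot exhaustion by the leaked spawned processes, rather than the import itself, is what blocks the later spawns.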