BenjaminRodenberg opened this issue 4 years ago
I did some history research:
In https://github.com/precice/precice/pull/299 we observed that `mpi4py` was missing in the Python solverdummy. In https://github.com/precice/precice/pull/316, we decided to move it directly into the bindings and remove it from the solverdummy.
There is also https://github.com/precice/precice/issues/311
I looked a bit more into the history and found the original reason for adding the statement `from mpi4py import MPI` here.
I would suggest first reproducing this behaviour in a test, to make sure it's worth all the trouble.
**Edit:** updated to the new location of the solverdummy and preCICE v2.0.0.
In the following I describe how to provoke the `mpi4py` error mentioned in https://github.com/precice/precice/pull/299#issuecomment-469421697:
Use ~~preCICE revision https://github.com/precice/precice/commit/9f778290416416255fc73a495e962def301648b0~~ preCICE v2.0.0
Build and install preCICE via

```shell
mkdir build
cd build
cmake -DBUILD_SHARED_LIBS=ON -DPRECICE_PETScMapping=OFF -DPRECICE_MPICommunication=<ON|OFF> ..
make -j4
sudo make install
```
Use ~~python-bindings revision https://github.com/precice/python-bindings/commit/7ddf2894644bb596e3ddbf772059ed98ab61b5ed~~ python-bindings revision https://github.com/precice/python-bindings/pull/36/commits/3ad6d0ee0eb29d29991095f9d3fa85bdefa671f0 and remove this line
Install the bindings via `pip3 install --user .`
Navigate to the solverdummy: ~~`cd precice/tools/solverdummies`~~ `cd python-bindings/solverdummy`
Run ~~`python3 python/solverdummy.py precice-config.xml SolverOne MeshOne` and `python3 python/solverdummy.py precice-config.xml SolverTwo MeshTwo`~~ `python3 solverdummy/solverdummy.py solverdummy/precice-config.xml SolverOne MeshOne` and `python3 solverdummy/solverdummy.py solverdummy/precice-config.xml SolverTwo MeshTwo`
With `-DPRECICE_MPICommunication=OFF`, everything works as expected. With `-DPRECICE_MPICommunication=ON`, we get the following error:

```
~/precice/tools/solverdummies$ python3 python/solverdummy.py precice-config.xml SolverOne MeshOne
[2020-01-16 21:18:51.319269] [0x00007f78ae21db80] [trace] Entering operator()
[2020-01-16 21:18:51.319303] [0x00007f78ae21db80] [debug] Initialize MPI
[benjamin-ThinkPad-X1-Yoga-2nd:12451] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: mca_patcher_base_patch_t_class (ignored)
[benjamin-ThinkPad-X1-Yoga-2nd:12451] mca_base_component_repository_open: unable to open mca_shmem_mmap: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[benjamin-ThinkPad-X1-Yoga-2nd:12451] mca_base_component_repository_open: unable to open mca_shmem_posix: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[benjamin-ThinkPad-X1-Yoga-2nd:12451] mca_base_component_repository_open: unable to open mca_shmem_sysv: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[benjamin-ThinkPad-X1-Yoga-2nd:12451] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
```
With `-DPRECICE_MPICommunication=ON`, but with the line `from mpi4py import MPI` added before or after `import precice` in `solverdummy.py`, everything works.

I checked the error described in https://github.com/precice/python-bindings/issues/8#issuecomment-575329609 for the code provided in #36. The error still persists.
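A minimal sketch of this workaround: the `try`/`except` guard and the `mpi4py_status` helper are hypothetical additions so the snippet also runs where `mpi4py` (or `precice`) is not installed; the actual solverdummy just adds the bare import next to `import precice`.

```python
# Sketch of the workaround described above: importing mpi4py (before or after
# importing precice) makes MPI initialization succeed in the setup above.
# The guard and the flag are illustrative additions, not part of the real
# solverdummy, so this snippet also runs without mpi4py installed.
try:
    from mpi4py import MPI  # noqa: F401 -- imported for its side effect only
    mpi4py_available = True
except ImportError:
    mpi4py_available = False


def mpi4py_status() -> str:
    """Return a short status string, e.g. for debugging output."""
    return "mpi4py loaded" if mpi4py_available else "mpi4py not available"
```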
Another idea that might help us close this issue: preCICE allows checking whether it was compiled with MPI through `SolverInterface::getVersionInformation`. This might be a good way to determine whether MPI (or `mpi4py`) is needed. Then we could drop `mpi4py` as a mandatory dependency and, depending on what `SolverInterface::getVersionInformation` returns, raise a warning (or error) if `mpi4py` is needed but cannot be found.
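A sketch of this idea, assuming the version string is a semicolon-separated list of `KEY=VALUE` entries (the exact format of what `SolverInterface::getVersionInformation` returns, and both helper names, are assumptions of this sketch):

```python
import importlib.util
import warnings


def mpi_enabled(version_info: str) -> bool:
    """Check a preCICE version-information string for MPI support.

    Assumes a semicolon-separated "KEY=VALUE" layout such as "MPI=Y;PETSc=N";
    the real output of SolverInterface::getVersionInformation may differ.
    """
    for entry in version_info.split(";"):
        key, _, value = entry.partition("=")
        if key.strip().upper() == "MPI":
            return value.strip().upper() in {"Y", "YES", "ON", "1", "TRUE"}
    return False


def ensure_mpi4py(version_info: str) -> None:
    """Warn if preCICE was built with MPI but mpi4py cannot be found."""
    if mpi_enabled(version_info) and importlib.util.find_spec("mpi4py") is None:
        warnings.warn(
            "preCICE was compiled with MPI support, but mpi4py is not "
            "installed; initializing preCICE may fail."
        )
```

With this, the bindings could call `ensure_mpi4py` at import time instead of unconditionally requiring `mpi4py`.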
Continuing the closed PR https://github.com/precice/precice/pull/312 here.