silx-kit / fabio

I/O library for images produced by 2D X-ray detector

interference between fabio and mpi4py #218

Closed alemirone closed 1 year ago

alemirone commented 6 years ago

The problem appears on the InfiniBand nodes: the script below hangs or crashes, and only occasionally runs to completion. If mpi4py is imported before fabio, the problem disappears.

There is also a separate, expected behaviour: mpi4py emits a warning that something is doing a fork. The fork comes from sub.Popen(args=string.split(comando, " "), stdout=sub.PIPE, stderr=sub.PIPE), but in principle the spawned process ends immediately, before the MPI library starts working. The spawn is needed to gather information for other parts of the program from which this short script is extracted, so I thought it should not cause a problem if it is executed at the very beginning of the program. What happens seems to indicate that fabio is doing something under the hood.

import sys
import string
import os
import fabio
import mpi4py.MPI as MPI
import subprocess as sub

comando = 'taskset -cp %d'%(os.getpid())
print(" EXECUTING COMMAND ", comando)
p = sub.Popen(args=string.split(comando, " ") ,stdout=sub.PIPE,stderr=sub.PIPE)
print(" WAITING ")
cpuset_string, errors = p.communicate()
print cpuset_string, errors
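The fork warning mentioned above comes from spawning taskset. As a minimal sketch, assuming Python 3 on Linux, the same CPU affinity information can be read in-process with os.sched_getaffinity, so no child process is created. The helper name allowed_cpus is hypothetical, and nothing in this thread shows that avoiding the fork changes the hang or the segfault; it only removes the reason for the warning.

```python
import os


def allowed_cpus():
    """Return the sorted CPU ids this process may run on, or None if unsupported."""
    # os.sched_getaffinity is only available on some platforms (e.g. Linux).
    if not hasattr(os, "sched_getaffinity"):
        return None
    # Passing 0 means "the calling process"; no subprocess is spawned.
    return sorted(os.sched_getaffinity(0))


cpus = allowed_cpus()
print("allowed CPUs:", cpus)
```

The result is a list of integers rather than the text that taskset prints, so any code parsing cpuset_string would need to be adapted.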

vallsv commented 6 years ago

Hi. Which Python version did you use? Which fabio version did you use? Which mpi4py? Did you try with Python3?

vallsv commented 6 years ago

No problem here on Debian 8, with the system Python 2, the system Python 3, or a custom Python 3.5 with fresh libraries from PyPI.

If you need help, it would be useful to get access to your environment.

alemirone commented 6 years ago

Hello Valentin, it happens on the InfiniBand cluster, on the hib-something nodes. The environment is accessible through OAR: http://wikiserv.esrf.fr/software/index.php/Main_Page

There is no special environment, however, just the Python from the cluster's Debian.


vallsv commented 6 years ago

Reproduced using

ssh -X rnice
oarsub -q ib -I
python script.py
[hib2-1508:49216] *** Process received signal ***
[hib2-1508:49216] Signal: Segmentation fault (11)
[hib2-1508:49216] Signal code: Address not mapped (1)
[hib2-1508:49216] Failing at address: 0x3141208

Running it under gdb results in a deadlock or an infinite loop.

vallsv commented 6 years ago

Well, I tried removing parts of fabio in a virtualenv.

I did not find anything that really works. I still get a deadlock or a segfault even after removing most of the fabio modules, so I have no idea what is going on.

But using the latest mpi4py, 3.0, appears to fix the problem. The one on your system is 1.3.1.
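To check which mpi4py is installed without importing it (and so without initialising MPI), something like the following sketch could be used. It assumes Python 3.8+ for importlib.metadata. The helper names version_tuple and mpi4py_is_recent are hypothetical, and the 3.0 threshold is only the version that appeared to work in this thread, not a confirmed first fixed release.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '1.3.1' into (1, 3, 1); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def mpi4py_is_recent(minimum=(3, 0)):
    """Return True/False for the installed mpi4py, or None if it is not installed."""
    try:
        installed = metadata.version("mpi4py")
    except metadata.PackageNotFoundError:
        return None
    # Assumption: 3.0 is the threshold, taken from the observation in this thread.
    return version_tuple(installed) >= minimum


print("mpi4py >= 3.0:", mpi4py_is_recent())
```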

kif commented 1 year ago

Recent versions of mpi4py fix the issue, so the problem was in mpi4py rather than in fabio.