nipreps / mriqc

Automated Quality Control and visual reports for Quality Assessment of structural (T1w, T2w) and functional MRI of the brain
http://mriqc.readthedocs.io
Apache License 2.0

OSError: [Errno 98] Address already in use #1170

Open karl-koschutnig opened 10 months ago

karl-koschutnig commented 10 months ago

What happened?

Hi. I am not sure whether this is the right place to ask my question. I want to use MRIQC with Nextflow: the idea is that Nextflow runs MRIQC for all subjects in parallel (or at least for a subsample of the subjects). I use an Apptainer container to start MRIQC, and everything works fine until more than one CPU is involved (and more than one is kind of the whole idea). I am not sure whether the problem lies with the Apptainer setup, the Nextflow setup, or with MRIQC itself. I tend to think it is a Python (3.9) problem; that's why I am posting it here.

What command did you use?

This is the command, run through Nextflow:
mriqc /bids /out participant -w /tmp --resource-monitor --no-sub --nprocs 1 --omp-nthreads 1 -m bold --participant-label sub-122BPAF172043

So each run uses just one process and one thread.

What version of the software are you running?

23.1.0

How are you running this software?

Other

Is your data BIDS valid?

Yes

Are you reusing any previously computed results?

No

Please copy and paste any relevant log output.

ERROR ~ Error executing process > 'mriqc (27)'

Caused by:
  Process `mriqc (27)` terminated with an error exit status (1)

Command executed:

  mriqc /bids /out participant     -w /tmp --resource-monitor --no-sub     --nprocs 1 --omp-nthreads 1 -m bold --participant-label sub-122BPAF172043

Command exit status:
  1

Command output:
  (empty)

Command error:
  Process SyncManager-2:
  Traceback (most recent call last):
    File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
      self.run()
    File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
      self._target(*self._args, **self._kwargs)
    File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
      server = cls._Server(registry, address, authkey, serializer)
    File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
      self.listener = Listener(address=address, backlog=16)
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
      self._listener = SocketListener(address, family, backlog)
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
      self._socket.bind(address)
  OSError: [Errno 98] Address already in use
  Traceback (most recent call last):
    File "/opt/conda/bin/mriqc", line 8, in <module>
      sys.exit(main())
    File "/opt/conda/lib/python3.9/site-packages/mriqc/cli/run.py", line 104, in main
      with Manager() as mgr:
    File "/opt/conda/lib/python3.9/multiprocessing/context.py", line 57, in Manager
      m.start()
    File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 558, in start
      self._address = reader.recv()
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
      buf = self._recv_bytes()
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
      buf = self._recv(4)
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
      raise EOFError
  EOFError

Additional information / screenshots

No response

vferat commented 10 months ago

Hey,

I have the same problem running Apptainer on my university's HPC cluster. I have an sbatch script that processes several subjects in parallel:

OSError: [Errno 98] Address already in use
Traceback (most recent call last):
  File "/opt/conda/bin/mriqc", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/mriqc/cli/run.py", line 104, in main
    with Manager() as mgr:
  File "/opt/conda/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 558, in start
    self._address = reader.recv()
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
Process SyncManager-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
    self._socket.bind(address)

It may be that nipype's multiprocessing plugin raises errors because several MRIQC runs attempt to use the same port to communicate with their subprocesses. Ideally, this would be addressed in the nipype plugin by allowing the multiprocessing manager to choose a port that is not already in use.

As a workaround, I am currently trying the --net --network none options with the apptainer run command, so that each container operates on its own local network and jobs cannot conflict with one another; a sketch follows below. It appears to be working so far, and I will update you once the jobs have completed.
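
A sketch of that invocation, assuming a local image file named mriqc-23.1.0.sif and host paths /data/bids and /data/out (all three names are placeholders):

  apptainer run --net --network none \
      --bind /data/bids:/bids --bind /data/out:/out \
      mriqc-23.1.0.sif \
      /bids /out participant -w /tmp --no-sub \
      --nprocs 1 --omp-nthreads 1 -m bold \
      --participant-label sub-122BPAF172043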

effigies commented 10 months ago

This has nothing to do with nipype but instead Python's multiprocessing managers: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Manager

I've never seen this issue before, and it looks like Manager() doesn't take any arguments. This doesn't look like an easy thing for us to fix, apart from no longer building the workflow in a separate process.

vferat commented 10 months ago

Yes, of course, I didn't mean that the problem comes from nipype per se, but it seems to be possible to tell multiprocessing.Manager() to choose a free port automatically, as described in this Stack Overflow thread.
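
A minimal, untested sketch of that approach: instead of the Manager() convenience function (which takes no arguments), one can instantiate a SyncManager directly and bind it to port 0, letting the OS assign a free ephemeral port:

  from multiprocessing.managers import SyncManager

  # Port 0 asks the OS for any free ephemeral port, so concurrent
  # MRIQC runs on the same host cannot collide on a fixed address.
  mgr = SyncManager(address=("127.0.0.1", 0))
  mgr.start()
  try:
      shared = mgr.dict()  # same proxy objects Manager() would provide
      # ... build the workflow using the shared containers ...
  finally:
      mgr.shutdown()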

But I understand the issue is hard to test and might create other issues.

So far the --net --network none workaround seems to work, but it might create issues for TemplateFlow when it needs to fetch templates. (In my case, the templates are already on disk.)
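
If the templates are not yet on disk, one option is to pre-populate the TemplateFlow cache from a machine with network access before submitting the jobs, for example (MNI152NLin2009cAsym is just an illustrative template name; fetch whatever MRIQC actually requests):

  from templateflow import api

  # Downloads into $TEMPLATEFLOW_HOME; the containers can then
  # bind-mount this cache and run without network access.
  api.get("MNI152NLin2009cAsym")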

oesteban commented 3 months ago

@vferat -- we'd love to add your name as a contributor if you submit a patch and/or a documentation update describing your workaround ;)

Otherwise, I'm afraid we'll likely close this as won't fix.