Open · karl-koschutnig opened this issue 10 months ago
Hey,
I have the same problem working with apptainer on my university's HPC. I have an sbatch script that processes several subjects in parallel, and the jobs fail with OSError: [Errno 98] Address already in use:
Traceback (most recent call last):
File "/opt/conda/bin/mriqc", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.9/site-packages/mriqc/cli/run.py", line 104, in main
with Manager() as mgr:
File "/opt/conda/lib/python3.9/multiprocessing/context.py", line 57, in Manager
m.start()
File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 558, in start
self._address = reader.recv()
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
buf = self._recv_bytes()
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
raise EOFError
EOFError
Process SyncManager-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
server = cls._Server(registry, address, authkey, serializer)
File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
self.listener = Listener(address=address, backlog=16)
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
self._listener = SocketListener(address, family, backlog)
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
self._socket.bind(address)
OSError: [Errno 98] Address already in use
It may be that nipype's multiprocessing plugin is raising errors because several mriqc
runs are attempting to use the same port to communicate with subprocesses. Ideally, this issue would be addressed in the nipype plugin by allowing the multiprocessing plugin to choose a port that is not already in use.
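For what it's worth, the symptom is easy to reproduce outside of mriqc. The toy sketch below assumes that two concurrent jobs really do race for one fixed port (the port number here is an arbitrary example); the second bind fails with exactly this errno:

```python
# Toy reproduction of the symptom: two sockets trying to bind the same
# fixed TCP address. The second bind raises EADDRINUSE (errno 98 on Linux).
import socket

ADDRESS = ("127.0.0.1", 50505)  # arbitrary example port

first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(ADDRESS)
first.listen()

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(ADDRESS)
except OSError as err:
    print(err)  # [Errno 98] Address already in use
finally:
    second.close()
    first.close()
```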
As a workaround, I am currently trying to use the --net --network none
option with the apptainer run
command so that each container operates on its own local network, preventing conflicts between jobs. It appears to be working for now, and I will update you once the jobs are completed.
This has nothing to do with nipype but rather with Python's multiprocessing managers: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Manager
I've never seen this issue before, and it looks like Manager() doesn't take any arguments. This doesn't look like an easy thing for us to fix, short of no longer building the workflow in an external process.
Yes, of course. I didn't mean that the problem comes from nipype per se, but it seems possible to tell multiprocessing.Manager() to automatically choose a free port, as described in this Stack Overflow thread. I understand, though, that this is hard to test and might create other issues.
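For reference, the kind of change suggested there would look roughly like the sketch below. This is not what mriqc currently does (it calls multiprocessing.Manager(), which indeed takes no arguments); the sketch instantiates SyncManager directly and passes port 0 so the OS assigns a free port for each run. The loopback address and the authkey value are arbitrary choices for illustration:

```python
# Sketch: let the OS pick a free port for the manager instead of a fixed address.
from multiprocessing.managers import SyncManager


def start_manager() -> SyncManager:
    """Start a SyncManager bound to an OS-chosen free TCP port on localhost."""
    mgr = SyncManager(address=("127.0.0.1", 0), authkey=b"mriqc")
    mgr.start()
    return mgr


if __name__ == "__main__":
    mgr = start_manager()
    try:
        print("manager listening on", mgr.address)  # e.g. ('127.0.0.1', 45817)
        shared = mgr.dict()  # same kind of proxy Manager() hands out
        shared["status"] = "ok"
        print(shared.copy())
    finally:
        mgr.shutdown()
```

Note that this pins the manager to a TCP socket on localhost, whereas the default lets multiprocessing pick an address automatically, so it is a sketch of the idea rather than a drop-in patch.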
So far the --net --network none workaround seems to work, but it might create issues for templateflow when it needs to fetch templates. (In my case, the templates are already on disk.)
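In case it is useful to anyone trying the same workaround, a quick check along these lines can confirm that the TemplateFlow cache is already populated before submitting network-less jobs. It assumes the cache directory is visible inside the container, either via the TEMPLATEFLOW_HOME environment variable or at the default ~/.cache/templateflow:

```python
# Sketch: verify that TemplateFlow templates are already on disk, so the
# container does not need network access to fetch them.
import os
from pathlib import Path

default_home = Path.home() / ".cache" / "templateflow"
tf_home = Path(os.environ.get("TEMPLATEFLOW_HOME", default_home))

templates = sorted(p.name for p in tf_home.glob("tpl-*"))
print(f"TemplateFlow home: {tf_home}")
print(f"Cached templates: {templates if templates else 'none (mriqc would need network access)'}")
```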
@vferat -- we'd love to add your name as a contributor if you submit a patch and/or a documentation update describing your workaround ;)
Otherwise, I'm afraid we'll likely close this as won't fix.
What happened?
Hi. I am unsure if this is the right place to ask my question. I want to use mriqc with nextflow: the idea is that nextflow runs mriqc for all subjects in parallel (or at least for a subsample of the subjects). I use an Apptainer container to start mriqc, and everything works fine until more than one CPU is involved (and more than one is kind of the whole idea). So I am not sure whether the problem lies with the Apptainer setup, the nextflow setup, or with mriqc. I tend to think it is a Python (3.9) problem; that's why I am posting it here.
What command did you use?
What version of the software are you running?
23.1.0
How are you running this software?
Other
Is your data BIDS valid?
Yes
Are you reusing any previously computed results?
No
Please copy and paste any relevant log output.
Additional information / screenshots
No response