nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

Nipype multiproc/forkserver struct errors #2924

Closed dPys closed 5 years ago

dPys commented 5 years ago

Has anyone come across the following struct error with nipype's MultiProc plugin when running a workflow, in particular one that runs with a forkserver?

exception calling callback for <Future at 0x1c19df8a90 state=finished raised error>
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
 File "/usr/local/anaconda3/lib/python3.7/concurrent/futures/process.py", line 198, in _sendback_result
  exception=exception))
 File "/usr/local/anaconda3/lib/python3.7/multiprocessing/queues.py", line 364, in put
  self._writer.send_bytes(obj)
 File "/usr/local/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
  self._send_bytes(m[offset:offset + size])
 File "/usr/local/anaconda3/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
  header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

This seems to occur both with MultiProc and LegacyMultiProc on nipype 1.1.9 and dev/2.0. Linear execution works fine.

The problem appears to occur whenever any iterable is used in this workflow. My hunch is that it is a serialization issue (in particular, connecting/pickling high-memory dipy objects across nipype nodes), but I've been testing a number of possibilities. If no obvious solution comes to mind, please let me know and I can provide a minimal example with an accompanying Docker container!
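For reference, the last frame of the traceback can be reproduced in isolation. This is a minimal sketch, not nipype or multiprocessing code: `header_fits` and `MAX_HEADER_LEN` are hypothetical names introduced here, and the only thing taken from the traceback is the `struct.pack("!i", n)` call, which packs the payload length as a signed 32-bit integer.

```python
import struct

# Largest payload length a signed 32-bit "!i" header can encode.
# (Name introduced for this sketch; the bound matches the error message.)
MAX_HEADER_LEN = 2**31 - 1


def header_fits(n_bytes):
    """Return True if a payload of n_bytes can be described by a '!i' header."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        # Same exception type as in the traceback above.
        return False


print(header_fits(MAX_HEADER_LEN))      # True
print(header_fits(MAX_HEADER_LEN + 1))  # False
```

If this reading is right, any single object whose pickled form is larger than about 2 GiB would trigger the error when a worker sends it back to the parent process.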

Cheers, @dPys

{'commit_hash': 'd9976942c',
 'commit_source': 'installation',
 'networkx_version': '2.2',
 'nibabel_version': '2.4.0',
 'nipype_version': '2.0.0-dev',
 'numpy_version': '1.16.2',
 'pkg_path': '/usr/local/anaconda3/lib/python3.7/site-packages/nipype-2.0.0.dev0+gd9976942c-py3.7.egg/nipype',
 'scipy_version': '1.2.1',
 'sys_executable': '/usr/local/anaconda3/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.7.3 (default, Mar 27 2019, 16:54:48) \n'
                '[Clang 4.0.1 (tags/RELEASE_401/final)]',
 'traits_version': '5.0.0'}

effigies commented 5 years ago

Have you tried with other versions of Python?

dPys commented 5 years ago

Interesting question, @effigies. In Python 2.7, the workflow actually appears to just hang, and on an earlier node of the workflow.

Two further observations: 1) as many of these errors emerge as there are iterables used (see the two-process case below, where the traceback repeats). 2) The issue did not go away when I restructured the nodes to pass only file paths (i.e. .trk streamline files), as opposed to passing the nibabel streamlines objects themselves.

exception calling callback for <Future at 0x104858400 state=finished raised error>
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.7/concurrent/futures/process.py", line 198, in _sendback_result
    exception=exception))
  File "/usr/local/anaconda3/lib/python3.7/multiprocessing/queues.py", line 364, in put
    self._writer.send_bytes(obj)
  File "/usr/local/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/local/anaconda3/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/nipype-2.0.0.dev0+gd9976942c-py3.7.egg/nipype/pipeline/plugins/multiproc.py", line 149, in _async_callback
    result = args.result()
  File "/usr/local/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/local/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Right now I'm trying to eliminate the passing of any/all dipy objects (e.g. a gradient table) to see if that makes a difference.

Let me know if you have further ideas!

Thanks, @dPys

dPys commented 5 years ago

Fix failed :/

Here's the DAG if it helps at all to conceptualize what's going on:

[image: workflow DAG]

dPys commented 5 years ago

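Solved. Because DWI data matrices are 4D, some can exceed 4 GB, which is too large to be serialized and sent between processes (the message length header in the traceback is a signed 32-bit integer). Closing this now, but thanks for the help, @effigies!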
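To get a feel for when a 4D volume crosses that limit, here is a rough sketch. It is not the fix used in this issue: the function names, the limit constant, and the example shape are all hypothetical, and the raw array size is only a lower bound on the pickled size.

```python
# Limit implied by the "!i" length header in the traceback (signed 32-bit).
PIPE_LIMIT_BYTES = 2**31 - 1


def array_nbytes(shape, itemsize):
    """Raw size in bytes of a dense array with the given shape and element size."""
    n = itemsize
    for dim in shape:
        n *= dim
    return n


def safe_to_pass_in_memory(shape, itemsize, limit=PIPE_LIMIT_BYTES):
    """Heuristic check: True if the raw array alone stays under the limit."""
    return array_nbytes(shape, itemsize) <= limit


# Hypothetical DWI volume: 128 x 128 x 80 voxels, 300 diffusion volumes.
dwi_shape = (128, 128, 80, 300)

print(array_nbytes(dwi_shape, 8))            # 3145728000 (float64)
print(safe_to_pass_in_memory(dwi_shape, 8))  # False
print(safe_to_pass_in_memory(dwi_shape, 4))  # True (float32)
```

When the check fails, one option is to have the node write the array to disk and pass the file path as its output instead of the array itself, so that only a short string has to be pickled between processes.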