rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

multiprocessing: OSError: AF_UNIX path too long #1571

Open · michelwi opened 3 weeks ago

michelwi commented 3 weeks ago

Yesterday I started a training with DistributeFilesDataset and file caching. Today it crashed, and it consistently crashes again after restarting, with what I think is an OSError: AF_UNIX path too long in _TouchFilesThread.run.

The original files reside on a Ceph cluster and are cached to an LVM volume with an ext4 filesystem. The basenames should be of reasonable length, but the absolute paths can be quite long.

I attach one instance of each flavor of trace, in order of appearance:

OSError:

```
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
```
_TouchFilesThread:

```
Unhandled exception in thread <_MainThread(MainThread, started 139642623541696)>, proc 96008.

Thread <_TouchFilesThread(Thread-1, started daemon 139638043833920)>:
Process SyncManager-1:
  File "/usr/lib/python3.10/threading.py", line 324, in Condition.wait
    line: gotit = waiter.acquire(True, timeout)
    locals:
      gotit = False
      waiter =
      waiter.acquire =
      timeout = 1.0
  File "/usr/lib/python3.10/threading.py", line 607, in Event.wait
    line: signaled = self._cond.wait(timeout)
    locals:
      signaled = False
      self =
      self._cond =
      self._cond.wait =
      timeout = 1.0
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/util/file_cache.py", line 418, in _TouchFilesThread.run
    line: if self.stop.wait(self.interval):
    locals:
      self = <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.stop =
      self.stop.wait =
      self.interval = 1.0
  File "/usr/lib/python3.10/threading.py", line 1016, in Thread._bootstrap_inner
    line: self.run()
    locals:
      self = <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.run =
  File "/usr/lib/python3.10/threading.py", line 973, in Thread._bootstrap
    line: self._bootstrap_inner()
    locals:
      self = <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self._bootstrap_inner =
Thread current, main, <_MainThread(MainThread, started 140103104491968)>: (Excluded thread.)
```
Main:

```
EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main =
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 735, in main
    line: init(command_line_options=argv[1:])
    locals:
      init =
      command_line_options =
      argv = ['/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py', '/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/returnn/training/ReturnnTrainingJob.V7zBp10cMkXR/output/returnn.config'], _[0]: {len = 142}
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 487, in init
    line: init_engine()
    locals:
      init_engine =
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 294, in init_engine
    line: engine = Engine(config=config)
    locals:
      engine = None
      Engine =
      config =
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/torch/engine.py", line 60, in Engine.__init__
    line: self._mp_manager = torch.multiprocessing.Manager()
    locals:
      self =
      self._mp_manager = !AttributeError: 'Engine' object has no attribute '_mp_manager'
      torch =
      torch.multiprocessing =
      torch.multiprocessing.Manager =
  File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in BaseContext.Manager
    line: m.start()
    locals:
      m =
      m.start =
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in BaseManager.start
    line: self._address = reader.recv()
    locals:
      self =
      self._address = None
      reader =
      reader.recv =
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in _ConnectionBase.recv
    line: buf = self._recv_bytes()
    locals:
      buf =
      self =
      self._recv_bytes =
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in Connection._recv_bytes
    line: buf = self._recv(4)
    locals:
      buf =
      self =
      self._recv =
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in Connection._recv
    line: raise EOFError
    locals:
      EOFError = EOFError
EOFError
Unhandled exception in thread <_MainThread(MainThread, started 139836318601664)>, proc 96015.
```
torch.distributed:

```
E0703 09:18:03.509000 139702419329472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 96008) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 96009)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 96010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 96011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 96012)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 96013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 96014)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 96015)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 96008)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

EDIT: fixed torch.distributed trace

albertz commented 3 weeks ago

(The torch.distributed output is somehow messed up. Do you have a version which is not messed up?)

albertz commented 3 weeks ago

From your logs, it does not seem like the error is in _TouchFilesThread.run?

It seems the error occurred in some sub proc. Can you give details on what sub proc this is?

You posted the stack of _TouchFilesThread, but it seems all fine? That thread has not crashed and is just idling?

But the main thread of the RETURNN main proc crashed, specifically in Torch Engine.__init__, in this line:

self._mp_manager = torch.multiprocessing.Manager()

So, this looks very much outside of the scope of RETURNN? This is either pure Python related or PyTorch related (but I assume just pure Python; torch.multiprocessing is basically just the standard multiprocessing module).
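To illustrate: a minimal sketch (assuming Linux, where multiprocessing listeners default to AF_UNIX sockets created under tempfile.gettempdir()) that should reproduce the same pair of errors without RETURNN or PyTorch:

```python
# Sketch: reproduce "AF_UNIX path too long" with plain multiprocessing.
# Assumption: Linux with the default (fork) start method. The Manager server
# process binds an AF_UNIX socket at a path under the temp dir, and sun_path
# is limited to ~107 usable bytes, so a long enough temp dir breaks the bind.
import multiprocessing
import os
import tempfile

long_tmp = os.path.join("/tmp", "x" * 200)  # hypothetical overly long temp dir
os.makedirs(long_tmp, exist_ok=True)
os.environ["TMPDIR"] = long_tmp
tempfile.tempdir = None  # drop tempfile's cached temp dir so TMPDIR is re-read

if __name__ == "__main__":
    # The forked server process should die with "OSError: AF_UNIX path too long",
    # and this call should then surface an EOFError, as in the "Main" trace above.
    mgr = multiprocessing.Manager()
```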

albertz commented 3 weeks ago

I assume some bad environment configuration. Maybe weird ulimits or the like. Or a wrong tmp path. I wonder a bit where it tries to bind the socket (self._socket.bind(address)), i.e. what is address here? Can you get those local vars?
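One way to capture address (a sketch; this is essentially what the print_addr wrapper visible in a later trace in this thread does):

```python
# Sketch: wrap SocketListener.__init__ to print the address before the bind.
# On Linux the SyncManager server process is forked from the main proc, so
# patching this early (e.g. at the top of the config) also takes effect there.
import multiprocessing.connection as mpc

orig_init = mpc.SocketListener.__init__

def print_addr(self, address, *args, **kwargs):
    print("Initializing SocketListener with", address, flush=True)
    return orig_init(self, address, *args, **kwargs)

mpc.SocketListener.__init__ = print_addr
```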

NeoLegends commented 3 weeks ago

This is the relevant code from the stack trace where the read fails: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/managers.py#L554-L570

The read goes over a connection.Pipe(), which communicates over unix sockets: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L552-L556

and the address comes from a temp dir: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L70-L82

which is created like this: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/util.py#L133-L145

For my account on the node the temp dir path does not look excessive, so maybe some strange user config @michelwi? Apparently what's failing here is the child process trying to bind to that socket.
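A quick check along those lines (a sketch; arbitrary_address is the stdlib-internal helper linked above, so this mimics exactly what the Listener would do):

```python
# Sketch: print the socket address multiprocessing would pick and compare it
# against the Linux AF_UNIX sun_path limit (~107 usable bytes).
from multiprocessing.connection import arbitrary_address

addr = arbitrary_address("AF_UNIX")
print(f"{addr} ({len(addr)} chars)")
if len(addr) > 107:
    print("too long -> bind() would raise: OSError: AF_UNIX path too long")
```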

albertz commented 3 weeks ago

Note the temp dir logic of Python, i.e. where it would create those temp files/dirs:

```python
def _candidate_tempdir_list():
    """Generate a list of candidate temporary directories which
    _get_default_tempdir will try."""

    dirlist = []

    # First, try the environment.
    for envname in 'TMPDIR', 'TEMP', 'TMP':
        dirname = _os.getenv(envname)
        if dirname: dirlist.append(dirname)

    # Failing that, try OS-specific locations.
    if _os.name == 'nt':
        dirlist.extend([ _os.path.expanduser(r'~\AppData\Local\Temp'),
                         _os.path.expandvars(r'%SYSTEMROOT%\Temp'),
                         r'c:\temp', r'c:\tmp', r'\temp', r'\tmp' ])
    else:
        dirlist.extend([ '/tmp', '/var/tmp', '/usr/tmp' ])

    # As a last resort, the current directory.
    try:
        dirlist.append(_os.getcwd())
    except (AttributeError, OSError):
        dirlist.append(_os.curdir)

    return dirlist
```

So, is TMPDIR (or one of the other env vars) set? Where does it point to? If not, you should set it, preferring /var/tmp.
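A quick way to verify that (a sketch):

```python
# Sketch: print the env vars tempfile consults and the dir it resolves to.
import os
import tempfile

for name in ("TMPDIR", "TEMP", "TMP"):
    print(name, "=", os.environ.get(name))
print("tempfile.gettempdir() ->", tempfile.gettempdir())
```

If it resolves to something long, setting e.g. TMPDIR=/var/tmp in the environment before launching should keep the socket path short.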

albertz commented 3 weeks ago

But in any case, what is address actually? Then we just know and don't need to assume.

NeoLegends commented 3 weeks ago

I just got bitten by the same error in a training not using the new dataset or caching mechanism.

albertz commented 3 weeks ago

I just got bitten by the same error in a training not using the new dataset or caching mechanism.

So what is address in your case?

michelwi commented 3 weeks ago

what is address actually?

Sorry, I could not figure it out. Then my node was rebooted, and the error is gone again. As I cannot reproduce it anymore, it would be fine by me to close this issue for now. But if @NeoLegends wants to keep debugging, we can keep it open.

NeoLegends commented 3 weeks ago

Got the address. It seems reproducible for me on g-16. I don't yet know the root cause of why it's behaving strangely.

tempfile._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted], leading to a path like:

```
Initializing SocketListener with /nas/models/asr/mgunz/2024-06-24--[redacted]/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/work/pymp-zwqk0y_y/listener-kxe8tmti
Process SyncManager-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/home/mgunz/setups/2024-06-24--[redacted]/work/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/output/returnn.config", line 12, in print_addr
    return orig_init(self, address, *args, **kwargs)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
```

```
>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']
```

I'm not sure why it won't take e.g. /var/tmp. I checked from within the container; it is writable for me:

```
>>> os.open("/var/tmp/test2", os.O_RDWR | os.O_CREAT | os.O_EXCL)
3
```

Or is this something w.r.t. the child processes lacking some rights that the parent process has?

I'm not sure what this is yet.

albertz commented 3 weeks ago

tempfile._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted]

Yea that's a problem. It should definitely not use that.

>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']

I'm not sure why it won't take e.g. /var/tmp.

Yea, that's strange. We should find out. E.g. just step-by-step debug through it.
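For that step-by-step debugging, a sketch replicating the per-directory probe that tempfile._get_default_tempdir runs (as of CPython 3.10). Note it does not just open a file like the os.open test above: it also writes to it, and any OSError silently skips the candidate, so printing the exception shows why a directory gets rejected:

```python
# Sketch of tempfile._get_default_tempdir's per-directory probe (CPython 3.10):
# create a file with O_EXCL, write b"blat" to it, then unlink it; any OSError
# means the candidate directory is rejected.
import os

def probe(dirname):
    name = os.path.join(dirname, "tmp-probe-check")  # hypothetical probe file name
    try:
        fd = os.open(name, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, b"blat")  # the stdlib probe writes exactly this payload
        finally:
            os.close(fd)
            os.unlink(name)
        return "ok"
    except OSError as exc:
        return f"rejected: {exc!r}"

for d in ["/tmp", "/var/tmp", "/usr/tmp"]:
    print(d, "->", probe(d))
```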