michelwi opened this issue 4 months ago
(The torch.distributed output is somehow messed up. Do you have a version which is not messed up?)
From your logs, it does not seem like the error is in `_TouchFilesThread.run`? It seems the error occurred in some sub proc. Can you give details on what sub proc this is?

You posted the stack of `_TouchFilesThread`, but it seems all fine? This one is not crashed and just idling?
But the main thread of the RETURNN main proc crashed, specifically in the Torch `Engine.__init__`, in this line:

```python
self._mp_manager = torch.multiprocessing.Manager()
```

So, this looks very much outside of the scope of RETURNN? This is either pure Python related or PyTorch related (but I assume just pure Python; `torch.multiprocessing` is basically just the standard `multiprocessing` module).
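(A quick way to check that, as a sketch: if I remember correctly, `torch.multiprocessing` does `from multiprocessing import *` and mainly adds tensor-aware reducers, so the `Manager` factory should be the stdlib one.)

```python
import multiprocessing
import torch.multiprocessing

# torch.multiprocessing re-exports the stdlib multiprocessing API,
# so Manager should be the standard SyncManager factory.
print(torch.multiprocessing.Manager is multiprocessing.Manager)  # expected: True
```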
I assume some bad environment configuration. Maybe weird ulimits or so. Or a wrong tmp path. I wonder a bit where it tries to bind the socket (`self._socket.bind(address)`), i.e. what is `address` here? Can you get those local vars?
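For example, one could monkey-patch the listener to log the address before binding (a minimal sketch; `print_addr` is just an illustrative name, and the patch must run before the Manager is created, e.g. at the top of the config):

```python
import multiprocessing.connection as _mpc

orig_init = _mpc.SocketListener.__init__

def print_addr(self, address, *args, **kwargs):
    # Log the unix-socket path that the manager server will try to bind to.
    print("Initializing SocketListener with", address)
    return orig_init(self, address, *args, **kwargs)

_mpc.SocketListener.__init__ = print_addr
```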
This is the relevant code from the stack trace where the read fails: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/managers.py#L554-L570
The read goes over a `connection.Pipe()`, which communicates over unix sockets: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L552-L556
and the address comes from a temp dir: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L70-L82
which is created like this: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/util.py#L133-L145
For my account on the node the temp dir path does not look excessive, so maybe some strange user config @michelwi? Apparently what's failing here is the child process trying to bind to that socket.
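To see which address the current environment would produce, without starting a whole Manager, one can call the same internal helper multiprocessing uses (a sketch; `arbitrary_address` and `get_temp_dir` are private APIs and may change between Python versions):

```python
from multiprocessing import connection, util

# AF_UNIX addresses look like <tempdir>/pymp-XXXXXXXX/listener-XXXXXXXX,
# where <tempdir> ultimately comes from tempfile.gettempdir().
print(util.get_temp_dir())
print(connection.arbitrary_address("AF_UNIX"))
```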
Note, the temp dir logic of Python, i.e. where it would create those temp files/dirs:
```python
def _candidate_tempdir_list():
    """Generate a list of candidate temporary directories which
    _get_default_tempdir will try."""

    dirlist = []

    # First, try the environment.
    for envname in 'TMPDIR', 'TEMP', 'TMP':
        dirname = _os.getenv(envname)
        if dirname: dirlist.append(dirname)

    # Failing that, try OS-specific locations.
    if _os.name == 'nt':
        dirlist.extend([ _os.path.expanduser(r'~\AppData\Local\Temp'),
                         _os.path.expandvars(r'%SYSTEMROOT%\Temp'),
                         r'c:\temp', r'c:\tmp', r'\temp', r'\tmp' ])
    else:
        dirlist.extend([ '/tmp', '/var/tmp', '/usr/tmp' ])

    # As a last resort, the current directory.
    try:
        dirlist.append(_os.getcwd())
    except (AttributeError, OSError):
        dirlist.append(_os.curdir)

    return dirlist
```
So, is `TMPDIR` (or one of the other env vars) set? Where does it point to? If not, you should set it, and prefer `/var/tmp`.
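For example (a sketch; the env var must be set before `tempfile` first computes its default, otherwise the cached value needs a reset):

```python
import os
import tempfile

os.environ["TMPDIR"] = "/var/tmp"
tempfile.tempdir = None  # drop the cached default so gettempdir() re-evaluates
print(tempfile.gettempdir())  # expected: /var/tmp
```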
But in any case, what is `address` actually? So we just know, and don't need to assume.
I just got bitten by the same error in a training not using the new dataset or caching mechanism.
> I just got bitten by the same error in a training not using the new dataset or caching mechanism.
So what is `address` in your case?
> what is `address` actually?
Sorry, I could not figure it out. Then my node was rebooted and the error is gone again. As I cannot reproduce it any more, it would be fine by me to close this issue for now. But if @NeoLegends wants to keep debugging, we can keep it open.
Got the address. It seems reproducible for me on g-16. I don't know the root cause of the strange behavior yet.
`tempfile._get_default_tempdir` returns the project folder, i.e. `/home/mgunz/setups/2024-06-24--[redacted]`, leading to a path like:

```
Initializing SocketListener with /nas/models/asr/mgunz/2024-06-24--[redacted]/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/work/pymp-zwqk0y_y/listener-kxe8tmti
```
```
Process SyncManager-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/home/mgunz/setups/2024-06-24--[redacted]/work/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/output/returnn.config", line 12, in print_addr
    return orig_init(self, address, *args, **kwargs)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
```
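For context (not from the log, but a known kernel limit): on Linux, the `sun_path` buffer in `struct sockaddr_un` is 108 bytes including the trailing NUL, so a socket path beyond ~107 bytes cannot be bound. A quick length check against a path like the one above (a sketch; the redacted placeholder stands in for the real path):

```python
# sun_path on Linux is 108 bytes including the trailing NUL,
# so usable AF_UNIX paths top out at ~107 bytes.
AF_UNIX_PATH_MAX = 107

addr = (
    "/nas/models/asr/mgunz/2024-06-24--[redacted]/i6_core/returnn/training/"
    "ReturnnTrainingJob.mFRyGSpd0CuH/work/pymp-zwqk0y_y/listener-kxe8tmti"
)
n = len(addr.encode())
print(n, "bytes; too long:", n > AF_UNIX_PATH_MAX)
```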
```
>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']
```

I'm not sure why it won't take e.g. `/var/tmp`. I checked from the container, it is writable for me:

```
>>> os.open("/var/tmp/test2", os.O_RDWR | os.O_CREAT | os.O_EXCL)
3
```
Or is this something w.r.t. the child processes lacking some rights the parent process has? I'm not sure what this is yet.
> `tempfile._get_default_tempdir` returns the project folder, i.e. `/home/mgunz/setups/2024-06-24--[redacted]`

Yea, that's a problem. It should definitely not use that.

> ```
> >>> _candidate_tempdir_list()
> ['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']
> ```
>
> I'm not sure why it won't take e.g. `/var/tmp`.

Yea, that's strange. We should find out. E.g. just step-by-step debug through it.
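A way to do that without attaching a debugger would be to replicate the probe that `_get_default_tempdir` performs for each candidate and print why a candidate gets rejected (a sketch loosely following the CPython implementation; the fixed probe filename is a simplification):

```python
import errno
import os
import tempfile

# Mimic tempfile._get_default_tempdir: try to create, write, and delete a
# file in each candidate directory, and report why a candidate is skipped.
for d in tempfile._candidate_tempdir_list():
    name = os.path.join(d, "tmpdir-probe-test")
    try:
        fd = os.open(name, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, b"blat")
        finally:
            os.close(fd)
            os.unlink(name)
        print(f"{d!r}: OK")
    except OSError as e:
        code = errno.errorcode.get(e.errno, e.errno)
        print(f"{d!r}: rejected ({code}: {e})")
```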
Yesterday I started a training with DistributeFilesDataset and file caching, which today crashed, and consistently crashes after restarting, with what I think is `OSError: AF_UNIX path too long` in `_TouchFilesThread.run`. The original files reside on a Ceph cluster and are cached to an LVM volume with an ext4 filesystem. The basename should be within reasonable length, but the absolute path can be quite long.

I try to attach one version of each flavor of traces, in order of appearance:
**OSError**

```
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
```

**_TouchFilesThread**

```
Unhandled exception
```

**Main**

```
EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py", line 11, in
```

**torch.distributed**

```
E0703 09:18:03.509000 139702419329472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 96008) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in
```

EDIT: fixed torch.distributed trace