Closed lmoureaux closed 2 months ago
Thanks for the info @lmoureaux !
IMHO, this is indeed an odd constraint of unix sockets (the limit is 107 bytes if I recall correctly).
It might depend on the actual cluster configuration, but afaik slurm jobs use the submission directory (usually transparently seen by the submission node) as the base for temporary files, which
That's why law changes the default to a place that is cleaned up after the job terminates, but I think we should change that behavior for slurm.
A temporary fix on your side could be along these lines: https://github.com/columnflow/columnflow/blob/835020b57280947f500e232241b014131735d8f2/columnflow/tasks/framework/remote.py#L813-L818
However, I will make sure something similar is done centrally within law.
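Until that lands, a user-side workaround could look roughly like the sketch below. It is only an illustration under assumptions and not necessarily what the linked columnflow code does: the helper name `use_short_tmpdir` and the `/tmp` base directory are hypothetical, and it has to run before `multiprocessing` creates its first temporary directory, since that location is cached per process.

```python
import os
import tempfile


def use_short_tmpdir(base: str = "/tmp", prefix: str = "job_") -> str:
    """Create a short temporary directory and make it this process's default.

    Hypothetical helper: `base` is assumed to exist and be writable on the node.
    """
    short_dir = tempfile.mkdtemp(prefix=prefix, dir=base)
    # Child processes inherit the environment, so they pick up the short path too.
    os.environ["TMPDIR"] = short_dir
    # tempfile caches the directory it resolved earlier; clearing the cache makes
    # the next tempfile.gettempdir() call read TMPDIR again.
    tempfile.tempdir = None
    return short_dir
```

Calling `use_short_tmpdir()` at the very start of the job payload keeps socket paths short, at the price of having to clean up that directory yourself once the job is done.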
The fix I referenced above is generic enough, so I simply moved it to the slurm base workflow. Feel free to reopen the issue in case the problem persists :+1:
Thanks! I agree that the limit on UNIX sockets is weird, but it's baked into the socket address structure `sockaddr_un`, and it would be a major change to make it longer... For backward compatibility one would basically need to duplicate the whole API.
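To see the limit in action, here is a minimal sketch that tries to bind a UNIX socket inside a given directory. On Linux the `sun_path` field of `sockaddr_un` is a 108-byte array including the terminating null byte; other platforms use slightly different sizes. The function name `can_bind_unix_socket` and the socket file name are made up for this illustration.

```python
import os
import socket
import tempfile


def can_bind_unix_socket(directory: str, name: str = "listener-sock") -> bool:
    """Return True if a UNIX socket can be bound at directory/name."""
    path = os.path.join(directory, name)
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        # Raised, among other reasons, when the path does not fit into sun_path.
        return False
    else:
        return True
    finally:
        sock.close()
        # Binding leaves a socket file behind; remove it if it was created.
        if os.path.exists(path):
            os.unlink(path)
```

A deeply nested job directory easily pushes the full path past the limit, while a directory directly under `/tmp` normally stays well below it.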
Bug description
On slurm, `law` explicitly changes `TMPDIR` to a place within the job's own directory. This breaks packages relying on the default arguments of `multiprocessing.connection.Listener`, because this involves creating a UNIX socket in `TMPDIR`, and socket paths can only be so long.

It seems `multiprocessing.Queue` exposes the issue, but I cannot 100% confirm.

Example stack trace (followed by a deadlock):