natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

segfault with unresponsive/timeout socket #7

Closed unode closed 6 years ago

unode commented 6 years ago

If libdrmaa fails to contact SLURM because of system overload, a temporary network interruption, or a timeout, a "socket error" is sometimes reported and is immediately followed by a segfault. The most likely cause is that the error is not handled properly.

natefoo commented 6 years ago

This may also have been fixed by a6f33912f28a0d6e97a6df062eaa671c23ea5c8b. If not, it'd be helpful to know specific cases where this occurs. Each place in the code where it can occur has to handle the error condition separately.
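As a rough illustration of what "handling the error condition separately" at a call site can look like, here is a minimal sketch in C. It does not use the real Slurm API or slurm-drmaa code: `fake_load_job`, `fake_job_info`, `get_job_state`, and the `FAKE_*` codes are hypothetical stand-ins that simulate a controller query failing (for example on a socket timeout) and leaving the output pointer unset. The point is only that the caller checks both the return code and the pointer before dereferencing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical result record; NOT a real Slurm type. */
typedef struct {
    int job_id;
    int state;
} fake_job_info;

enum { FAKE_OK = 0, FAKE_ERR = -1 };

/*
 * Hypothetical stand-in for a controller query. When `reachable` is 0 it
 * simulates a socket error: it returns FAKE_ERR and leaves *out untouched.
 */
int fake_load_job(int reachable, int job_id, fake_job_info **out)
{
    assert(out != NULL);
    if (!reachable)
        return FAKE_ERR;

    fake_job_info *info = malloc(sizeof *info);
    if (info == NULL)
        return FAKE_ERR;

    info->job_id = job_id;
    info->state = 1; /* arbitrary "running" state for the sketch */
    *out = info;
    return FAKE_OK;
}

/*
 * Caller that handles the failure locally: it returns -1 instead of
 * dereferencing a result pointer that was never filled in.
 */
int get_job_state(int reachable, int job_id)
{
    fake_job_info *info = NULL;

    if (fake_load_job(reachable, job_id, &info) != FAKE_OK || info == NULL) {
        fprintf(stderr, "job %d: query failed, result not used\n", job_id);
        return -1;
    }

    int state = info->state;
    free(info);
    return state;
}
```

With this sketch, `get_job_state(1, 42)` returns the simulated state, while `get_job_state(0, 42)` returns -1 rather than crashing. Every call site that consumes a result needs an equivalent check, which is why a single fix elsewhere may not cover all cases.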

unode commented 6 years ago

This one is hard to reproduce... So far I've only seen it when the submission node or the network is overloaded. Reproducing it might mean a couple of angry emails from local IT :smile:.

Feel free to close. I'll re-open if I see it again. Thanks for all the fixes.

natefoo commented 6 years ago

I don't doubt that there are cases where this occurs. Thanks for reporting. If I get a chance, I'll audit all the Slurm calls to make sure each one handles errors.