unode closed this issue 6 years ago
This may also have been fixed by a6f33912f28a0d6e97a6df062eaa671c23ea5c8b. If not, it'd be helpful to know specific cases where this occurs. Each place in the code where it can occur has to handle the error condition separately.
This one is hard to reproduce... So far I've only seen it when the submission node or the network is overloaded. Reproducing it might mean a couple of angry emails from local IT :smile:.
Feel free to close. I'll re-open if I see it again. Thanks for all the fixes.
I don't doubt that there are cases where this occurs; thanks for reporting. If I get a chance, I'll audit all the SLURM calls to make sure they all handle errors.
If libdrmaa fails to contact SLURM because of system overload, a temporary network interruption, or a timeout, a "socket error" is sometimes reported and is immediately followed by a segfault. This is likely due to improper handling of the error.
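For illustration only: one common way a failed call turns into a segfault is dereferencing a result pointer without checking whether the call succeeded. The thread does not confirm that this is the cause here. The sketch below uses hypothetical names throughout (`job_info_t`, `query_job`, `get_job_state`, `failures_left`); none of them come from libdrmaa or SLURM, and `query_job` only simulates a transient "socket error" by returning `NULL`.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical result record; stands in for whatever the real
 * scheduler call would return. Not a libdrmaa or SLURM type. */
typedef struct {
    int job_state;
} job_info_t;

/* Number of upcoming calls that should fail, to mimic a transient
 * "socket error" on an overloaded node or network. */
int failures_left = 0;

static job_info_t canned_result = { 42 };

/* Hypothetical stand-in for a call into the scheduler.
 * Returns NULL on failure, a valid pointer on success. */
job_info_t *query_job(void) {
    if (failures_left > 0) {
        failures_left--;
        fprintf(stderr, "socket error\n");
        return NULL;
    }
    return &canned_result;
}

/* Tries the call up to max_attempts times.
 * Returns 0 and stores the state in *state on success.
 * Returns -1 and leaves *state untouched if every attempt fails.
 * The result is only dereferenced after the NULL check. */
int get_job_state(int max_attempts, int *state) {
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        job_info_t *info = query_job();
        if (info != NULL) {
            *state = info->job_state;
            return 0;
        }
    }
    return -1;
}
```

A caller would check the return value of `get_job_state` and report the failure instead of continuing with an invalid result. The retry loop is a design choice for this sketch, not something the thread prescribes; propagating the error after a single failed attempt would avoid the crash equally well.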