tbooth closed this 6 years ago
FSD_ERRNO_TIMEOUT becomes DRMAA_ERRNO_EXIT_TIMEOUT and is used when calling e.g. drmaa_wait() with a timeout specified and that timeout is reached. I believe that FSD_ERRNO_DRM_COMMUNICATION_FAILURE is the correct error code.
Thanks for the patch!
Fixed in 83fc28856134cebaf73e8c605230e68fc8f1d420
Hi,
Thanks for merging my previous fix. This one is in a similar vein.
On line 134 of slurm_drmaa/job.c, any problem when updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The specific issue here is that the caller would like to know whether the error is intermittent (e.g. a network timeout), in which case the job status can possibly be queried successfully in a few minutes, or whether the problem is terminal and the job is dead. I've prepared a complementary patch to Snakemake to handle FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and to keep polling the job.
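To illustrate the caller-side behaviour I have in mind, here is a rough sketch. It is not the actual Snakemake patch: the exception classes, the `query_status` callable, and the retry parameters are all hypothetical stand-ins rather than the real DRMAA bindings. The idea is only that a communication failure is retried while any other error is treated as terminal.

```python
import time


class CommunicationFailure(Exception):
    """Hypothetical stand-in for a DRM communication failure (intermittent)."""


class InternalError(Exception):
    """Hypothetical stand-in for an internal error (treated as terminal)."""


def poll_job_status(query_status, job_id, max_retries=5, delay=60.0, sleep=time.sleep):
    """Query the job status, retrying only on intermittent failures.

    query_status is any callable taking a job id; it is an assumption of
    this sketch, not part of a real DRMAA API.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return query_status(job_id)
        except CommunicationFailure:
            # Intermittent fault: the job may still be alive, so wait and retry.
            if attempt == max_retries:
                raise  # Give up after the last allowed attempt.
            sleep(delay)
        # Any other exception (e.g. InternalError) propagates immediately,
        # so the caller can treat the job as dead.
```

With the current behaviour, where everything arrives as an internal error, a caller like this has no way to tell the two cases apart, which is why the more specific error code matters.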
Really, the DRMAA library should make a better attempt to convert SLURM errors to meaningful DRMAA error codes, but this is a start.
Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.
Cheers,
TIM