natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

all errors reported as FSD_ERRNO_INTERNAL_ERROR #1

Closed tbooth closed 6 years ago

tbooth commented 7 years ago

Hi,

Thanks for merging my previous fix. This one is in a similar vein.

On line 134 of slurm_drmaa/job.c, any problem when updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The specific issue here is that the caller would like to know whether the error is intermittent (e.g. a network time-out), in which case the job status can probably be queried successfully in a few minutes, or whether the problem is terminal and the job is dead. I've prepared a complementary patch to Snakemake to handle FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and to keep polling the job.

Really, the DRMAA library should make a better attempt to convert SLURM errors to meaningful DRMAA error codes, but this is a start.

Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.

*** tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c.orig 2016-11-04 15:09:49.000000000 +0000
--- tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c  2017-06-09 15:05:38.000000000 +0100
***************
*** 131,138 ****

            if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
                self->on_missing(self);
!           } else {
!               fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(slurm_get_errno()), self->job_id);
            }
        }
        if (job_info) {
--- 131,150 ----

            if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
                self->on_missing(self);
!           } else
!                 // We should detect the error corresponding to "Socket timed out" and report
!                 // it explicitly as FSD_ERRNO_TIMEOUT or maybe FSD_ERRNO_DRM_COMMUNICATION_FAILURE
!                 // ( I'm not sure if FSD_ERRNO_TIMEOUT is the same as DRMAA_ERRNO_EXIT_TIMEOUT,
!                 //   which simply indicates the job is still running?? Maybe we should try it and see. )
!                 // To see what _slurm_errno corresponds to which message let's look at
!                 // 'slurm_strerror' in the slurm source code...
!                 //   https://github.com/SchedMD/slurm/blob/master/src/common/slurm_errno.c
!             if ( _slurm_errno == SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT ||
!                  _slurm_errno == SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR
!                ) {
!                 fsd_exc_raise_fmt(FSD_ERRNO_DRM_COMMUNICATION_FAILURE,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
!             } else {
!               fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
            }
        }
        if (job_info) {

Cheers,

TIM

natefoo commented 6 years ago

FSD_ERRNO_TIMEOUT becomes DRMAA_ERRNO_EXIT_TIMEOUT and is used when calling e.g. drmaa_wait() with a timeout specified and that timeout is reached. I believe that FSD_ERRNO_DRM_COMMUNICATION_FAILURE is the correct error code.

Thanks for the patch!
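
For anyone handling this on the caller side, here is a rough sketch of the retry logic discussed above: keep polling while the status query reports a communication failure, and stop on success or on any other error. It is deliberately standalone and does not link against DRMAA. The `query_result` codes, the `query_fn` callback, `poll_job` and `flaky_query` are all hypothetical names made up for this illustration; a real caller would check for DRMAA_ERRNO_DRM_COMMUNICATION_FAILURE from drmaa.h instead and would sleep between attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in result codes for this sketch only (hypothetical, not from drmaa.h). */
typedef enum {
    QUERY_OK,             /* status was retrieved */
    QUERY_COMM_FAILURE,   /* transient fault, worth retrying */
    QUERY_INTERNAL_ERROR  /* anything else, treated as fatal */
} query_result;

/* Hypothetical callback that asks the DRM for the job status once. */
typedef query_result (*query_fn)(const char *job_id, void *ctx);

/*
 * Query the job status up to max_attempts times. Retry only on
 * QUERY_COMM_FAILURE; return immediately on success or on any other
 * error. The number of queries actually made is stored in
 * *attempts_used when that pointer is not NULL.
 */
static query_result poll_job(const char *job_id, query_fn query, void *ctx,
                             int max_attempts, int *attempts_used)
{
    query_result r = QUERY_INTERNAL_ERROR;
    int made = 0;

    assert(query != NULL);

    while (made < max_attempts) {
        r = query(job_id, ctx);
        made++;
        if (r != QUERY_COMM_FAILURE)
            break;
        /* A real caller would sleep here before trying again. */
    }

    if (attempts_used != NULL)
        *attempts_used = made;
    return r;
}

/*
 * Demo query for trying the loop out: reports a communication failure
 * until the counter behind ctx reaches zero, then reports success.
 */
static query_result flaky_query(const char *job_id, void *ctx)
{
    int *failures_left = ctx;

    (void)job_id; /* unused in the demo */
    if (*failures_left > 0) {
        (*failures_left)--;
        return QUERY_COMM_FAILURE;
    }
    return QUERY_OK;
}
```

With `flaky_query` set up to fail twice, `poll_job` returns `QUERY_OK` on the third attempt. If the failures outlast `max_attempts`, it returns `QUERY_COMM_FAILURE`, so the caller can decide whether to give up or keep waiting.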

natefoo commented 6 years ago

Fixed in 83fc28856134cebaf73e8c605230e68fc8f1d420