natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

"unknown signal?!" reported from JobInfo terminatedSignal #26

Open EricR86 opened 5 years ago

EricR86 commented 5 years ago

Hello,

I'm not sure if this is the origin of this particular bug, but I have not been able to reproduce this error on other DRMAA implementations.

I've submitted jobs to my SLURM 18.08 system where, occasionally, I get a reported "unknown signal?!". The exact same job, when resubmitted, may or may not have this issue. I cannot track down exactly what happens when this occurs or what causes it.

I have run strace on equivalent submitted jobs, one that reports the "unknown signal?!" and one that exits normally, and I cannot find any discernible difference between them, even when tracing specifically for signals.

sacct reports nothing unusual, and actually seems to indicate that the job exited without issue. The sysadmin for our cluster system seems to agree and cannot find any issue.
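For anyone who wants to repeat the sacct comparison, here is a minimal sketch of how it could be scripted. It assumes the ExitCode column was fetched with something like `sacct -j <jobid> --format=JobID,State,ExitCode --parsable2 --noheader`, where the field has the form `<exit>:<signal>`. The helper names (`parse_sacct_exit_code`, `disagrees_with_drmaa`) are hypothetical, and `has_signal` is assumed to be the `hasSignal` value from the DRMAA JobInfo. This is not part of slurm-drmaa.

```python
from typing import NamedTuple, Optional


class SacctExit(NamedTuple):
    """Parsed form of an sacct ExitCode field ("<exit>:<signal>")."""
    exit_code: int
    signal: int


def parse_sacct_exit_code(field: str) -> Optional[SacctExit]:
    """Parse an ExitCode field such as "0:0"; return None if it is malformed."""
    parts = field.strip().split(":")
    if len(parts) != 2:
        return None
    try:
        return SacctExit(int(parts[0]), int(parts[1]))
    except ValueError:
        return None


def disagrees_with_drmaa(field: str, has_signal: bool) -> bool:
    """Hypothetical check: True when DRMAA and sacct disagree about a signal.

    `has_signal` is assumed to come from the DRMAA JobInfo (hasSignal).
    """
    parsed = parse_sacct_exit_code(field)
    if parsed is None:
        return False  # nothing to compare against
    return has_signal != (parsed.signal != 0)
```

In the failing case described here, sacct shows `0:0` while DRMAA reports a signal, so `disagrees_with_drmaa("0:0", True)` returns `True`.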

This could be a cluster-specific issue, a DRMAA issue, or something else entirely. If I'm looking in the wrong place, please redirect me. I'm not sure where or how to start tracking this down.

Thanks for your time.