Open reyjul opened 5 years ago
This is a bug in job.c, in slurmdrmaa_find_job_info()/slurmdrmaa_job_on_missing(). Some time after a job finishes, slurm_load_job() returns an error. That is expected Slurm behavior. I'm not sure what on_missing() is supposed to do, but the code continues as though job_info had been filled in, which is not the case. I suspect it should throw an error here. Alternatively, on_missing() could get the job information from slurmdb. Unfortunately, I'm not familiar enough with this library or Slurm to fix it right now.
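To illustrate the failure mode described above, here is a minimal, self-contained sketch of the "check the load result before using the struct" pattern. It is not the slurm-drmaa code: `demo_job_info`, `demo_load_job()` and `demo_find_job_info()` are hypothetical stand-ins, and the "only job 1 is still known" rule just simulates a job record that has been purged.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a job record; NOT the real Slurm job_info type. */
typedef struct {
    unsigned job_id;
    int exit_code;
} demo_job_info;

enum { DEMO_OK = 0, DEMO_ERR_MISSING = -1 };

/* Hypothetical loader simulating a controller that has purged old records:
 * only job 1 is still known. On failure, *out is left untouched. */
int demo_load_job(unsigned job_id, demo_job_info *out)
{
    assert(out != NULL);
    if (job_id != 1)
        return DEMO_ERR_MISSING;
    out->job_id = job_id;
    out->exit_code = 0;
    return DEMO_OK;
}

/* Zeroes *out, tries to load the record, and reports failure to the caller
 * instead of continuing as though *out had been filled in. */
int demo_find_job_info(unsigned job_id, demo_job_info *out)
{
    assert(out != NULL);
    memset(out, 0, sizeof *out);

    int rc = demo_load_job(job_id, out);
    if (rc != DEMO_OK) {
        fprintf(stderr, "job %u: record no longer available\n", job_id);
        return rc;
    }
    return DEMO_OK;
}
```

A caller would check the return value of `demo_find_job_info()` before reading any field of the struct. Whether the real fix should raise an error or fall back to the accounting database is a design decision for the library maintainers.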
Hi,
When I run bulk jobs with one of the jobs lasting more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job is finished and then triggers a segfault:
Same happens if I loop through the job ids with the s.wait() function:
However it works perfectly fine if jobs finish in the same order as their SLURM_ARRAY_TASK_ID:
No problem if jobs last only 10 minutes:
I came up with this little piece of code to bypass the bug:
Yields: