natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

Try to retrieve job status from accounting. #39

holtgrewe closed 2 years ago

holtgrewe commented 4 years ago

At least when Slurm accounting information is stored in a MySQL database, completed jobs are not available through the slurm.h functionality. Instead, the exit code has to be retrieved through the slurmdb.h functionality.

This patch changes the "assume failed if no job was found" fallback in slurmdrmaa_job_on_missing() so that it looks the job up in slurmdb instead.
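
A rough sketch of the fallback described above, not the actual patch. All names here (`acct_record`, `acct_lookup_fn`, `outcome_for_missing_job`, `demo_lookup`) are hypothetical stand-ins; the real change works on slurm-drmaa's job structures and the slurmdb.h API, neither of which is reproduced here. The idea is only: if the job is missing from the controller, ask accounting first, and fall back to "assume failed" when accounting does not know the job either.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job outcome, standing in for the DRMAA job state. */
typedef enum { JOB_DONE, JOB_FAILED } job_outcome;

/* Hypothetical accounting record; only the field this sketch needs. */
typedef struct {
    int exit_code;
} acct_record;

/* Hypothetical lookup callback: returns true and fills *rec when the
 * job is found in accounting, false otherwise. */
typedef bool (*acct_lookup_fn)(unsigned job_id, acct_record *rec);

/* Decide the outcome of a job that the controller no longer knows. */
job_outcome outcome_for_missing_job(unsigned job_id, acct_lookup_fn lookup,
                                    int *exit_code)
{
    acct_record rec;

    if (lookup != NULL && lookup(job_id, &rec)) {
        /* Accounting knows the job: use its recorded exit code. */
        *exit_code = rec.exit_code;
        return rec.exit_code == 0 ? JOB_DONE : JOB_FAILED;
    }

    /* Previous behaviour: nothing found anywhere, assume failure. */
    *exit_code = -1;
    return JOB_FAILED;
}

/* Demo stand-in for an accounting query (assumption, not Slurm code):
 * job 42 exited with 0, job 43 exited with 3, nothing else is known. */
bool demo_lookup(unsigned job_id, acct_record *rec)
{
    if (job_id == 42) { rec->exit_code = 0; return true; }
    if (job_id == 43) { rec->exit_code = 3; return true; }
    return false;
}
```

Passing `NULL` as the lookup keeps the old "assume failed" behaviour, which is one way an opt-in switch could be wired in.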

holtgrewe commented 4 years ago

@natefoo ping ;)

holtgrewe commented 3 years ago

@natefoo ping

holtgrewe commented 3 years ago

ping

natefoo commented 3 years ago

Thanks! I will try to review this ASAP. I'm wondering how the Slurm commands do this, however. Job information should be available from slurmctld for at least the value of MinJobAge, which defaults to 300 seconds. I am not sure that it's correct to have slurm-drmaa query SlurmDBD directly, and it is a significant change from current behavior.

holtgrewe commented 3 years ago

Hm, maybe this could be activated with an environment variable? For other schedulers such as Grid Engine, there is no distinction between the scheduler's knowledge of jobs and accounting, so this would homogenize the behaviour of DRMAA between schedulers.

natefoo commented 3 years ago

I like the idea of an environment variable, or the config file could be used. I agree that this feature would be nice to have, especially since the DRMAA abstraction breaks down when you have to go to the DRM tools to do things (we essentially do what you're doing here in our application, so it's not as if this isn't a necessary function!).

holtgrewe commented 2 years ago

@natefoo thanks for the feedback and sorry for the delay. I have added an environment variable check and rebased onto current master. What do you think?
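
For illustration, an environment variable check of this kind could look roughly like the sketch below. The helper names and the rule that an unset, empty, or "0" value means "off" are assumptions made for this example, not necessarily what the patch does, and the thread does not name the variable.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Interpret the value of an opt-in flag (hypothetical helper).
 * NULL (unset), "" and "0" count as disabled; anything else enables. */
bool flag_value_enabled(const char *value)
{
    if (value == NULL || value[0] == '\0')
        return false;
    return strcmp(value, "0") != 0;
}

/* Check whether the opt-in environment variable `name` is enabled. */
bool env_flag_enabled(const char *name)
{
    return flag_value_enabled(getenv(name));
}
```

The accounting lookup would then only run when `env_flag_enabled(...)` returns true for the chosen variable, so the default behaviour stays unchanged.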

natefoo commented 2 years ago

I have a fix for the test error in a follow-up; it only occurs on older Slurm versions.