reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Query for the pending job reason using `squeue` may fail and erroneously report the test job as a failure #3122

Closed by vkarak 5 months ago

vkarak commented 6 months ago

Here's the test's stderr:

  * Reason: spawned process error: command 'squeue -h -j 12962 -o %r' failed with exit code 1:
--- stdout ---
--- stdout ---
--- stderr ---
slurm_load_jobs error: Invalid job id specified
--- stderr ---

In the past, squeue didn't fail if the passed job id didn't exist. The error comes from this part of the code:

https://github.com/reframe-hpc/reframe/blob/8395b49dc86c2a57a21ee6a3285950fde089c69b/reframe/core/schedulers/slurm.py#L485

We should also ignore `squeue`'s failure here and assume that the job has finished.
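
A minimal sketch of the proposed behaviour, not the actual ReFrame code: the function name `query_pending_reason` and the injectable `run` parameter are hypothetical and only there to make the example self-contained. The `squeue` invocation is the one shown in the error message above. A non-zero exit code is treated as "the job is no longer known to Slurm" instead of being raised as an error.

```python
import subprocess


def query_pending_reason(jobid, run=subprocess.run):
    """Return the pending reason of `jobid`, or None if squeue reports none.

    Hypothetical helper: `run` defaults to `subprocess.run` and can be
    replaced by a stub, so the logic can be exercised without Slurm.
    """
    completed = run(
        ['squeue', '-h', '-j', str(jobid), '-o', '%r'],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if completed.returncode != 0:
        # squeue may fail with "Invalid job id specified" once the job has
        # left the queue. Assumption: treat any failure as "job finished"
        # rather than propagating it as a test failure.
        return None

    # An empty reply also means there is no pending reason to report.
    return completed.stdout.strip() or None
```

A caller would then treat a `None` result as "nothing to cancel or report" and let the regular job polling decide the final job state.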