Closed denvdm closed 4 years ago
Yeah, true. To me the SLURM scheduling policy is quite cryptic, not just on TSD but also on Abel / SAGA. I've read through https://slurm.schedmd.com/priority_multifactor.html , but it's of limited use without access to the exact `slurm.conf` file.
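For what it's worth, the formula on that page boils down to a weighted sum of factors, each between 0 and 1. Here is a rough Python sketch of it. The weights below are placeholders I made up, not the values from TSD's `slurm.conf` (which we can't see), and I've left out the site, association and TRES terms to keep it short.

```python
from dataclasses import dataclass


@dataclass
class PriorityWeights:
    # Hypothetical values; the real ones are the PriorityWeight* settings
    # in the cluster's slurm.conf, which are not known here.
    age: int = 1000
    fairshare: int = 10000
    job_size: int = 1000
    partition: int = 1000
    qos: int = 2000


def job_priority(w, age, fairshare, job_size, partition, qos, nice=0):
    """Simplified multifactor priority: weighted sum of factors in [0, 1].

    Omits the site, association and TRES terms of the full Slurm formula.
    """
    total = (
        w.age * age
        + w.fairshare * fairshare
        + w.job_size * job_size
        + w.partition * partition
        + w.qos * qos
    )
    # The nice value is subtracted from the weighted sum.
    return round(total) - nice


if __name__ == "__main__":
    w = PriorityWeights()
    # A job that has waited a while, from a project with a low fair-share factor.
    print(job_priority(w, age=0.5, fairshare=0.2, job_size=0.1, partition=1.0, qos=0.0))
```

With weights like these the fair-share term dominates, which would explain long waits for a project that has recently used a lot of resources, but that is only a guess without the actual configuration.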
It's possible to run `sshare -a -l` on TSD to get some more info, but the numbers it reports are not clear to me.
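One way to make the `sshare` numbers easier to compare is to ask for pipe-delimited output (`sshare -a -l -P`) and load it into dictionaries keyed by the header row. A minimal sketch follows; the sample text is made up for illustration and is not real TSD output, and the helper names are my own.

```python
import subprocess

# Made-up sample in the pipe-delimited layout; NOT real output from TSD.
SAMPLE = """Account|User|RawShares|NormShares|EffectvUsage|FairShare
p33||1|0.500000|0.750000|
p33|alice|1|0.250000|0.600000|0.300000
p33|bob|1|0.250000|0.150000|0.800000
"""


def parse_sshare(text):
    """Turn pipe-delimited sshare output into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def run_sshare():
    """Call sshare on a machine where Slurm is installed (not run by default)."""
    result = subprocess.run(
        ["sshare", "-a", "-l", "-P"], capture_output=True, text=True, check=True
    )
    return parse_sshare(result.stdout)


def users_by_fairshare(rows):
    """Return user rows sorted from lowest to highest FairShare value."""
    users = [r for r in rows if r.get("User") and r.get("FairShare")]
    return sorted(users, key=lambda r: float(r["FairShare"]))


if __name__ == "__main__":
    for row in users_by_fairshare(parse_sshare(SAMPLE)):
        print(row["Account"], row["User"], row["FairShare"])
```

As far as I understand the Slurm docs, a lower FairShare value means the user or account has consumed more than its share and gets a smaller fair-share factor, so sorting on that column at least shows who is being deprioritized.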
I think it's fair to close this: the queue system is documented here: https://www.uio.no/english/services/it/research/sensitive-data/use-tsd/hpc/queue-system.html , and we have further notes here: http://norment.awiki.org/dokuwiki/tsd_hpc_sessions . If there are more specific questions, please re-open.
Very few, if any, of us fully understand the inner workings of the queueing system. There should be clearer guidelines, e.g. about how to optimize job specifications and what may slow down the system. As an example, I think we've all had the frustrating experience of submitting jobs that then sit in the queue for hours with nothing happening. Checking qsumm at such times sometimes shows p33 running (yes, status running) e.g. 800 jobs, while none are listed by squeue. Is this a bug? Is there something wrong with our job specifications? Unclear.