norment / tsd_issues

Repo to track issues with TSD as tickets

Lack of information about queueing system #21

Closed denvdm closed 4 years ago

denvdm commented 4 years ago

Very few, if any, of us fully understand the inner workings of the queueing system. There should be clearer guidelines, e.g. about how to optimize job specifications and what may slow down the system. As an example, I think we've all had the frustrating experience of submitting jobs that then sit in the queue for hours with nothing happening. Checking qsumm at such times sometimes shows p33 with e.g. 800 jobs running (yes, status running), while none are listed by squeue. Is this a bug? Is there something wrong with our job specifications? Unclear.

ofrei commented 4 years ago

Yeah, true. To me the SLURM scheduling policy is quite cryptic, not just on TSD but also on Abel / SAGA. I've read through https://slurm.schedmd.com/priority_multifactor.html , but it's kind of useless without access to the exact slurm.conf file.
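To illustrate why the slurm.conf matters, here is a minimal sketch of the weighted sum described on the multifactor priority page. It is a simplification: the site, association and TRES terms are left out, and every weight and factor value below is hypothetical, not taken from TSD's configuration. The names `Weights`, `Factors` and `job_priority` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Stand-ins for the PriorityWeight* settings in slurm.conf (values unknown here)."""
    age: int
    fairshare: int
    job_size: int
    partition: int
    qos: int


@dataclass
class Factors:
    """Per-job factors; Slurm normalizes each one to the range 0.0 to 1.0."""
    age: float
    fairshare: float
    job_size: float
    partition: float
    qos: float


def job_priority(w: Weights, f: Factors, nice: int = 0) -> float:
    """Simplified multifactor priority: sum of weight * factor, minus the nice value."""
    return (
        w.age * f.age
        + w.fairshare * f.fairshare
        + w.job_size * f.job_size
        + w.partition * f.partition
        + w.qos * f.qos
        - nice
    )


# Hypothetical weights: fair-share dominates, age counts a little.
weights = Weights(age=1000, fairshare=10000, job_size=0, partition=0, qos=0)

# A job that has waited a long time, from a project that has used a lot recently.
old_job = Factors(age=1.0, fairshare=0.25, job_size=0.0, partition=0.0, qos=0.0)
# A freshly submitted job from a project with little recent usage.
new_job = Factors(age=0.0, fairshare=0.5, job_size=0.0, partition=0.0, qos=0.0)

old_priority = job_priority(weights, old_job)
new_priority = job_priority(weights, new_job)
```

With these made-up weights the freshly submitted job outranks the one that has waited longest. With different weights the order flips, which is why the formula alone says little without the actual configuration.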

ofrei commented 4 years ago

It's possible to run sshare -a -l on TSD to get some more info, but the numbers in the output are not clear to me.

ofrei commented 4 years ago

I think it's fair to close this: the queue system is documented here: https://www.uio.no/english/services/it/research/sensitive-data/use-tsd/hpc/queue-system.html , and we have further notes here: http://norment.awiki.org/dokuwiki/tsd_hpc_sessions . If there are more specific questions, please re-open.