ucsf-wynton / wynton-website-hpc

The Official Wynton HPC User Website
https://wynton.ucsf.edu/hpc/

DOCS: Make it clear that 'Eqw' jobs remain until the error is fixed or the job is qdel:ed #120

Open HenrikBengtsson opened 1 year ago

HenrikBengtsson commented 1 year ago

There are several left-over jobs in the error state (Eqw) that just sit in the queue.

$ qstat -u '*' | grep -E "\bEqw\b" | wc -l
302

$ qstat -u '*' | grep -E "\bEqw\b" | head
 999111 0.27944 dscpileup. alice     Eqw  01/06/2023 22:13:21    1
1123487 0.08937 GEXA7      bob       Eqw  01/17/2023 19:14:18   16
1123493 0.08937 GEXB1      bob       Eqw  01/17/2023 19:15:18   16
1123501 0.08937 GEXA7      bob       Eqw  01/17/2023 19:31:23   16
1123517 0.08875 GEXA7      bob       Eqw  01/17/2023 19:45:46   16
 970748 0.08045 nf-DADA2_A charlie   Eqw  03/16/2023 22:27:33   16
 971176 0.08007 nf-DADA2_A charlie   Eqw  03/16/2023 22:32:03   16
2423870 0.07663 dask-worke carol     Eqw  04/06/2023 14:32:48    1
2424271 0.07663 dask-worke carol     Eqw  04/06/2023 15:21:23    1
2423840 0.07662 dask-worke carol     Eqw  04/06/2023 14:32:47    1
...

At a minimum, the docs should explain that these jobs stay in the queue forever unless the underlying error gets fixed or the job is qdel:ed by the user.

PS. SGE keeps spending time on these jobs over and over.
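
For the docs, something like the following could illustrate how a user finds and deals with their own Eqw jobs. It is only a sketch: the helper name `eqw_job_ids` is made up for this example, and it assumes the default `qstat` column layout shown above, where the state is the fifth whitespace-separated field. The job ID in the comments is a placeholder taken from the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): read `qstat` output on stdin and
# print the job ID (field 1) of every row whose state (field 5) is "Eqw".
# Assumes the default qstat columns: job-ID, prior, name, user, state, ...
eqw_job_ids() {
  awk '$5 == "Eqw" { print $1 }'
}

# Possible use on the cluster (commented out so the snippet can be sourced anywhere):
#   qstat -u "$USER" | eqw_job_ids             # list my jobs stuck in Eqw
#   qstat -j 999111 | grep -i 'error reason'   # show why the job ended up in the error state
#   qmod -cj 999111                            # cause fixed: clear the error state so the job can be scheduled again
#   qdel 999111                                # otherwise: remove the job from the queue
```

Matching on the state field, rather than grepping the whole line, avoids false hits when a job name or user name happens to contain "Eqw".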