szcf-weiya / techNotes

My notes about technology.
https://tech.hohoweiya.xyz/

fail to request job on the chpc partition #7

Closed szcf-weiya closed 3 years ago

szcf-weiya commented 3 years ago

I have 14 jobs (10 running and 4 queued) on the stat partition, and no jobs on the chpc partition. But when I tried to submit a job to the chpc partition, it failed with the following error:

$ salloc -N1
salloc: error: QOSMaxSubmitJobPerUserLimit
salloc: error: Job submit/allocate failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

I am quite confused: how can I exceed the job limit on the chpc partition?! Incidentally, the maximum number of jobs on the chpc partition is 10, so I am guessing that the chpc partition also counts my jobs on the stat partition:
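One way to test this guess is to list each job's QOS instead of just its partition, since per-QOS limits such as `MaxSubmitPU` are enforced against jobs grouped by QOS. A minimal sketch (field widths are arbitrary; requires access to the cluster):

```shell
# %q prints each job's QOS; the QOSMaxSubmitJobPerUserLimit error is
# triggered by the per-QOS job count, not the per-partition count.
squeue -u "$USER" -o "%.10i %.12P %.10q %.8T"

# Count my jobs per QOS (no header, QOS column only):
squeue -u "$USER" -h -o "%q" | sort | uniq -c
```

If the count for the `normal` QOS is at 10, that matches the `MaxSubmitPU` shown for `normal` in the `sacctmgr` output below.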

$ sacctmgr show qos format=name,MaxJobsPU,MaxSubmitPU,MaxTRESPU
      Name MaxJobsPU MaxSubmitPU     MaxTRESPU 
---------- --------- ----------- ------------- 
    normal                    10               
      stat                    30        cpu=30 
    20jobs                    20               
        p1                    10               
        p2                    10               
        p3                    10               
      hold         0          10               
    tfchan                    10               
      bull                    50               
      ligo                   100               
      demo                    10               
yingyingw+                    30        cpu=30 
     bzhou                    10               
       geo                    10               
     cstat                              cpu=16 

and I have checked the QOS settings in the partition configuration:

$ cat /etc/slurm/slurm.conf 
...
# Partition
PartitionName=chpc Nodes=chpc-cn[002-029,033-040,042-050],chpc-gpu[001-003],chpc-k80gpu[001-002],chpc-large-mem01,chpc-m192a[001-010] Default=YES MaxTime=7-0 MaxNodes=16 State=UP DenyAccounts=public AllowQos=normal,cstat,tfchan,ligo #DenyQos=stat,bull,demo
PartitionName=public Nodes=chpc-cn[005,015,025,035,045],chpc-gpu001 MaxTime=7-0 State=UP 
PartitionName=stat Nodes=chpc-cn[101-110],chpc-gpu[010-014] State=UP AllowAccounts=stat QOS=stat
PartitionName=yingyingwei Nodes=chpc-cn111,chpc-gpu015 State=UP AllowGroups=yingyingwei QOS=yingyingwei
PartitionName=bzhou Nodes=chpc-gpu[004-009] State=UP AllowAccounts=boleizhou QOS=bzhou
PartitionName=tjonnie Nodes=chpc-cn[030-032,041] State=UP AllowGroups=s1155137381 QOS=ligo
#PartitionName=ligo Nodes=chpc-cn050 State=UP AllowAccounts=tjonnieli QOS=ligo
#PartitionName=demo Nodes=chpc-cn049 State=UP AllowAccounts=pione QOS=demo
#PartitionName=geo Nodes=chpc-cn048 State=UP AllowGroups=s1155102420 QOS=geo
PartitionName=itsc Nodes=ALL State=UP AllowAccounts=pione QOS=bull Hidden=yes 

Does `AllowQos=normal,cstat,tfchan,ligo` of the chpc partition cause the above behavior?
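A related check is the user's association in the accounting database, since a job submitted without `-q`/`--qos` falls back to the association's default QOS (typically `normal`). A sketch, assuming standard `sacctmgr` output fields:

```shell
# Show my allowed QOS list and default QOS per association;
# jobs submitted without -q/--qos are charged to DefaultQOS.
sacctmgr show assoc where user=$USER format=user,account,partition,qos,defaultqos
```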

szcf-weiya commented 3 years ago

Ohhh, I got it! Until recently I submitted with just `-p stat` instead of `-p stat -q stat`, where `-q` specifies the QOS. The default QOS is `normal` rather than `stat`, so those jobs were counted against the `normal` quota that the chpc partition uses.
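So the fix is to pass the QOS explicitly whenever the target partition has its own QOS; a minimal sketch (`job.sh` is a placeholder script name):

```shell
# Allocate a node on the stat partition under the stat QOS, so the
# job counts against stat's limits (MaxSubmitPU=30, cpu=30) instead
# of the default "normal" QOS limit (MaxSubmitPU=10):
salloc -N1 -p stat -q stat

# Equivalent long-form options for a batch job:
sbatch --partition=stat --qos=stat job.sh
```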