kftsehk / phy-clusters


node not available #16

Closed caicairay closed 5 years ago

caicairay commented 5 years ago

One node is not available for job execution. It seems cu01 is reserved for debugging only.

cu09 and/or cu39 need to be brought online so the simulation can continue.

zcao@mu01:iso_sg_off_highres$ checkjob 149761

checking job 149761

State: Idle
Creds:  user:zcao  group:hbli  class:extended  qos:DEFAULT
WallTime: 00:00:00 of 1:22:00:00
SubmitTime: Sun Nov 18 00:00:12
  (Time Queued  Total: 00:01:25  Eligible: 00:01:25)

Total Tasks: 1008

Req[0]  TaskCount: 1008  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [normal]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  1008.00  StartPriority:  -9489
job cannot run in partition DEFAULT (idle procs do not meet requirements : 984 of 1008 procs found)
idle procs: 1008  feasible procs: 984

Rejection Reasons: [Features     :    1][State        :    2]
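
For reading output like the above, the size of the gap can be pulled out of the `N of M procs found` message. This is only a rough sketch: `proc_shortfall` is a hypothetical helper (not part of Moab or TORQUE), and it assumes the message wording is exactly as it appears in this report.

```python
import re

def proc_shortfall(checkjob_output):
    """Return (required - found) procs, or None if the message is absent.

    Hypothetical helper; assumes the 'N of M procs found' wording
    seen in the checkjob output in this issue.
    """
    match = re.search(r"(\d+) of (\d+) procs found", checkjob_output)
    if match is None:
        return None
    found, required = int(match.group(1)), int(match.group(2))
    return required - found

# The rejection line copied from the checkjob output in this report.
sample = ("job cannot run in partition DEFAULT (idle procs do not meet "
          "requirements : 984 of 1008 procs found)")
print(proc_shortfall(sample))  # 24
```

Here the job is 24 procs short of the 1008 it requested.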

kftsehk commented 5 years ago

@caicairay Please note #15, #10, and #5. We will fix #5 ASAP.

caicairay commented 5 years ago

Hi @kftsehk. Every time I submit a job, I run into this problem: the job is deferred for a while (about a minute) and then sent to execution.

State: Idle  EState: Deferred
Creds:  user:zcao  group:hbli  class:extended  qos:DEFAULT
WallTime: 00:00:00 of 2:00:00
SubmitTime: Mon Nov 19 20:50:48
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

Total Tasks: 960

Req[0]  TaskCount: 960  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [normal]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 15085, msg: 'End of File')
Holds:    Defer  (hold reason:  RMFailure)
PE:  960.00  StartPriority:  -12355
cannot select job 149809 for partition DEFAULT (job hold active)

kftsehk commented 5 years ago

It could be that the TORQUE MOM temporarily becomes unresponsive and thus cannot receive the job start request. https://torqueusers.supercluster.narkive.com/ZXAzm4jP/rm-failure-rc-15085-msg-end-of-file

Possibly related to #8, or to launching a job on a large number of nodes.
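
To tell this kind of transient deferral apart from other holds, the reason, return code, and message can be extracted from the `job is deferred` line. The snippet below is a minimal sketch under assumptions: `parse_defer_reason` is a hypothetical helper, and the regular expressions only cover the line format quoted in this thread.

```python
import re

def parse_defer_reason(checkjob_output):
    """Return a dict with 'reason', 'rc', and 'msg', or None if not deferred.

    Hypothetical helper; the patterns assume the 'job is deferred' line
    format quoted in this thread and may not match other versions.
    """
    line = re.search(r"job is deferred\.\s+Reason:\s+(\w+)\s+\((.*)\)",
                     checkjob_output)
    if line is None:
        return None
    detail = line.group(2)
    rc = re.search(r"rc:\s*(\d+)", detail)
    msg = re.search(r"msg:\s*'([^']*)'", detail)
    return {
        "reason": line.group(1),
        "rc": int(rc.group(1)) if rc else None,
        "msg": msg.group(1) if msg else None,
    }

# The deferral line copied from the checkjob output above.
sample = ("job is deferred.  Reason:  RMFailure  (cannot start job - "
          "RM failure, rc: 15085, msg: 'End of File')")
print(parse_defer_reason(sample))
```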

caicairay commented 5 years ago

All nodes work now.

zcao@mu01:~$ date
Wed Dec 19 23:58:34 CST 2018
zcao@mu01:~$ pbsnodes -l
zcao@mu01:~$ 
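
By default `pbsnodes -l` lists only nodes that are down, offline, or in an unknown state, so the empty output above means no node is flagged. A small sketch of the same check on captured output follows; `unavailable_nodes` is a hypothetical helper, and the two-column `name state` layout and the example states are assumptions for illustration, not output from this cluster.

```python
def unavailable_nodes(pbsnodes_output):
    """Parse captured `pbsnodes -l` text into (node, state) pairs.

    Hypothetical helper; assumes each non-empty line is
    '<node name> <state>' separated by whitespace.
    """
    nodes = []
    for line in pbsnodes_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            nodes.append((fields[0], fields[1]))
    return nodes

# Empty output, as in the transcript above: nothing is flagged.
print(unavailable_nodes(""))  # []

# Illustrative input only; these states are made up for the example.
print(unavailable_nodes("cu09   down\ncu39   offline\n"))
```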

Can we close this?