Closed caicairay closed 5 years ago
@caicairay Please note #15, #10, and #5; we will fix #5 ASAP.
Hi @kftsehk. Every time I submit a job, I hit this problem: the job is deferred for a while (about a minute), then sent to execution.
State: Idle EState: Deferred
Creds: user:zcao group:hbli class:extended qos:DEFAULT
WallTime: 00:00:00 of 2:00:00
SubmitTime: Mon Nov 19 20:50:48
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
Total Tasks: 960
Req[0] TaskCount: 960 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [normal]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15085, msg: 'End of File')
Holds: Defer (hold reason: RMFailure)
PE: 960.00 StartPriority: -12355
cannot select job 149809 for partition DEFAULT (job hold active)
It could be that the TORQUE MOM temporarily becomes unresponsive and therefore cannot receive the job start request. See https://torqueusers.supercluster.narkive.com/ZXAzm4jP/rm-failure-rc-15085-msg-end-of-file
Possibly related to #8, or to launching a job on a large number of nodes.
All nodes work now.
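For anyone hitting the same deferral, here is a minimal Python sketch for pulling the hold reason and RM return code out of `checkjob` output like the one pasted above. The function name `parse_defer_reason` and the regular expression are assumptions modelled only on the text in this thread, not on any documented scheduler output format, so other versions may need a different pattern.

```python
import re

# Pattern modelled on the single "job is deferred" line pasted in this thread;
# other scheduler versions may word it differently (assumption).
DEFER_RE = re.compile(
    r"job is deferred\.\s+Reason:\s+(?P<reason>\w+)\s+"
    r"\(.*?rc:\s*(?P<rc>\d+),\s*msg:\s*'(?P<msg>[^']*)'\)"
)


def parse_defer_reason(checkjob_text):
    """Return (reason, rc, msg) if the job is deferred, else None."""
    match = DEFER_RE.search(checkjob_text)
    if match is None:
        return None
    return match.group("reason"), int(match.group("rc")), match.group("msg")


# Lines copied from the checkjob output in this issue.
sample = (
    "Flags: RESTARTABLE\n"
    "job is deferred. Reason: RMFailure (cannot start job - RM failure, "
    "rc: 15085, msg: 'End of File')\n"
    "Holds: Defer (hold reason: RMFailure)\n"
)
print(parse_defer_reason(sample))
```

Logging the return code per submission could show whether the deferral always carries rc 15085 or varies with job size.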
zcao@mu01:~$ date
Wed Dec 19 23:58:34 CST 2018
zcao@mu01:~$ pbsnodes -l
zcao@mu01:~$
Can we close this?
One node is not available for job execution. It seems cu01 is for debugging only.
We need to bring cu09 and/or cu39 online to continue the simulation.
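To check which nodes are unavailable before resubmitting, the output of `pbsnodes -l` can be parsed with a short script. The sketch below assumes that each non-blank line is a node name followed by its state, separated by whitespace; the helper name `unavailable_nodes` and the sample node names are hypothetical and not taken from this cluster.

```python
def unavailable_nodes(pbsnodes_l_output):
    """Map node name -> state for each line of `pbsnodes -l` output.

    Assumes a two-column "name state" layout; lines with fewer than
    two fields are skipped.
    """
    nodes = {}
    for line in pbsnodes_l_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            nodes[fields[0]] = fields[1]
    return nodes


# Hypothetical output: node names and states are made up for illustration.
sample = "nodeA                offline\nnodeB                down\n"
print(unavailable_nodes(sample))

# Empty output, as in the `pbsnodes -l` run shown earlier in this thread,
# yields an empty mapping: nothing is reported as unavailable.
print(unavailable_nodes(""))
```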