kftsehk / phy-clusters


Cu25 is down, job running on it won’t stop #17

Closed caicairay closed 5 years ago

caicairay commented 5 years ago

Node cu25 went down while a job was running on it.

```
zcao@mu01:~$ checkjob 2000

checking job 2000

State: Running
Creds:  user:wygong  group:jyzhu  class:normal  qos:DEFAULT
WallTime: 2:11:51:53 of 2:11:00:00
SubmitTime: Thu Dec 20 12:26:19
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Thu Dec 20 12:26:20
Total Tasks: 48

Req[0]  TaskCount: 48  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [normal]
Allocated Nodes:
[cu25-0:24][cu26-0:24]
WARNING:  allocated node           cu25-0 is in state Down

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '2000' (-2:11:58:17 -> 00:00:01  Duration: 2:11:58:18)
PE:  48.00  StartPriority:  -911
```

The job doesn’t stop properly. Should I kill it?



```
zcao@mu01:~$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

2000                 wygong    Running    48    -1:01:20  Thu Dec 20 12:26:20
2190                   zcao    Running   912    00:04:40  Sun Dec 23 00:02:20

     2 Active Jobs     960 of  984 Processors Active (97.56%)
                        39 of   41 Nodes Active      (95.12%)
```

kftsehk commented 5 years ago

If the job has run past its allotted time, you can kill it. This kind of executable probably doesn't produce any output either.
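
For reference, whether a job is past its limit can be read off the `WallTime` line of the `checkjob` output. Here is a minimal Python sketch, assuming the durations use the `DD:HH:MM:SS` layout seen in that output (shorter forms such as `HH:MM:SS` are padded with zeros). `parse_duration` and `walltime_overrun` are hypothetical helper names for illustration, not part of any scheduler tool.

```python
import re


def parse_duration(text):
    """Convert a colon-separated duration ([[DD:]HH:]MM:SS) to seconds.

    Assumption: at most four non-negative fields, as in the checkjob output.
    """
    parts = [int(p) for p in text.strip().split(":")]
    # Pad on the left so we always have days, hours, minutes, seconds.
    parts = [0] * (4 - len(parts)) + parts
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def walltime_overrun(checkjob_output):
    """Return seconds used beyond the limit (negative if still within it).

    Returns None when no 'WallTime: X of Y' line is found.
    """
    match = re.search(r"WallTime:\s*(\S+)\s+of\s+(\S+)", checkjob_output)
    if match is None:
        return None
    used = parse_duration(match.group(1))
    limit = parse_duration(match.group(2))
    return used - limit


if __name__ == "__main__":
    sample = "WallTime: 2:11:51:53 of 2:11:00:00"
    print(walltime_overrun(sample))  # 3113
```

For job 2000 this gives 3113 seconds, so the job was roughly 52 minutes past its limit when `checkjob` was run.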

Related to #8