kftsehk / phy-clusters


computational nodes went down when job was submitted #36

Open caicairay opened 4 years ago

caicairay commented 4 years ago

The computational node loses its connection after a specific job (one requesting 20 nodes) is assigned. The assigned job was rejected and re-queued many times, and ended up in BatchHold.

zcao@mu01:server_logs$ tracejob 72967 -n 4
Job: 72967.mu01
02/22/2020 05:53:50  S    enqueuing into extended, state 1 hop 1
02/22/2020 05:53:50  A    queue=extended
02/23/2020 00:01:32  S    unable to run job, MOM rejected/timeout
02/23/2020 00:01:32  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:01:33  S    Job Run at request of root@mu01
02/23/2020 00:06:33  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:11:34  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:16:35  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:21:36  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:26:37  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:31:38  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:36:39  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:41:40  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:46:41  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:51:42  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:56:43  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:01:44  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:06:45  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:11:46  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:16:47  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:21:48  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:26:49  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:31:50  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:36:51  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:41:52  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:46:53  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:51:54  S    unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:56:55  S    unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 02:01:56  S    unable to run job, send to MOM '11.11.11.2' failed
02/25/2020 20:45:19  S    Job deleted at request of zcao@mu01
02/25/2020 20:45:19  A    requestor=zcao@mu01
02/25/2020 21:00:27  S    on_job_exit valid pjob: 72967.mu01 (substate=59)
02/25/2020 21:15:29  S    dequeuing from extended, state COMPLETE

A similar issue happened with job 72823.mu01. Is it related to #8? Thanks!
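To see at a glance which MOMs keep refusing the job, the tracejob output above can be tallied per address. This is only a rough sketch: `count_mom_failures` is a hypothetical helper name, and it assumes the failure lines keep the exact `send to MOM '<address>' failed` wording shown in the log.

```shell
#!/bin/sh
# Hypothetical helper: tally "send to MOM '<address>' failed" lines.
# Reads tracejob output on stdin and prints one "<count> <address>" line per MOM.
count_mom_failures() {
    # Keep only the address from matching lines, then count duplicates.
    sed -n "s/.*send to MOM '\([^']*\)' failed.*/\1/p" \
        | sort \
        | uniq -c \
        | awk '{ print $1, $2 }'   # normalize uniq's leading padding
}

# Usage, with the job id from this report:
#   tracejob 72967 -n 4 | count_mom_failures
```

If the failures alternate between the same few addresses, as they appear to here, that points at those nodes rather than at the job itself.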

caicairay commented 4 years ago

Possible configuration cause: /var/lib/torque/server_name should contain mu01, but it contained localhost.

This needs to be reconfigured once the nodes are back online.
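Once a node is reachable again, a check along these lines could confirm whether the file holds the expected name. It is a minimal sketch: `check_server_name` is a hypothetical helper, and it assumes the server name sits on the first line of the file.

```shell
#!/bin/sh
# Hypothetical helper: compare the first line of a server_name file
# with the expected server host name.
# Arguments: expected name, optional file path (defaults to the path above).
check_server_name() {
    expected=$1
    file=${2:-/var/lib/torque/server_name}
    # Read the first line and strip whitespace; empty if the file is unreadable.
    actual=$(head -n 1 "$file" 2>/dev/null | tr -d '[:space:]')
    if [ "$actual" = "$expected" ]; then
        echo "ok: $file -> $actual"
    else
        echo "mismatch: $file -> '$actual', expected '$expected'"
        return 1
    fi
}

# Usage on a compute node:
#   check_server_name mu01
```

The function returns a non-zero status on a mismatch, so it can be run in a loop over the nodes and the failing ones collected.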

kftsehk commented 4 years ago

The pbs_mom daemon configuration is in:

 ~]# cat /var/lib/torque/mom_priv/config
# Configuration for pbs_mom.
$pbsserver mu01
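To compare this setting across nodes, the `$pbsserver` value can be pulled out of the file. A small sketch, where `get_pbsserver` is a hypothetical helper and the file is assumed to use the whitespace-separated `$pbsserver <host>` form shown above:

```shell
#!/bin/sh
# Hypothetical helper: print the host named by the first $pbsserver line
# of a pbs_mom config file.
# Argument: optional file path (defaults to the path shown above).
get_pbsserver() {
    # Inside single quotes the shell does not expand $pbsserver,
    # and awk treats "$pbsserver" as a literal string.
    awk '$1 == "$pbsserver" { print $2; exit }' \
        "${1:-/var/lib/torque/mom_priv/config}"
}

# Usage on a compute node:
#   get_pbsserver
```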