kftsehk / phy-clusters


Maui automatically turned off #9

Closed caicairay closed 6 years ago

caicairay commented 6 years ago

The suspected cause is an overflow of the job buffer. The exact submission timestamp is Sat Sep 1 23:54:38 2018:

From adm@mu01.cluster2  Sat Sep  1 23:54:38 2018
Return-Path: <adm@mu01.cluster2>
X-Original-To: hjjiang@mu01
Delivered-To: hjjiang@mu01.cluster2
Received: from mu01 (localhost [127.0.0.1])
        by mu01.cluster2 (Postfix) with ESMTP id 32ABE941B55
        for <hjjiang@mu01>; Sat,  1 Sep 2018 23:54:38 +0800 (CST)
Received: (from root@localhost)
        by mu01 (8.14.4/8.14.4/Submit) id w81FscO9002523
        for hjjiang@mu01; Sat, 1 Sep 2018 23:54:38 +0800
Date: Sat, 1 Sep 2018 23:54:38 +0800
From: adm <adm@mu01.cluster2>
Message-Id: <201809011554.w81FscO9002523@mu01>
To: hjjiang@mu01.cluster2
Subject: PBS JOB 39520.mu01
Precedence: bulk

PBS Job Id: 39520.mu01
Job Name:   zm-0.3-0.3-51
Exec host:  cu05-0/0
Begun execution

kftsehk commented 6 years ago

There are signs that the job buffer was full, but Maui seems to continue normally after the error.

/usr/local/maui/log/maui.log.1:09/03 20:28:10 ERROR:    job buffer is full  (ignoring job '68028.mu01')
/usr/local/maui/log/maui.log.1-09/03 20:28:10 WARNING:  job buffer overflow (cannot add job '68029')
/usr/local/maui/log/maui.log.1:09/03 20:28:10 ERROR:    job buffer is full  (ignoring job '68029.mu01')
/usr/local/maui/log/maui.log.1-09/03 20:28:10 WARNING:  job buffer overflow (cannot add job '68030')
/usr/local/maui/log/maui.log.1:09/03 20:28:10 ERROR:    job buffer is full  (ignoring job '68030.mu01')
/usr/local/maui/log/maui.log.1-09/03 20:28:10 INFO:     5667 PBS jobs detected on RM MU01
/usr/local/maui/log/maui.log.1-09/03 20:28:10 INFO:     jobs detected: 5667
/usr/local/maui/log/maui.log.1-09/03 20:28:10 MStatClearUsage(node,Active)

The log seems too short and too bulky for debugging purposes, so I lowered the log level to 2 and increased the log size to 300 MB.
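To see how many jobs were actually dropped, the "job buffer is full" lines can be reduced to a list of unique job IDs. This is only a rough sketch: the helper name `ignored_job_ids` is made up here, and it assumes the log lines keep exactly the format shown in the excerpt above.

```shell
# Hypothetical helper (not part of Maui): print the unique job IDs that the
# scheduler ignored because its job buffer was full.
# Reads Maui log text on stdin; assumes the line format from the excerpt above.
ignored_job_ids() {
  sed -n "s/.*job buffer is full *(ignoring job '\([^']*\)').*/\1/p" | sort -u
}

# Example use (path taken from the log excerpt above):
#   ignored_job_ids < /usr/local/maui/log/maui.log.1 | wc -l
```

Comparing that count with the "PBS jobs detected" figure in the same log should give a feel for how far the queue exceeded the buffer.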

caicairay commented 6 years ago

New issue related to Maui:

zcao@mu01 ~$:showq
ERROR:    lost connection to server
ERROR:    cannot request service (status)

Found a potential solution: http://linuxtoolkit.blogspot.com/2015/10/maui-secondary-client-lost-connection.html

The clocks on mu and cu differ by about 2 minutes:

zcao@mu01 ~$:ssh cu01 date +"%T";date +"%T"
14:55:44
14:57:54
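
The skew can also be computed instead of read off by eye. A minimal sketch, assuming both timestamps are `HH:MM:SS` strings from the same day (it does not handle a wrap around midnight); the function name `skew_seconds` is made up for illustration.

```shell
# Hypothetical helper: absolute difference in seconds between two HH:MM:SS
# timestamps. Assumes both fall on the same day (no midnight wraparound).
skew_seconds() {
  printf '%s %s\n' "$1" "$2" | awk '{
    split($1, a, ":"); split($2, b, ":")
    d = (b[1] * 3600 + b[2] * 60 + b[3]) - (a[1] * 3600 + a[2] * 60 + a[3])
    if (d < 0) d = -d
    print d
  }'
}

# Example use with the hosts from this thread:
#   skew_seconds "$(ssh cu01 date +%T)" "$(date +%T)"
```

For the two timestamps above this gives 130 seconds. The ssh round trip adds a little latency of its own, so small values are not meaningful.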

caicairay commented 6 years ago

Reset the time:

zcao@mu01 sbin$:ssh cu01 date +"%T";date +"%T"
15:08:15
15:08:15

Restart the Maui daemon:

zcao@mu01 sbin$:sudo service maui restart
Shutting down MAUI Scheduler:                              [  OK  ]
Starting MAUI Scheduler:                                   [  OK  ]

Maui is back to work:

zcao@mu01 sbin$:showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

68030                 whsze    Running    24    10:00:00  Tue Sep  4 15:08:01
39477               jzzhang    Running    48  1:06:00:00  Tue Sep  4 15:08:01
39478               jzzhang    Running    48  1:06:00:00  Tue Sep  4 15:08:01
39479               jzzhang    Running    48  1:06:00:00  Tue Sep  4 15:08:01
39480               jzzhang    Running    48  1:06:00:00  Tue Sep  4 15:08:01
68031              groupjfwang    Running    32  1:16:00:00  Tue Sep  4 15:08:01
39472                gyding    Running    96  1:23:59:58  Tue Sep  4 15:07:59
68032                 sylau    Running    48  2:00:00:00  Tue Sep  4 15:08:01
68033                 sylau    Running    48  2:00:00:00  Tue Sep  4 15:08:01
68034                 sylau    Running    48  2:00:00:00  Tue Sep  4 15:08:01
68035                 sylau    Running    48  2:00:00:00  Tue Sep  4 15:08:01
68036                 sylau    Running    48  2:00:00:00  Tue Sep  4 15:08:01
68037                 sylau    Running    48  2:00:00:00  Tue Sep  4 15:08:01
39474               jzzhang    Running    72  2:10:59:58  Tue Sep  4 15:07:59
39475               jzzhang    Running    72  2:10:59:58  Tue Sep  4 15:07:59
68020                 cksin    Running    96  2:11:00:00  Tue Sep  4 15:08:01
68021                 cksin    Running    96  2:11:00:00  Tue Sep  4 15:08:01
68028                  nnli    Running    24  2:11:58:58  Tue Sep  4 15:07:59
68029                  nnli    Running    24  2:11:59:00  Tue Sep  4 15:08:01

    19 Active Jobs    1016 of 1056 Processors Active (96.21%)
                        43 of   44 Nodes Active      (97.73%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

68022                 cksin       Idle    48  2:00:00:00  Mon Sep  3 14:54:52
68023                 cksin       Idle    48  2:00:00:00  Mon Sep  3 14:55:03
68024                 cksin       Idle    48  2:00:00:00  Mon Sep  3 14:55:14
68025                 cksin       Idle    48  2:00:00:00  Mon Sep  3 14:55:27
68026                 cksin       Idle    48  2:00:00:00  Mon Sep  3 14:55:39
68027                 cksin       Idle    48  2:00:00:00  Mon Sep  3 14:55:48

6 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

39432                 wyliu       Idle   480  4:23:59:00  Wed Aug 29 17:18:14

Total Jobs: 26   Active Jobs: 19   Idle Jobs: 6   Blocked Jobs: 1

kftsehk commented 6 years ago

[root@mu01 ~]# pdsh -w^hosts chkconfig --list ntpd
gateway: ntpd                   0:off   1:off   2:off   3:off   4:off   5:off   6:off
io01: ntpd              0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@mu01 ~]# chkconfig --list ntpd
ntpd            0:off   1:off   2:off   3:off   4:off   5:off   6:off

Somehow the time sync daemon (ntpd) was turned off on the management nodes; I have re-enabled it.

[root@mu01 ~]# chkconfig ntpd on
[root@mu01 ~]# service ntpd restart
Shutting down ntpd:                                        [FAILED]
Starting ntpd:                                             [  OK  ]
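
To catch this earlier next time, the pdsh output can be filtered for hosts where ntpd is disabled. A rough sketch only: the function name `ntpd_disabled_hosts` is made up, it expects the `host: ntpd 0:off ...` form that pdsh prints (not the unprefixed local output), and checking runlevel 3 is an assumption about which runlevel these nodes boot into.

```shell
# Hypothetical helper: read `pdsh ... chkconfig --list ntpd` output on stdin
# and print each host whose ntpd is off in runlevel 3.
# Assumption: runlevel 3 is the runlevel that matters on these nodes.
ntpd_disabled_hosts() {
  awk '$2 == "ntpd" {
    for (i = 3; i <= NF; i++)
      if ($i == "3:off") { sub(/:$/, "", $1); print $1 }
  }'
}

# Example use (same pdsh invocation as above):
#   pdsh -w^hosts chkconfig --list ntpd | ntpd_disabled_hosts
```

An empty result would mean ntpd is enabled in runlevel 3 on every host that answered.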

kftsehk commented 6 years ago

I hope this is solved; reopen if it happens again.