There are signs of a full job buffer, but Maui seems to continue normally after the error.
/usr/local/maui/log/maui.log.1:09/03 20:28:10 ERROR: job buffer is full (ignoring job '68028.mu01')
/usr/local/maui/log/maui.log.1-09/03 20:28:10 WARNING: job buffer overflow (cannot add job '68029')
/usr/local/maui/log/maui.log.1:09/03 20:28:10 ERROR: job buffer is full (ignoring job '68029.mu01')
/usr/local/maui/log/maui.log.1-09/03 20:28:10 WARNING: job buffer overflow (cannot add job '68030')
/usr/local/maui/log/maui.log.1:09/03 20:28:10 ERROR: job buffer is full (ignoring job '68030.mu01')
/usr/local/maui/log/maui.log.1-09/03 20:28:10 INFO: 5667 PBS jobs detected on RM MU01
/usr/local/maui/log/maui.log.1-09/03 20:28:10 INFO: jobs detected: 5667
/usr/local/maui/log/maui.log.1-09/03 20:28:10 MStatClearUsage(node,Active)
The log is too short-lived and too bulky to be useful for debugging, so I lowered the log level to 2 and increased the log file size to 300 MB.
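For reference, a minimal sketch of the corresponding maui.cfg changes, assuming the stock parameter names LOGLEVEL and LOGFILEMAXSIZE (the latter takes bytes); a restart of the maui daemon is needed for them to take effect:
# /usr/local/maui/maui.cfg
LOGLEVEL          2            # less verbose logging
LOGFILEMAXSIZE    314572800    # ~300 MB per log file before it is rolled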
A new issue related to Maui:
zcao@mu01 ~$:showq
ERROR: lost connection to server
ERROR: cannot request service (status)
Found a potential solution: http://linuxtoolkit.blogspot.com/2015/10/maui-secondary-client-lost-connection.html
The time difference between mu01 and cu01 is about 2 minutes:
zcao@mu01 ~$:ssh cu01 date +"%T";date +"%T"
14:55:44
14:57:54
Reset the time (one way to force the resync is sketched after the check below):
zcao@mu01 sbin$:ssh cu01 date +"%T";date +"%T"
15:08:15
15:08:15
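A minimal sketch of one way to force such a one-shot resync, assuming ntpdate is installed; the server name below is only a placeholder for the site's NTP source:
# hypothetical one-shot resync; replace ntp.example.org with the real NTP server
ntpdate -u ntp.example.org
hwclock --systohc    # optionally write the corrected time back to the hardware clock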
Restart the maui daemon.
zcao@mu01 sbin$:sudo service maui restart
Shutting down MAUI Scheduler: [ OK ]
Starting MAUI Scheduler: [ OK ]
Maui is back to work:
zcao@mu01 sbin$:showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
68030 whsze Running 24 10:00:00 Tue Sep 4 15:08:01
39477 jzzhang Running 48 1:06:00:00 Tue Sep 4 15:08:01
39478 jzzhang Running 48 1:06:00:00 Tue Sep 4 15:08:01
39479 jzzhang Running 48 1:06:00:00 Tue Sep 4 15:08:01
39480 jzzhang Running 48 1:06:00:00 Tue Sep 4 15:08:01
68031 groupjfwang Running 32 1:16:00:00 Tue Sep 4 15:08:01
39472 gyding Running 96 1:23:59:58 Tue Sep 4 15:07:59
68032 sylau Running 48 2:00:00:00 Tue Sep 4 15:08:01
68033 sylau Running 48 2:00:00:00 Tue Sep 4 15:08:01
68034 sylau Running 48 2:00:00:00 Tue Sep 4 15:08:01
68035 sylau Running 48 2:00:00:00 Tue Sep 4 15:08:01
68036 sylau Running 48 2:00:00:00 Tue Sep 4 15:08:01
68037 sylau Running 48 2:00:00:00 Tue Sep 4 15:08:01
39474 jzzhang Running 72 2:10:59:58 Tue Sep 4 15:07:59
39475 jzzhang Running 72 2:10:59:58 Tue Sep 4 15:07:59
68020 cksin Running 96 2:11:00:00 Tue Sep 4 15:08:01
68021 cksin Running 96 2:11:00:00 Tue Sep 4 15:08:01
68028 nnli Running 24 2:11:58:58 Tue Sep 4 15:07:59
68029 nnli Running 24 2:11:59:00 Tue Sep 4 15:08:01
19 Active Jobs 1016 of 1056 Processors Active (96.21%)
43 of 44 Nodes Active (97.73%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
68022 cksin Idle 48 2:00:00:00 Mon Sep 3 14:54:52
68023 cksin Idle 48 2:00:00:00 Mon Sep 3 14:55:03
68024 cksin Idle 48 2:00:00:00 Mon Sep 3 14:55:14
68025 cksin Idle 48 2:00:00:00 Mon Sep 3 14:55:27
68026 cksin Idle 48 2:00:00:00 Mon Sep 3 14:55:39
68027 cksin Idle 48 2:00:00:00 Mon Sep 3 14:55:48
6 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
39432 wyliu Idle 480 4:23:59:00 Wed Aug 29 17:18:14
Total Jobs: 26 Active Jobs: 19 Idle Jobs: 6 Blocked Jobs: 1
[root@mu01 ~]# pdsh -w^hosts chkconfig --list ntpd
gateway: ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
io01: ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
[root@mu01 ~]# chkconfig --list ntpd
ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
Somehow the time-sync daemon (ntpd) was turned off on the management nodes; it has been re-enabled:
[root@mu01 ~]# chkconfig ntpd on
[root@mu01 ~]# service ntpd restart
Shutting down ntpd: [FAILED]
Starting ntpd: [ OK ]
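Since gateway and io01 also had ntpd disabled, the same fix can be pushed to the other nodes with pdsh, reusing the hosts file from above:
pdsh -w^hosts 'chkconfig ntpd on && service ntpd restart'
# verify afterwards
pdsh -w^hosts chkconfig --list ntpd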
I hope this is solved; I will reopen if it happens again.
The likely root cause is the job buffer overflow. The exact submission timestamp of the jobs that triggered it is:
Sat Sep 1 23:54:38 2018
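To catch this earlier next time, a queue-depth watchdog could be cron'd on mu01; a minimal sketch, assuming Torque's qstat is on the PATH and that THRESHOLD (a placeholder value here) is set just below the job-buffer limit your Maui build reports:
#!/bin/sh
# hypothetical watchdog: warn before the PBS queue depth overruns Maui's job buffer
THRESHOLD=5000                        # assumption: set just below the actual buffer limit
njobs=$(qstat | tail -n +3 | wc -l)   # skip the two qstat header lines
if [ "$njobs" -gt "$THRESHOLD" ]; then
    echo "WARNING: $njobs jobs queued; Maui job buffer may overflow" | mail -s "maui job buffer" root
fi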