olafz / percona-clustercheck

Script to make a proxy (ie HAProxy) capable of monitoring Percona XtraDB Cluster nodes properly. The clustercheck script is distributed under the BSD license.
BSD 3-Clause "New" or "Revised" License

Systemd with Clusterchk no longer working #35

Closed: dcz010 closed this issue 2 months ago

dcz010 commented 9 months ago

Hi,

Could there be a problem with clusterchk.socket and systemd, where the system fills up with so many file descriptors that it can no longer handle them?

Here are some syslog excerpts from a Debian VM:

xxxxxxxxxxxxx@server1:~$ sudo systemctl start reboot.target
Failed to start reboot.target: Argument list too long
See system logs and 'systemctl status reboot.target' for details.
xxxxxxxxxxxxxx@server1:~$ sudo systemctl status reboot.target
Failed to get properties: Unknown object '/org/freedesktop/systemd1/unit/reboot_2etarget'.
xxxxxxxxxxxxxx@server1:~$ sudo systemctl reboot
Failed to reboot system via logind: Invalid request descriptor
Failed to start reboot.target: Argument list too long
See system logs and 'systemctl status reboot.target' for details.

xxxxxxxxxxx@server1:~$ sudo journalctl --unit dbus
-- Journal begins at Tue 2023-11-21 22:04:31 CET, ends at Mon 2023-11-27 11:34:07 CET. --
Nov 27 07:31:47 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:32:12 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:32:37 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:33:02 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:33:27 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:33:52 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:34:17 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:34:42 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:35:07 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:35:32 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:35:57 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:36:22 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:36:47 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:37:12 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:37:37 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:38:02 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:38:27 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:38:52 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:39:17 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:39:42 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:40:07 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:40:32 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:40:57 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:41:22 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:41:47 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:42:12 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:42:37 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:43:02 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:43:27 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:43:52 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:44:17 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:44:42 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:45:07 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:45:32 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:45:57 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:46:22 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:46:47 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:47:12 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:47:37 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:48:02 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:48:27 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:48:52 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:49:17 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:49:42 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:50:07 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:50:32 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:50:57 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 07:51:19 server1 dbus-daemon[477]: [system] Successfully activated service 'org.freedesktop.systemd1'
Nov 27 08:15:03 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 08:15:28 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 08:15:53 server1 dbus-daemon[477]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Nov 27 08:16:02 server1 dbus-daemon[477]: [system] Successfully activated service 'org.freedesktop.systemd1'
-- Boot 29a3a40650fe42d4a37bdce0bdf2b71d --
Nov 27 08:19:11 server1 systemd[1]: Started D-Bus System Message Bus.
Nov 27 08:19:28 server1 systemd[1]: Stopping D-Bus System Message Bus...
Nov 27 08:19:28 server1 systemd[1]: dbus.service: Succeeded.
Nov 27 08:19:28 server1 systemd[1]: Stopped D-Bus System Message Bus.
-- Boot 363a9b051c5d4b0289eec63faec54352 --
Nov 27 08:19:45 server1 systemd[1]: Started D-Bus System Message Bus.
xxxxxxxxxxxx@server1:~$ sudo journalctl --unit clusterchk.socket
-- Journal begins at Tue 2023-11-21 22:04:31 CET, ends at Mon 2023-11-27 11:35:11 CET. --
Nov 25 21:18:48 server1 systemd[1]: clusterchk.socket: Failed to queue service startup job (Maybe the service file is missing or not a template unit?): Argument list too long
Nov 25 21:18:48 server1 systemd[1]: clusterchk.socket: Failed with result 'resources'.
Nov 25 21:18:48 server1 systemd[1]: clusterchk.socket: Consumed 26min 57.357s CPU time.
Nov 27 07:51:19 server1 systemd[1]: Listening on Clusterchk socket.
Nov 27 07:51:32 server1 systemd[1]: clusterchk.socket: Failed to queue service startup job (Maybe the service file is missing or not a template unit?): Argument list too long
Nov 27 07:51:32 server1 systemd[1]: clusterchk.socket: Failed with result 'resources'.
-- Boot 29a3a40650fe42d4a37bdce0bdf2b71d --
Nov 27 08:19:11 server1 systemd[1]: Listening on Clusterchk socket.
Nov 27 08:19:33 server1 systemd[1]: clusterchk.socket: Succeeded.
Nov 27 08:19:33 server1 systemd[1]: Closed Clusterchk socket.
-- Boot 363a9b051c5d4b0289eec63faec54352 --
Nov 27 08:19:45 server1 systemd[1]: Listening on Clusterchk socket.

Nov 19 01:38:25 server1 systemd[1]: Started Check the status of Galera/MySQL (xxx.xxx.xxx.xxx:33982).
Nov 19 01:38:25 server1 systemd[1]: Started Check the status of Galera/MySQL (xxx.xxx.xxx.xxx:44600).
Nov 19 01:38:25 server1 clusterchk.sh[4010546]: /bin/echo: write error: Connection reset by peer
Nov 19 01:38:25 server1 clusterchk.sh[4010547]: /bin/echo: write error: Broken pipe
Nov 19 01:38:25 server1 clusterchk.sh[4010548]: /bin/echo: write error: Broken pipe
Nov 19 01:38:25 server1 clusterchk.sh[4010549]: /bin/echo: write error: Broken pipe
Nov 19 01:38:25 server1 systemd[1]: clusterchk@92030-xxx.xxx.xxx.xxx:9999-xxx.xxx.xxx.xxx:33982.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 01:38:25 server1 systemd[1]: clusterchk@92030-xxx.xxx.xxx.xxx:9999-xxx.xxx.xxx.xxx:33982.service: Failed with result 'exit-code'.
Nov 19 01:38:25 server1 clusterchk.sh[4010555]: /bin/echo: write error: Connection reset by peer
Nov 19 01:38:25 server1 clusterchk.sh[4010556]: /bin/echo: write error: Broken pipe
Nov 19 01:38:25 server1 clusterchk.sh[4010557]: /bin/echo: write error: Broken pipe
Nov 19 01:38:25 server1 clusterchk.sh[4010558]: /bin/echo: write error: Broken pipe
Nov 19 01:38:25 server1 systemd[1]: clusterchk@92031-xxx.xxx.xxx.xxx:9999-xxx.xxx.xxx.xxx:44600.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 01:38:25 server1 systemd[1]: clusterchk@92031-xxx.xxx.xxx.xxx:9999-xxx.xxx.xxx.xxx:44600.service: Failed with result 'exit-code'.

Nov 25 21:18:48 server1 systemd[1]: cannot add name, manager has too many units: Argument list too long
Nov 25 21:18:48 server1 systemd[1]: clusterchk.socket: Failed to queue service startup job (Maybe the service file is missing or not a template unit?): Argument list too long
Nov 25 21:18:48 server1 systemd[1]: clusterchk.socket: Failed with result 'resources'.
Nov 25 21:18:48 server1 systemd[1]: clusterchk.socket: Consumed 26min 57.357s CPU time.
Nov 26 00:00:08 server1 systemd[1]: cannot add name, manager has too many units: Argument list too long
Nov 26 00:00:08 server1 systemd[1]: cannot add name, manager has too many units: Argument list too long
Nov 26 00:00:08 server1 systemd[1]: cannot add name, manager has too many units: Argument list too long

This leads to a very slow system that can no longer be rebooted cleanly (only a hard reset at the VM level works).

Greetings dcz010
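
One way to check whether failed per-connection instances are what is piling up would be to count them. The following is only a rough sketch: it assumes the template is named clusterchk@.service, as the unit names in the logs above suggest, and the helper name count_failed_instances is made up for illustration. It parses text in the format printed by `systemctl list-units --plain --no-legend`, fed on stdin.

```shell
#!/bin/sh
# Hypothetical helper: counts failed clusterchk@ instances in
# `systemctl list-units --plain --no-legend` output supplied on stdin.
count_failed_instances() {
    # Column 1 is the unit name, column 3 is the ACTIVE state.
    awk '$1 ~ /^clusterchk@.*\.service$/ && $3 == "failed" { n++ } END { print n + 0 }'
}

# Usage on a live system:
#   systemctl list-units --all --plain --no-legend 'clusterchk@*' | count_failed_instances
# Failed instances can then be cleared with:
#   systemctl reset-failed 'clusterchk@*'
```

A count that keeps growing between runs would point at failed instances staying loaded rather than at file descriptors as such.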

matejzero commented 2 months ago

Can you show your systemd unit and socket files?

I had a problem when I was using dynamic users with systemd. Systemd had a bug where it wasn't cleaning up temporary folders and files, and I ran out of inodes.

https://github.com/systemd/systemd/issues/28271

dcz010 commented 2 months ago

Hi, thanks for your reply. I found the solution myself after a long search. The problem was that the per-connection units were not cleaned up after a successful or failed connection, so they accumulated until systemd ran out of resources (the "manager has too many units" errors above). I had to change clusterchk@.service like this:

[Unit]
Description=Check the status of Galera/MySQL
After=mysql.service
Requires=clusterchk.socket

[Service]
Type=simple
ExecStart=-/usr/local/bin/clusterchk.sh
TimeoutStopSec=5
StandardInput=socket
#StandardError=journal
#Restart=on-failure

[Install]
WantedBy=multi-user.target
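
A note on why this probably helps: the "-" prefix on the ExecStart= path tells systemd to record a non-zero exit code but treat it as success, so instances whose client hung up early (the "write error: Broken pipe" lines above) no longer end in the failed state. Failed units stay loaded until they are reset, which is likely how the manager filled up. An alternative that might achieve the same thing is CollectMode=inactive-or-failed in the [Unit] section (available since systemd 236). The sketch below writes that as a drop-in; the helper name write_collect_dropin and the file name collect.conf are made up for illustration, and it has not been tested against this setup.

```shell
#!/bin/sh
# Hypothetical helper: writes a drop-in for clusterchk@.service that lets
# systemd garbage-collect instances even when they end in the failed state.
# $1 is the systemd unit directory (normally /etc/systemd/system).
write_collect_dropin() {
    dropin_dir="$1/clusterchk@.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/collect.conf" <<'EOF'
[Unit]
CollectMode=inactive-or-failed
EOF
}

# Usage on a live system, followed by a reload:
#   write_collect_dropin /etc/systemd/system && systemctl daemon-reload
```

With the drop-in, failures still show up in the journal, whereas the "-" prefix hides them from the unit state entirely.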