Open michaeljfazio opened 4 years ago
My node gets stuck after a while: the lastBlockDate field does not update for hours until I manually stop and start the node.
Software watchdog is a systemd functionality, it barely has anything to do with Jormungandr itself. There is a related component systemd-notify that can be utilized to send events to systemd watchdog helper. I think 2 modifications are needed in Jormungandr:
Software watchdog is a systemd functionality, it barely has anything to do with Jormungandr itself.
Correct?
There is a related component systemd-notify that can be utilized to send events to systemd watchdog helper.
Yes. The systemd-notify util calls into systemd watchdog functionality and allows scripts to interface with systemd watchdogs. Compiled applications that can link directly to sd_notify, however, don't need, nor should they use, systemd-notify.
Jormungandr should use the sd_notify crate and notify systemd directly of process status events.
During startup/bootstrap, Jormungandr would notify systemd that it is in READY=0 state. Once bootstrap completes, Jormungandr would notify systemd of READY=1 state. Then, with each block received/created, Jormungandr would notify systemd with WATCHDOG=1.
All of this via proper service calls and NOT with externally invoked systemd-notify.
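The sequence described above could look roughly like the following sketch. It speaks the notification protocol directly using only the Rust standard library: systemd passes a Unix datagram socket path in the NOTIFY_SOCKET environment variable, and the service writes plain-text assignments such as READY=1 or WATCHDOG=1 to it. The function names are hypothetical and not part of Jormungandr, and abstract-namespace sockets (names starting with @) are not handled. Because sd_notify(3) documents READY=1 but no READY=0, the bootstrap phase is reported here with a STATUS= line instead, which is an assumption of this sketch.

```rust
use std::env;
use std::io;
use std::os::unix::net::UnixDatagram;
use std::path::Path;

/// Sends one notification datagram (e.g. "READY=1") to the given socket path.
fn notify_to<P: AsRef<Path>>(socket_path: P, state: &str) -> io::Result<()> {
    let socket = UnixDatagram::unbound()?;
    socket.send_to(state.as_bytes(), socket_path)?;
    Ok(())
}

/// Sends a notification to the socket named by NOTIFY_SOCKET.
/// Returns Ok(false) when the variable is unset, i.e. the process is not
/// running under a unit that expects notifications.
fn notify(state: &str) -> io::Result<bool> {
    match env::var_os("NOTIFY_SOCKET") {
        Some(path) => {
            notify_to(&path, state)?;
            Ok(true)
        }
        None => Ok(false),
    }
}

// Hypothetical hook points; where these would be called inside
// Jormungandr is an assumption, not something taken from its codebase.

/// Called while the node is still bootstrapping.
fn on_bootstrap_started() -> io::Result<bool> {
    notify("STATUS=Bootstrapping")
}

/// Called once bootstrap has completed.
fn on_bootstrap_completed() -> io::Result<bool> {
    notify("READY=1")
}

/// Called for every block received or created.
fn on_block_processed() -> io::Result<bool> {
    notify("WATCHDOG=1")
}
```

In practice the sd_notify crate mentioned above would replace the hand-written socket code; the point of the sketch is only that the calls are made in-process rather than through an externally invoked systemd-notify.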
A stuck notifier could be interpreted as an ERROR, so systemd can automatically restart the service.
That is a bad idea. It is always better to know the state and act on it accordingly: your machine may simply have lost network connectivity, for example. If the API reports the stuck state, you can script additional diagnostics to decide on the action.
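As a rough illustration of "know the state, then decide", the sketch below separates detection from action. The names (Action, decide) and the idea of feeding in a separate network check are hypothetical; how the time since the last block is obtained (for instance by polling lastBlockDate) is left out.

```rust
use std::time::Duration;

/// Possible outcomes of a diagnostic pass (hypothetical names).
#[derive(Debug, PartialEq)]
enum Action {
    /// Blocks are still arriving within the expected window.
    Healthy,
    /// The node looks stuck, but the host has no connectivity,
    /// so restarting the node would not help.
    WaitForNetwork,
    /// The node looks stuck although the network is fine.
    Restart,
}

/// Decides what to do given how long ago the last block was seen,
/// the threshold after which the node counts as stuck, and the
/// result of an independent connectivity check.
fn decide(since_last_block: Duration, stuck_after: Duration, network_ok: bool) -> Action {
    if since_last_block < stuck_after {
        Action::Healthy
    } else if !network_ok {
        Action::WaitForNetwork
    } else {
        Action::Restart
    }
}
```

A monitoring script built this way restarts the node only when the extra diagnostics point at the node itself.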
Is your feature request related to a problem/context? Please describe if applicable.
Conditions arise where a node may become "stuck". Some of these conditions are known. For example, when a node misses too many blocks (for whatever reason), it will not recover. In the past, bugs have also been the cause of nodes deadlocking, and thus not progressing. When this happens, leader nodes miss slots and a small child dies.
Describe the solution you'd like
Systemd is a widely used service manager, which is now standard on many Linux distributions. Systemd has built-in watchdog capabilities that allow software with critical code that executes at a known or desirable cadence to notify systemd of its "liveness" at periodic intervals via the sd_notify call. The systemd unit file can then be configured to respond to a "dead" service by restarting it.
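A unit file for this could look something like the fragment below. The directives are standard systemd.service options; the binary path, the omitted arguments, and the 120-second timeout are placeholders chosen for illustration, not recommended values.

```ini
[Service]
# Hypothetical install path; command-line arguments omitted.
ExecStart=/usr/local/bin/jormungandr
# The service signals readiness itself via sd_notify (READY=1).
Type=notify
# The service is considered dead if no WATCHDOG=1 arrives within this window.
WatchdogSec=120
# Restart the service only when the watchdog timeout fires.
Restart=on-watchdog
```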
Given that the timing of new block creation is well established, a systemd watchdog function can be sensibly implemented in Jormungandr.
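When WatchdogSec= is set, systemd exports the timeout to the service in the WATCHDOG_USEC environment variable (in microseconds), and sd_watchdog_enabled(3) recommends sending a keep-alive every half of that time. A minimal sketch of deriving that interval follows; the function name is hypothetical, and choosing a timeout that comfortably exceeds the expected gap between blocks is left to the operator.

```rust
use std::time::Duration;

/// Parses a WATCHDOG_USEC value and returns the suggested keep-alive
/// interval (half the timeout). Returns None when the value is missing,
/// not a number, or zero, meaning the watchdog is not enabled.
fn ping_interval(watchdog_usec: Option<&str>) -> Option<Duration> {
    let usec: u64 = watchdog_usec?.trim().parse().ok()?;
    if usec == 0 {
        return None;
    }
    Some(Duration::from_micros(usec / 2))
}
```

A caller would typically pass in the result of `std::env::var("WATCHDOG_USEC").ok().as_deref()`.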
Additional context
Provisions to build Jormungandr with systemd logging integration already exist in the codebase. This feature toggle can be conveniently used to implement systemd watchdog functionality.
Many poorly implemented mechanisms to monitor and reboot "stuck" Jormungandr processes already exist in the wild, and many more are being implemented each day. If not implemented correctly, these can result in nodes rebooting superfluously and indefinitely, which puts strain on the network. A robust solution, such as the one described here, will help prevent the network from becoming a victim of nodes practicing such self-immolation.