turing-machines / BMC-Firmware

Turing-pi BMC firmware
GNU General Public License v2.0

Basic node health-monitoring from the BMC #170

Open srcshelton opened 5 months ago

srcshelton commented 5 months ago

Is your feature request related to a problem? Please describe.

As an end-user of a TP2 system, I would like basic health-monitoring to be performed automatically by the BMC and reported via the Web UI and any future alerting system (see #153), so that I can be made aware if a node has experienced a problem.

Describe the solution you'd like

I'm not sure what access the BMC has to the built-in switch. Realising that monitoring of any level of sophistication will likely require an agent running on each node's host OS, I'm wondering what we can do at a basic level without any further access to the node.

For example:

Describe alternatives you've considered

Having a standard/default TP2 monitoring agent which specifically reports to a service running on the BMC would be great too, but that's a separate enhancement request ;)

Additional context

(We might get much of this by including monit or similar in the BMC default packages, but there would then need to be a way to plug this into the BMC Web UI)

srcshelton commented 5 months ago

If a node is detected as failing, there could also be an option to automatically power-cycle it… although there should also be a configurable limit on retries, so that a node can't get stuck in a reboot loop.
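To make the retry limit concrete, here is a minimal sketch of how such a policy might look. It is only an illustration: `PowerCyclePolicy`, `max_retries`, `on_health_result` and the `power_cycle` callback are hypothetical names and not part of the existing BMC firmware or its API, and how a node is judged "healthy" is left entirely to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class PowerCyclePolicy:
    """Caps automatic power-cycles per node (all names here are hypothetical)."""

    max_retries: int = 3  # configurable limit on consecutive automatic power-cycles
    attempts: Dict[int, int] = field(default_factory=dict)  # node -> cycles used

    def on_health_result(
        self, node: int, healthy: bool, power_cycle: Callable[[int], None]
    ) -> str:
        if healthy:
            # A healthy report clears the counter so later failures start fresh.
            self.attempts.pop(node, None)
            return "ok"
        used = self.attempts.get(node, 0)
        if used >= self.max_retries:
            # Limit reached: stop cycling and leave the node for the user to inspect.
            return "gave_up"
        self.attempts[node] = used + 1
        power_cycle(node)  # assumed hook into whatever actually toggles node power
        return "power_cycled"


if __name__ == "__main__":
    cycled = []  # stands in for the real power control; records requested cycles
    policy = PowerCyclePolicy(max_retries=2)
    for _ in range(4):
        print(policy.on_health_result(1, False, cycled.append))
```

In this sketch a single healthy report resets the counter, so a node that comes up briefly and then fails again would not be caught. Counting power-cycles within a time window would be a stricter alternative.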