Closed: sideninja closed this issue 3 months ago
As per @sjonpaulbrown's comment:
If you are suggesting that the container would not crash, you could write a Prometheus metric that could be used for alerts in Grafana. That is the ideal pattern. Where possible, we try to avoid alerting on logs, but those could also be used. We do not programmatically trigger alerts or notifications. We write telemetry data that is either pushed or pulled into Grafana, and we create Grafana alerts to notify individuals who are on-call.
We need to alert when an API crash happens. Because we handle these crashes gracefully (they don't crash the node), we still need to trigger an alert in Grafana so we can react and the crash doesn't go unnoticed. If the node itself crashes we already have an alert, so we don't have to cover that case, but for any gracefully handled crash we should trigger an alert explicitly.
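A minimal sketch of the pattern described above, using only the Python standard library: wrap each API handler so that a recovered crash increments a counter, and render that counter in the Prometheus text exposition format for scraping. Every name here (`CrashCounter`, `guarded`, the metric name `api_handled_crashes_total`, the `endpoint` label, and the `get_balance` endpoint) is hypothetical and not taken from the codebase. In a real service the hand-rolled counter would most likely be replaced by the official Prometheus client library for the node's language.

```python
import threading
from typing import Any, Callable, Dict


class CrashCounter:
    """Thread-safe counter of gracefully handled crashes, keyed by endpoint.

    Hypothetical helper: the metric name and label are assumptions.
    """

    def __init__(self, name: str = "api_handled_crashes_total") -> None:
        self.name = name
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = {}

    def inc(self, endpoint: str) -> None:
        # Record one recovered crash for the given endpoint.
        with self._lock:
            self._counts[endpoint] = self._counts.get(endpoint, 0) + 1

    def value(self, endpoint: str) -> int:
        with self._lock:
            return self._counts.get(endpoint, 0)

    def render(self) -> str:
        # Prometheus text exposition format. Assumes endpoint names are
        # plain identifiers, so no label-value escaping is done here.
        lines = [
            f"# HELP {self.name} Crashes recovered without stopping the node.",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for endpoint, count in sorted(self._counts.items()):
                lines.append(f'{self.name}{{endpoint="{endpoint}"}} {count}')
        return "\n".join(lines) + "\n"


def guarded(
    counter: CrashCounter, endpoint: str, handler: Callable[..., Any]
) -> Callable[..., Any]:
    """Wrap a handler so a crash is counted instead of propagating."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return handler(*args, **kwargs)
        except Exception:
            # The crash is handled gracefully, so the node keeps running;
            # the counter is what makes it visible to alerting.
            counter.inc(endpoint)
            return None

    return wrapper


if __name__ == "__main__":
    crashes = CrashCounter()

    def get_balance() -> int:
        raise ValueError("simulated API crash")

    safe_get_balance = guarded(crashes, "get_balance", get_balance)
    safe_get_balance()
    print(crashes.render(), end="")
```

With a counter like this exposed on a metrics endpoint, a Grafana alert can fire whenever the counter increases, which follows the "write telemetry, alert in Grafana" approach from the comment rather than triggering notifications from code.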