Schnitzel closed this 1 year ago
These alerting systems are currently outside the scope of Lagoon, but could be handled at the cluster level with Prometheus?
We can look at adding a metrics endpoint to the controller, though.
v0.4.1 of the controller has a new metrics endpoint. The metrics are pretty basic at the moment, but they might be enough to check for consistent failures over time, as there is a counter (incremented on each failure) for total build failures.
Closing, as the metrics endpoint exists. If it doesn't provide suitable information for alerting, then we can extend the metrics provided to try and cover what is required.
A single failed build is not very informative to alert on, since code issues can also cause failed deployments. But we could try to implement logic in the system that detects when all builds started in the last 15 minutes have failed, which would point to an infrastructure issue rather than an individual environment issue.
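
To make the 15-minute idea concrete, here is a minimal sketch in Python. It is not how the controller works today. The `Build` record, the `all_recent_builds_failed` helper and the `min_builds` threshold are all hypothetical names introduced for illustration; the threshold is an extra assumption so that one lone failed build does not trigger an alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Build:
    """Hypothetical build record: when it started and whether it failed."""
    started_at: datetime
    failed: bool


def all_recent_builds_failed(
    builds: Iterable[Build],
    now: datetime,
    window: timedelta = timedelta(minutes=15),
    min_builds: int = 2,  # assumption: ignore windows with too few builds
) -> bool:
    """Return True when every build started within `window` before `now` failed."""
    # Keep only the builds that started inside the look-back window.
    recent = [b for b in builds if now - window <= b.started_at <= now]
    # Too few builds to tell a code problem from an infrastructure problem.
    if len(recent) < min_builds:
        return False
    return all(b.failed for b in recent)
```

Requiring at least two builds in the window is a judgment call: with a single build, a code issue and an infrastructure issue look the same, which is the concern raised above.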