Observability & Alerting

Note: This issue is part of the Service Transfer Project. The goal is to ensure that project documentation is up to date and to help the receiving team understand what the service does and how to maintain and operate it. The previous team is primarily responsible for doing this work; the receiving team is the stakeholder on this issue and has final approval.

These are a set of guidelines, not a rigid set of requirements. If the receiving team already has expertise on this service and is comfortable operating it, they may complete whatever subset of the tasks they find appropriate and close this issue.

The assignees on this issue are intended to be "manager of previous team" and "manager of new team", based on what's in the Service Ownership Spreadsheet. If these are incorrect, please update the assignees on this issue and update the spreadsheet to match.

Observability & Alerting

Share your approach to observability and alerting with the receiving team. Link to the configured monitors, and for each one describe what it measures and what it means if it fires. Consider scheduling a meeting with the receiving team to present and discuss this information.

The standards we suggest during the Production Readiness process are here for reference:

Ensure that the service is set up with appropriate monitoring and alerting mechanisms to detect and respond to issues in a timely manner.

For monitors that go to a pager, optimize for waking someone up only when human intervention is required. Avoid paging on noisy signals that may be false positives.
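One common way to cut down on false-positive pages is to require that a signal stay above its threshold for several consecutive evaluations before paging. The following is a minimal sketch of that idea, not how any particular monitoring tool implements it. The class name `SustainedBreachDetector`, the 5% error-rate threshold, and the three-sample window are all hypothetical values chosen for illustration.

```python
from collections import deque


class SustainedBreachDetector:
    """Signals a page only after `required` consecutive breaching samples."""

    def __init__(self, threshold: float, required: int) -> None:
        self.threshold = threshold
        self.required = required
        # Keep only the most recent `required` breach flags.
        self._recent = deque(maxlen=required)

    def observe(self, value: float) -> bool:
        """Record one sample; return True if a page should fire."""
        self._recent.append(value > self.threshold)
        return len(self._recent) == self.required and all(self._recent)


# Hypothetical numbers: page when the error rate exceeds 5% for 3 samples in a row.
detector = SustainedBreachDetector(threshold=0.05, required=3)
samples = [0.01, 0.09, 0.02, 0.07, 0.08, 0.11]
decisions = [detector.observe(s) for s in samples]
# Isolated spikes (0.09) do not page; only the final sustained run does.
```

Most monitoring systems expose an equivalent setting (an evaluation window or "for" duration), so in practice this logic usually lives in the monitor configuration rather than in service code.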

[ ] Read the Alerting section of Observability & Alerts and make sure your monitors cover, at a minimum, the scenarios described there.
[ ] Create monitors for this service to detect:
- [ ] Error rate
- [ ] Latency
- [ ] Low or no throughput
- [ ] Reduced capacity
- [ ] Full outage
- [ ] Any other relevant issues (e.g. long queue length)
[ ] Container-level monitors:
- [ ] Container restarts
- [ ] Pod restarts
- [ ] OOM kills
- [ ] Minimum pods: the total number of pods should be at least N+2, where N is the number of pods needed to run your service with some capacity overhead built in. Alert when the number of pods is ≤ N.
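The minimum-pods rule above can be written out as a small worked example. This is only a sketch of the arithmetic: the function names and the value N = 4 are hypothetical, and N is assumed to already include the capacity overhead mentioned in the checklist.

```python
def minimum_pod_target(n: int) -> int:
    """Pods to run in steady state: N plus two spares."""
    return n + 2


def should_alert_on_pod_count(running_pods: int, n: int) -> bool:
    """Alert once the running pod count is at or below N."""
    return running_pods <= n


# Hypothetical example: the service needs N = 4 pods.
n = 4
target = minimum_pod_target(n)                     # 6 pods in steady state
one_pod_lost = should_alert_on_pod_count(5, n)     # False: still above N
two_pods_lost = should_alert_on_pod_count(4, n)    # True: at N, no spare left
```

With two spare pods, losing a single pod does not trigger the alert. The alert fires only once both spares are gone and the service is running with no margin beyond N.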

Further reading: https://sre.google/sre-book/monitoring-distributed-systems/

netlify / gotrue

Observability & Alerting #354
