target / goalert

Open source on-call scheduling, automated escalations, and notifications so you never miss a critical alert
https://goalert.me
Apache License 2.0
2.22k stars 240 forks source link

Alert status manager gets stuck with empty log ID #3783

Closed mastercactapus closed 6 months ago

mastercactapus commented 6 months ago

Describe the Bug: If an alert's status is updated, without a corresponding entry in the alert_logs table for any reason, the status manager gets stuck with errors, potentially blocking status updates for other alerts, and the alert in question never has things like Slack messages update.

Steps to Reproduce: Since it's an edge case, the exact steps are uncertain. But it follows this pattern:

  1. Update an alert
  2. Delete the associated log entry before the status manager has processed the alert

Expected Behavior: On failure to record a log, an error should be logged, but status updates should continue to work even in the absence of a specific alert's log.

Observed Behavior: Status updates break entirely because they stay "stuck" on the error-state alert.

Screenshots/Stack Traces:

2024-03-22 10:37:36.852 time="2024-03-22T15:37:36Z" level=error AuthSystemComponent=Engine SQLErrDetails="Failing row contains (ffffffff-ffff-ffff-ffff-ffffffffffff, alert_status_update, null, 2024-03-22 15:37:36.840009+00, pending, 2024-03-22 15:37:36.840009+00, , null, null, 0, null, null, null, 78595, null, null, null, null, null, null, 0, eeeeeeee-eeee-eeee-eeee-eeeeeeeeeeee, null, null, null)." Trigger=INTERVAL error="Engine.StatusUpdateManager: send channel status update message: ERROR: new row for relation "outgoing_messages" violates check constraint "om_status_update_log_id" (SQLSTATE 23514)

Application Version: Discovered on GoAlert with GitCommit 5e8ce484c02987983668fbeeb4a11431f4e85389.

Desktop: Not applicable as it is a backend/engine issue.

Additional Context Fix notes:

mastercactapus commented 6 months ago

It might be helpful to emit an error upon hitting the missing log entry case (as it shouldn't happen) as long as it's in a way that the error will only be logged once.

mastercactapus commented 6 months ago

Start of fix: https://github.com/target/goalert/tree/fix-status-missing-log