Heartbeat (Webhook) stuck after opsgenie connection issue

D3luxee commented 6 months ago

What did you do?

Recently we started to get regular Opsgenie heartbeat-expired alerts. The Alertmanager logs indicated a temporary issue connecting to Opsgenie. Maybe Opsgenie has become less reliable recently, which revealed a bug that has existed for a long time.

Alertmanager stopped sending heartbeats/alerts via the webhook integration entirely after this issue. Restarting Alertmanager solved it. We've also observed the same issue using a webhook on another setup that is unrelated to Opsgenie.

What did you expect to see?

The Alertmanager webhook integration should recover from connection issues.

What did you see instead? Under which circumstances?

The Alertmanager webhook integration did not recover from the issue on its own and required a restart to recover.

Environment

kube-prometheus-stack and VictoriaMetrics; Alertmanager versions 0.25.0 and 0.26.0 are affected.

System information:

Kubernetes / GKE and Rancher / RKE2
Alertmanager version:

Initially we observed the issue with Alertmanager 0.25.0 and upgraded to 0.26.0 hoping to solve it, but 0.26.0 showed the exact same error.

version="(version=0.25.0, branch=HEAD, revision=258fab7cdd551f2cf251ed0348f0ad7289aee789)"
version="(version=0.26.0, branch=HEAD, revision=d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)"

Prometheus version:

Both Prometheus and VictoriaMetrics setups are affected.

Logs:

ts=2024-01-05T20:52:00.263Z caller=notify.go:757 level=info component=dispatcher receiver=opsgenie.heartbeat integration=webhook[0] aggrGroup="{}/{alertname=~\"Watchdog|InfoInhibitor\"}:{alertname=\"Watchdog\", cluster=\"redacted by me\"}" msg="Notify success" attempts=2
ts=2024-01-05T20:51:59.463Z caller=notify.go:745 level=warn component=dispatcher receiver=opsgenie.heartbeat integration=webhook[0] aggrGroup="{}/{alertname=~\"Watchdog|InfoInhibitor\"}:{alertname=\"Watchdog\", cluster=\"redacted by me\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": read tcp 172.16.0.20:44786->52.84.251.74:443: read: connection reset by peer"

Sometimes these logs even contain an HTTP status page returned by Opsgenie, but that would be too noisy to post here.

subhashgehlot commented 3 months ago

We are facing a similar issue: we frequently get heartbeat expiry alerts, and they go away after an Alertmanager restart. The problem seems to be that the TCP connection is not closed after the heartbeat check. So instead of opening a new connection at that moment, Alertmanager reuses the existing kept-alive connection, and that connection might be the problematic one.

alertmanager version 0.25.0

grobinson-grafana commented 3 months ago

Given the error "read: connection reset by peer", it does sound like it is using a "dead" connection, where one side considers the connection open but the other side has closed it, hence the TCP RST. I just looked at the code and the default idle timeout is 5 minutes. Does the issue resolve after 5 minutes if the Alertmanager is not restarted?

subhashgehlot commented 3 months ago

No, the issue persists and it needs an Alertmanager restart to solve the problem.

grobinson-grafana commented 3 months ago

You waited 5 minutes?

subhashgehlot commented 3 months ago

Yes, waited for more than 5 minutes and then restarted.

prometheus / alertmanager

Heartbeat (Webhook) stuck after opsgenie connection issue #3669