Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Heartbeat (Webhook) stuck after opsgenie connection issue #3669

Open D3luxee opened 6 months ago

D3luxee commented 6 months ago

What did you do?

Recently we started getting regular Opsgenie heartbeat-expired alerts. The Alertmanager logs indicated a temporary issue connecting to Opsgenie. Perhaps Opsgenie has been less reliable recently, which revealed a bug that has existed for a long time.

After this issue, Alertmanager stopped sending heartbeats and alerts via the webhook integration entirely. Restarting Alertmanager solved the problem. We have also observed the same issue with a webhook on another setup that is unrelated to Opsgenie.

What did you expect to see?

The Alertmanager webhook integration should recover from connection issues.

What did you see instead? Under which circumstances?

The Alertmanager webhook integration did not recover on its own and required a restart.

Environment

kube-prometheus-stack with VictoriaMetrics. Alertmanager versions 0.25.0 and 0.26.0 are affected.

subhashgehlot commented 3 months ago

We are facing a similar issue: we frequently get heartbeat expiry alerts, and they only go away after an Alertmanager restart. The problem seems to be that the TCP connection is not closed after the heartbeat check. Instead of opening a new connection, Alertmanager keeps using the existing kept-alive connection, and that connection might be the broken one.

alertmanager version 0.25.0

grobinson-grafana commented 3 months ago

Given the error "read: connection reset by peer", it does sound like it is using a "dead" connection, where one side considers the connection open but the other side has closed it, hence the TCP RST. I just looked at the code and the default idle timeout is 5 minutes. Does the issue resolve after 5 minutes if the Alertmanager is not restarted?
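
To make the idle-timeout question concrete, here is a minimal Go sketch of how a pooled HTTP client can drop stale keep-alive connections. It is not Alertmanager's actual code: the function names newHeartbeatClient, isConnReset, and sendHeartbeat are hypothetical, the 10 second request timeout is an assumption, and the errno check assumes a Unix-like system. Only the 5 minute idle timeout comes from the comment above.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"syscall"
	"time"
)

// newHeartbeatClient (hypothetical name) builds a client whose pooled
// keep-alive connections are closed after sitting idle for idleTimeout.
func newHeartbeatClient(idleTimeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second, // assumed per-request timeout
		Transport: &http.Transport{
			IdleConnTimeout: idleTimeout,
		},
	}
}

// isConnReset reports whether err wraps ECONNRESET, the errno behind
// "read: connection reset by peer".
func isConnReset(err error) bool {
	return errors.Is(err, syscall.ECONNRESET)
}

// sendHeartbeat (hypothetical) issues one GET. On a connection reset it
// also drops the remaining idle connections in the pool, so the next
// attempt has to dial a fresh TCP connection.
func sendHeartbeat(c *http.Client, url string) error {
	resp, err := c.Get(url)
	if err != nil {
		if isConnReset(err) {
			c.CloseIdleConnections()
		}
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the connection can be returned to the pool.
	_, _ = io.Copy(io.Discard, resp.Body)
	if resp.StatusCode >= 300 {
		return fmt.Errorf("heartbeat: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	c := newHeartbeatClient(5 * time.Minute)
	fmt.Println("request timeout:", c.Timeout)
	// A wrapped ECONNRESET is still recognised by isConnReset.
	fmt.Println(isConnReset(fmt.Errorf("read: %w", syscall.ECONNRESET)))
}
```

If the failure really is a dead pooled connection, a client set up this way would be expected to recover at the latest once the idle timeout expires, which is why the 5 minute check is a useful test of that theory.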

subhashgehlot commented 3 months ago

No, the issue persists and it needs an Alertmanager restart to solve the problem.


grobinson-grafana commented 3 months ago

You waited 5 minutes?

subhashgehlot commented 3 months ago

Yes, waited for more than 5 minutes and then restarted.