Open JannikBrand opened 6 months ago
The monitor ran for 1 day. Many alerts were sent successfully, while others were not sent due to the connection reset error.
I also tried running the monitor at 20-minute intervals instead of every 2 hours. Almost every second alert is not sent.
When querying the alerts via GET _plugins/_alerting/monitors/alerts?monitorId=<id_placeholder>, this is a sample output (with OS 2.12):
{
  "alerts": [
    {
      "id": "sJSsBo8BomAnsTMDF0PN",
      "version": 11,
      "monitor_id": "kHM-Bo8Bgaj1Kgd4LjPY",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 5,
      "monitor_version": 1,
      "monitor_name": "My Test Monitor",
      "execution_id": "kHM-Bo8Bgaj1Kgd4LjPY_2024-04-22T16:38:09.502091053_6c5c58a6-3eaf-4283-ae57-88801bb703f4",
      "trigger_id": "jnM-Bo8Bgaj1Kgd4LTP6",
      "trigger_name": "My-Test-Trigger",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "ACTIVE",
      "error_message": null,
      "alert_history": [
        {
          "timestamp": 1713868689538,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713854289430,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713839889444,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713825489518,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713811089532,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        }
      ],
      "severity": "1",
      "action_execution_results": [
        {
          "action_id": "notification167289",
          "last_execution_time": 1713875890185,
          "throttled_count": 0
        }
      ],
      "start_time": 1713803892665,
      "last_notification_time": 1713875890370,
      "end_time": null,
      "acknowledged_time": null
    }
  ],
  "totalAlerts": 1
}
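To see how regularly the failures occur, the `alert_history` entries can be summarized with a small script. This is only a sketch: it assumes the response has already been parsed into a Python dict (the triple-quoted `message` strings in the Dev Tools output above are not valid JSON as shown), and the helper names `failed_deliveries` and `gaps_in_hours` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Pulls the delivery status text out of the embedded exception message.
STATUS_TEXT = re.compile(r'"status_text"\s*:\s*"([^"]*)"')


def failed_deliveries(response):
    """Return (UTC time, status_text) for each failed action, oldest first."""
    failures = []
    for alert in response.get("alerts", []):
        for entry in alert.get("alert_history") or []:
            message = entry.get("message", "")
            if "Failed running action" not in message:
                continue
            match = STATUS_TEXT.search(message)
            # Timestamps in the response are epoch milliseconds.
            when = datetime.fromtimestamp(entry["timestamp"] / 1000, tz=timezone.utc)
            failures.append((when, match.group(1) if match else message))
    return sorted(failures, key=lambda failure: failure[0])


def gaps_in_hours(failures):
    """Return the time between consecutive failures, in hours."""
    times = [when for when, _ in failures]
    return [round((b - a).total_seconds() / 3600, 2) for a, b in zip(times, times[1:])]
```

Applied to the five history entries above, the failure timestamps are each roughly four hours apart, which matches the observation that almost every second run of the 2-hour monitor fails.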
What is the bug? After setting up an alert monitor that sends alerts to a custom webhook, and verifying that alerts generally reach the webhook successfully, alerts very often fail to be sent due to a connection reset once some time has passed.
Example: When editing a monitor, there is the option to "Send test message". Right after creating the monitor, sending the test message works. After waiting ~30-60 mins, it fails with status code 500 and "Failed to send webhook message Connection reset" (visible in the browser developer tools and in the OpenSearch Dashboards notification popup). The actual alerts sent by the monitor also often end up in this failed state.
When getting the "connection reset" error using the "Send test message" button, the first retry always succeeded. First idea: introduce or improve retry logic in the OpenSearch client for this case.
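The retry idea could look roughly like the following. This is a language-neutral illustration written in Python, not the notifications plugin's actual code; `send_with_retry` and its parameters are hypothetical, and `send` stands for whatever callable performs the webhook POST.

```python
import time


def send_with_retry(send, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(), retrying with exponential backoff on a connection reset.

    send       -- hypothetical zero-argument callable that performs the webhook POST
    attempts   -- total number of tries, including the first one
    base_delay -- delay in seconds before the first retry; doubles on each retry
    sleep      -- injectable so the backoff can be observed or skipped in tests
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except ConnectionResetError:
            # Out of tries: surface the original error to the caller.
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Only the connection reset is retried here, since that is the one failure observed to succeed on the next attempt; other errors propagate unchanged.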
It could have something to do with the monitor setup; I am not sure, so I will share the monitor information below. With this configuration I could reproduce it on OpenSearch 1.3.15 and 2.12.0. It occurred for us when sending alerts to both MS Teams and Slack, and for both the slack and custom webhook destination types.
Assumption: The issue seems to be on the OpenSearch side since otherwise there would also be the error message "Connection reset by peer".
Respective log of the OpenSearch data node:
How can one reproduce the bug? Steps to reproduce the behavior:
Create a Monitor:
What is the expected behavior? If the connection gets reset, I expect that OpenSearch will take care of re-establishing the connection and/or retry the requests.
What is your host/environment?
Do you have any screenshots?
Do you have any additional context?