opensearch-project / alerting

📟 Get notified when your data meets certain conditions by setting up monitors, alerts, and notifications
https://opensearch.org/docs/latest/monitoring-plugins/alerting/index/
Apache License 2.0
60 stars 102 forks source link

[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525

Open JannikBrand opened 5 months ago

JannikBrand commented 5 months ago

What is the bug? After setting up an Alert Monitor which is sending alerts to a custom webhook and verifying that in general the alerts are sent to the webhook successfully, alerts are very often not being sent due to a connection reset issue after some time has passed.

Example: When editing a Monitor there is the possibility to "Send test message". When creating the monitor sending the test message works. After waiting ~30-60 mins it fails with status code 500 and "Failed to send webhook message Connection reset" (which you can see in the browser developer tools and the OpenSearch Dashboard notification popup in that case). But also if you just wait for the actual alerts being sent by the monitor the often result in this failed state.

When getting the "connection reset" error using the "Send test message" button, the first retry always succeeded. First idea: Improve / Introduce retry logic in the OpenSearch client when this happens.

It could have something todo with the monitor setup, not sure, I will share the monitor information below. With this configuration I could reproduce it for OpenSearch 1.3.15 and 2.12.0. It occurred for us when sending alerts to both, MS Teams and Slack as well as for the destination types slack and custom webhook.

Assumption: The issue seems to be on the OpenSearch side since otherwise there would also be the error message "Connection reset by peer".

Respective log of the OpenSearch data node:

[2024-04-12T06:52:05,486][ERROR][o.o.a.d.f.CustomWebhookDestinationFactory] [opensearch-data-1] Exception publishing Message: DestinationType: CUSTOMWEBHOOK, DestinationName:my-teams-channel, Url: https://<I_redacted_it>, scheme: HTTPS, Host: null, Port: -1, Path: null, Method: POST, Message: {
    "@type": "MessageCard",
    "@context": "http://schema.org/extensions",
    "themeColor": "0076D7",
    "summary": "My-Summary",
    "sections": [{
        "activityTitle": "OpenSearch Alert",
        "activitySubtitle": "Trigger: test-trigger",
        "activityImage": "https://teamsnodesample.azurewebsites.net/static/img/image5.png",
        "facts": [{
            "name": "Period start",
            "value": "2024-04-12T04:52:05.475Z"
        }, {
            "name": "Period end",
            "value": "2024-04-12T06:52:05.475Z"
        }, {
            "name": "Error",
            "value": ""
        }, {
            "name": "Raw",
            "value": "{monitor={_id=Ow03zY4BedGE4-daHuXs, _version=5, name=Test alerts, enabled=true}, results=[{_shards={total=1, failed=0, successful=1, skipped=0}, hits={hits=[], total={value=2, relation=eq}, max_score=null}, took=3, timed_out=false}], periodStart=2024-04-12T04:52:05.475Z, periodEnd=2024-04-12T06:52:05.475Z, error=null, trigger={id=OQ03zY4BedGE4-daHuXj, name=test-trigger, severity=1, actions=[{name=My Action}]}, alert={acknowledged_time=null, id=Qw04zY4BedGE4-daDuUP, version=-1, end_time=null, error_message=null, last_notification_time=1712897527013, severity=1, start_time=1712839986700, state=ACTIVE, bucket_keys=null, parent_bucket_path=null}}"
        }],
        "markdown": true
    }]
}
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1459) ~[?:?]
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1070) ~[?:?]
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar:4.5.13]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.getHttpResponse(DestinationHttpClient.java:127) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.execute(DestinationHttpClient.java:90) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:32) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:19) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.Notification.lambda$publish$0(Notification.java:35) [alerting-notification-1.3.15.0.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) [?:?]
    at org.opensearch.alerting.destination.Notification.publish(Notification.java:33) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.model.destination.Destination.publish(Destination.kt:298) [opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at org.opensearch.alerting.MonitorRunner$runAction$2.invokeSuspend(MonitorRunner.kt:659) [opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.6.10.jar:1.6.10-release-923(1.6.10)]
    at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]
[2024-04-12T06:52:05,490][INFO ][o.o.a.m.MonitorRunResult ] [opensearch-data-1] Internal error: java.net.SocketException: Connection reset. See the opensearch.log for details
java.lang.IllegalStateException: java.net.SocketException: Connection reset
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:36) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:19) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.Notification.lambda$publish$0(Notification.java:35) ~[alerting-notification-1.3.15.0.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
    at org.opensearch.alerting.destination.Notification.publish(Notification.java:33) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.model.destination.Destination.publish(Destination.kt:298) ~[opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at org.opensearch.alerting.MonitorRunner$runAction$2.invokeSuspend(MonitorRunner.kt:659) ~[opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.6.10.jar:1.6.10-release-923(1.6.10)]
    at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1459) ~[?:?]
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1070) ~[?:?]
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar:4.5.13]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.getHttpResponse(DestinationHttpClient.java:127) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.execute(DestinationHttpClient.java:90) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:32) ~[alerting-notification-1.3.15.0.jar:?]
    ... 11 more 

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create a destination (in newer versions called "notification channel") with the webhook URL and in my case method POST and Content-Type: application/json
  2. Create a Monitor:

    • per query monitor; running every 2 hours; using sample-index as data source and my-time as time field;
    • For the query use the visual editor; configure to check the document count for the last 24days
    • For the trigger: trigger when doc count is above 0
    • For the action: send a message to the created destination, e.g. for teams (reference):
    {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "themeColor": "0076D7",
        "summary": "My-Summary",
        "sections": [{
            "activityTitle": "OpenSearch Alert",
            "activitySubtitle": "Trigger: {{ctx.trigger.name}}",
            "activityImage": "https://teamsnodesample.azurewebsites.net/static/img/image5.png",
            "facts": [{
                "name": "Period start",
                "value": "{{ctx.periodStart}}"
            }, {
                "name": "Period end",
                "value": "{{ctx.periodEnd}}"
            }, {
                "name": "Error",
                "value": "{{ctx.error}}"
            }, {
                "name": "Raw",
                "value": "{{ctx}}"
            }],
            "markdown": true
        }]
    } 
  3. Verify that sending the test message works.
  4. Create the monitor.
  5. Wait for 1 hour & send the test message again to see the connection reset issue Or: Just wait e.g. 1 day until the alerts were triggered (every 2 hours). Some of them will be sent successfully and some of them are failed. For that: Click on the alert under "Alerts" to see its state.

What is the expected behavior? If the connection gets reset, I expect that OpenSearch will take care of re-establishing the connection and/or retry the requests.

What is your host/environment?

Do you have any screenshots?

Screenshot 2024-04-22 at 18 25 43

Do you have any additional context?

JannikBrand commented 5 months ago

When querying the alerts via GET _plugins/_alerting/monitors/alerts?monitorId=<id_placeholder>, this is a sample output (with OS 2.12.): The monitor was running 1 day, many alerts are send successfully while other alerts are not send due to the connection reset error. I also tried to run the monitor in 20 min intervals instead of 2hours. Almost every second alert is not sent.

{
  "alerts": [
    {
      "id": "sJSsBo8BomAnsTMDF0PN",
      "version": 11,
      "monitor_id": "kHM-Bo8Bgaj1Kgd4LjPY",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 5,
      "monitor_version": 1,
      "monitor_name": "My Test Monitor",
      "execution_id": "kHM-Bo8Bgaj1Kgd4LjPY_2024-04-22T16:38:09.502091053_6c5c58a6-3eaf-4283-ae57-88801bb703f4",
      "trigger_id": "jnM-Bo8Bgaj1Kgd4LTP6",
      "trigger_name": "My-Test-Trigger",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "ACTIVE",
      "error_message": null,
      "alert_history": [
        {
          "timestamp": 1713868689538,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713854289430,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713839889444,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713825489518,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713811089532,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        }
      ],
      "severity": "1",
      "action_execution_results": [
        {
          "action_id": "notification167289",
          "last_execution_time": 1713875890185,
          "throttled_count": 0
        }
      ],
      "start_time": 1713803892665,
      "last_notification_time": 1713875890370,
      "end_time": null,
      "acknowledged_time": null
    }
  ],
  "totalAlerts": 1
}
sbcd90 commented 5 months ago

added to backlog.