opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

[BUG] Document level bulk request error messages are overridden by bulk level error message when max limit is reached #3507

Open graytaylor0 opened 10 months ago

graytaylor0 commented 10 months ago

Describe the bug

When a bulk request fails to write to OpenSearch, the failures are handled after max_retries has been exhausted. However, when the failure is logged or sent to the DLQ, the bulk-level message "Number of retries reached the limit of max retries (configured value %d)" is used instead of the document's bulkResponse with the error code and the exception. As a result, the root cause of the document-level failure is hidden by the code here (https://github.com/opensearch-project/data-prepper/blob/910f45161db66e536cf2e5e5efd9fa7bed68d9f9/data-prepper-plugins/opensearch/src/main/java/org/opensearch/dataprepper/plugins/sink/opensearch/BulkRetryStrategy.java#L251).

Expected behavior

Given how clustered the code here is, I think the simplest fix is to always include both the bulk-level error message (if it exists) and the document-level failure in the failure message that is logged or sent to the DLQ.
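A minimal sketch of what combining the two messages could look like. The class and method names (FailureMessageSketch, combine) are hypothetical and are not taken from BulkRetryStrategy; the separator and the "document failure:" label are assumptions as well, and the document-level text in main is a made-up placeholder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a class in Data Prepper.
final class FailureMessageSketch {

    // Joins the bulk-level message (when present) with the document-level failure,
    // so neither one overrides the other in the log line or DLQ entry.
    static String combine(String bulkLevelMessage, String documentLevelFailure) {
        List<String> parts = new ArrayList<>();
        if (bulkLevelMessage != null && !bulkLevelMessage.trim().isEmpty()) {
            parts.add(bulkLevelMessage.trim());
        }
        if (documentLevelFailure != null && !documentLevelFailure.trim().isEmpty()) {
            parts.add("document failure: " + documentLevelFailure.trim());
        }
        // Fall back to a generic text only when neither message is available.
        return parts.isEmpty() ? "unknown failure" : String.join("; ", parts);
    }

    public static void main(String[] args) {
        String bulk = String.format(
                "Number of retries reached the limit of max retries (configured value %d)", 8);
        // The document-level text below is a made-up placeholder.
        System.out.println(combine(bulk, "status 400, example mapping error"));
        System.out.println(combine(null, "status 400, example mapping error"));
    }
}
```

Building the message from a list of optional parts keeps the bulk-level text first when it exists, and still produces a useful message when only the document-level failure is known.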


graytaylor0 commented 10 months ago

Related to #3504

dlvenable commented 9 months ago

Here is an example I got:

{
  "dlqObjects": [
    {
      "pluginId": "opensearch",
      "pluginName": "opensearch",
      "pipelineName": "test-pipeline",
      "failedData": {
        "index": "dlq-failures-aoss",
        "indexId": "a001",
        "status": 0,
        "message": "Number of retries reached the limit of max retries (configured value 8)",
        "document": {
          "id": "a001",
          "name": "Test001",
          "number": 145,
          "action": "index"
        }
      },
      "timestamp": "2023-11-14T17:18:10.326Z"
    },
    {
      "pluginId": "opensearch",
      "pluginName": "opensearch",
      "pipelineName": "test-pipeline",
      "failedData": {
        "index": "dlq-failures-aoss",
        "indexId": "a002",
        "status": 0,
        "message": "Number of retries reached the limit of max retries (configured value 8)",
        "document": {
          "id": "a002",
          "name": "Test002",
          "number": 200,
          "action": "index"
        }
      },
      "timestamp": "2023-11-14T17:18:10.328Z"
    },
    {
      "pluginId": "opensearch",
      "pluginName": "opensearch",
      "pipelineName": "test-pipeline",
      "failedData": {
        "index": "dlq-failures-aoss",
        "indexId": "a003",
        "status": 0,
        "message": "Number of retries reached the limit of max retries (configured value 8)",
        "document": {
          "id": "a003",
          "name": "Test003",
          "number": 200,
          "action": "index"
        }
      },
      "timestamp": "2023-11-14T17:18:10.329Z"
    },
    {
      "pluginId": "opensearch",
      "pluginName": "opensearch",
      "pipelineName": "test-pipeline",
      "failedData": {
        "index": "dlq-failures-aoss",
        "indexId": "a004",
        "status": 0,
        "message": "Number of retries reached the limit of max retries (configured value 8)",
        "document": {
          "id": "a004",
          "name": "Test004",
          "number": 400,
          "action": "index"
        }
      },
      "timestamp": "2023-11-14T17:18:10.329Z"
    },
    {
      "pluginId": "opensearch",
      "pluginName": "opensearch",
      "pipelineName": "test-pipeline",
      "failedData": {
        "index": "dlq-failures-aoss",
        "indexId": "a005",
        "status": 0,
        "message": "Number of retries reached the limit of max retries (configured value 8)",
        "document": {
          "id": "a005",
          "name": "Test005",
          "number": 500,
          "action": "index"
        }
      },
      "timestamp": "2023-11-14T17:18:10.329Z"
    }
  ]
}

dlvenable commented 9 months ago

I also saw this while working on #3644.

dlvenable commented 4 months ago

We should probably keep "Number of retries reached the limit of max retries (configured value %d)" as the prefix to these messages.

KarstenSchnitter commented 4 months ago

Is it possible to add data from the source or the event itself?

We have a use case where data comes in from different applications and might fail due to field type collisions. In that case, it would be helpful to identify the origin of the events. For OTel events, this can be done via the resource attributes; for JSON messages, via particular fields of the message. Since Data Prepper has already parsed the message, it might have access to that kind of data and could add it to the DLQ message.
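One way this could work, as a rough sketch only: let the user configure a list of keys, and copy whichever of them are present on the event into the DLQ entry. Everything here is an assumption for illustration: the class and method names (OriginFieldsSketch, pickOriginFields), the event modeled as a plain Map, and the "application" and "team" field names. None of it reflects an existing Data Prepper API or option.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not a class in Data Prepper.
final class OriginFieldsSketch {

    // Copies only the configured keys that the event actually contains,
    // preserving the order in which the keys were configured.
    static Map<String, Object> pickOriginFields(Map<String, Object> event, List<String> originKeys) {
        Map<String, Object> origin = new LinkedHashMap<>();
        for (String key : originKeys) {
            if (event.containsKey(key)) {
                origin.put(key, event.get(key));
            }
        }
        return origin;
    }

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("id", "a001");
        event.put("name", "Test001");
        event.put("application", "billing"); // made-up origin field
        // "team" is not on the event, so only "application" is copied.
        System.out.println(pickOriginFields(event, Arrays.asList("application", "team")));
    }
}
```

Keys that are missing from the event are skipped rather than written as nulls, so DLQ entries for events from different applications only carry the origin fields they actually have.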