opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

[BUG] Dots Discovered Key Names #4977

Open Conklin-Spencer-bah opened 1 week ago

Conklin-Spencer-bah commented 1 week ago

Describe the bug Keys that contain a dot "." cannot be processed.

When ingesting logs via Fluent Bit -> S3 -> SQS -> Data Prepper / OSIS -> OpenSearch, any key that contains a dot "." causes an error on ingestion; see the error from OSIS below. I believe this is because the Kubernetes metadata in labels contains dots.

2024-09-24T14:08:46.611 [s3-log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - operation = Index, status = 400, error = can't merge a non object mapping [kubernetes.labels.app] with an object mapping

The JSON blob looks like this:

    "labels": {
      "app": "fooservice",
      "app.kubernetes.io/component": "foo",
      "app.kubernetes.io/instance": "foo-in-cluster",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "fooservice",
      "app.kubernetes.io/version": "somelonghash",

If these labels aren't in the log, ingestion succeeds. One challenge is that the labels vary from service to service, so predicting what they will be is difficult. It would be preferable if there were a way to say: if a key contains a "." (or some other character), substitute it with "_" or whatever the user chooses.

It is possible that this can already be done and I am unaware of how to do so.

To Reproduce

Attempt to process and ingest a log file into OpenSearch with Data Prepper, where the log has keys that contain dots ".".

Such as:

    "labels": {
      "app": "fooservice",
      "app.kubernetes.io/component": "foo",
      "app.kubernetes.io/instance": "foo-in-cluster",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "fooservice",
      "app.kubernetes.io/version": "somelonghash",

Expected behavior The key in double quotes is processed as a key even when dots are present.

Environment (please complete the following information):

Additional context This seems related to the issue below, which was merged with a fix, but it is unclear how to resolve this issue.

https://github.com/opensearch-project/data-prepper/issues/450

KarstenSchnitter commented 6 days ago

Thanks for reporting this issue. This is actually a conflict between different field types in OpenSearch, and the document is rejected during indexing because of it. The issue arises because OpenSearch interprets dots "." in field names as path separators for nested JSON objects. Let me take your sample data and reduce it a little to illustrate the issue.

Let's say we want to index just the following document in OpenSearch:

{
  "labels": {
    "app": "fooservice",
    "app.kubernetes.io/component": "foo"
  }
}

OpenSearch expands the key app.kubernetes.io/component and gets a conflict:

{
  "labels": {
    "app": "fooservice",
    "app": {                            // Error, is "app" a string or an object?
      "kubernetes": {
        "io/component": "foo"
      }
    }
  }
}
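The expansion above can be reproduced with a small Python sketch. This is only a simplified model of the behavior described here, not OpenSearch's actual mapping code; the function name `expand_dotted` and the error text are made up for illustration.

```python
def expand_dotted(doc):
    """Expand dotted keys into nested dicts.

    Simplified model of the expansion described above (hypothetical helper,
    not OpenSearch code). Raises ValueError when the same field would have
    to be both a plain value and an object.
    """
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = expand_dotted(value)
        node = out
        parts = key.split(".")
        # Walk or create the intermediate objects for every part but the last.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"[{part}] is both a value and an object")
            node = child
        leaf = parts[-1]
        # The final part clashes if it already exists with the other kind.
        if leaf in node and isinstance(node[leaf], dict) != isinstance(value, dict):
            raise ValueError(f"[{leaf}] is both a value and an object")
        node[leaf] = value
    return out


sample = {"labels": {"app": "fooservice", "app.kubernetes.io/component": "foo"}}
try:
    expand_dotted(sample)
    conflict = None
except ValueError as err:
    conflict = str(err)
print(conflict)
```

With the reduced sample document, the sketch reports a clash on `app`, which corresponds to the "can't merge a non object mapping" error in the log.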

This issue happens a lot when logging K8s labels or annotations. It would also occur if Fluent Bit wrote to OpenSearch directly, so it is not a bug in Data Prepper per se. You can work around it by replacing the dots "." with underscores "_" using a small Lua script in Fluent Bit. We have developed this snippet for our own use cases. Such a transformation is usually known as dedotting, in case you want to google it.
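To make the idea of dedotting concrete, here is a minimal Python sketch. It is not the Lua snippet mentioned above and not a Data Prepper feature; the function name `dedot` and its `replacement` parameter are assumptions chosen for illustration.

```python
def dedot(value, replacement="_"):
    """Recursively replace "." in dict keys (hypothetical helper).

    Values are left untouched; only key names change.
    """
    if isinstance(value, dict):
        return {
            key.replace(".", replacement): dedot(child, replacement)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [dedot(item, replacement) for item in value]
    return value


record = {"labels": {"app": "fooservice", "app.kubernetes.io/component": "foo"}}
cleaned = dedot(record)
print(cleaned)
```

After this step `app` and `app_kubernetes_io/component` are sibling string fields, so the conflict no longer arises. One caveat of this naive version: two keys that differ only in "." versus the replacement character (for example `a.b` and `a_b`) collapse into one key.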

Data Prepper faces a similar issue with OpenTelemetry attributes. There, its processors dedot the attribute names by replacing certain dots "." with "@". In that case, the dedotting is hard-coded into the OpenTelemetry processors of Data Prepper. I am not experienced enough with the generic Data Prepper processors to give an example using them. The main problem, as I see it, is that you would not want to list every field name that should be dedotted in the pipeline configuration. In your example, it could be applied to all fields under labels, but it might be different for others.

Note that any dedotting procedure widens the gap between deployment and observability, because the names are altered. Unfortunately, there is no easy way around this. The unfolding of dotted names is a major feature of OpenSearch.