opensearch-project / data-prepper

OpenSearch Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Local dataprepper DLQ file is not getting filled with dropped traces #4736

Open dontknowany opened 3 months ago

dontknowany commented 3 months ago

Hi all,

I'm not sure if I'm missing something or if I'm doing something wrong on my end.

In my current Data Prepper setup, some traces are getting dropped and written to Data Prepper's own logs, which pollutes them and buries important messages between all the logged traces. I would like to investigate those dropped traces further, so I added a local DLQ path to the sink section of my pipeline config so that the traces are written to that file for troubleshooting instead of to the logs.

The problem I'm facing now is that the DLQ file stays empty and is not being filled with the dropped traces; they are still being dropped into Data Prepper's logs. Data Prepper is also not complaining about a misconfigured DLQ in its logs. The DLQ file just stays empty for some reason.

Using AWS S3 for the DLQ is not an option for me since we don't have it in our tech stack; we would need support for GCS or other S3-compatible third-party storage.

I found the config for the DLQ path here: https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/README.md

This is my current pipeline.yaml file:

otel-collector-pipeline:
  workers: 8
  delay: "100"
  source:
    otel_trace_source:
      ssl: false
      port: 9000
      health_check_service: true
  buffer:
    bounded_blocking:
      buffer_size: 25600
      batch_size: 400
  sink:
  - pipeline:
      name: raw-pipeline
  - pipeline:
      name: service-map-pipeline
raw-pipeline:
  workers: 8
  delay: "3000"
  source:
    pipeline:
      name: otel-collector-pipeline
  buffer:
    bounded_blocking:
      buffer_size: 25600
      batch_size: 3200
  processor:
  - otel_traces:
  - otel_trace_group:
      hosts: [ "https://api.of.our.opensearch.instance.com" ]
      username: dataprepper-user
      password: xxxxxxxxxxxxxxx
  sink:
  - opensearch:
      hosts: [ "https://api.of.our.opensearch.instance.com" ]
      username: dataprepper-user
      password: xxxxxxxxxxxxxxx
      index_type: trace-analytics-raw
      dlq_file: /usr/share/data-prepper/log/dlq-file
service-map-pipeline:
  workers: 8
  delay: "100"
  source:
    pipeline:
      name: otel-collector-pipeline
  processor:
  - service_map:
      window_duration: 180
  buffer:
    bounded_blocking:
      buffer_size: 25600
      batch_size: 400
  sink:
  - opensearch:
      hosts: [ "https://api.of.our.opensearch.instance.com" ]
      username: dataprepper-user
      password: xxxxxxxxxxxxxxx
      index_type: trace-analytics-service-map
      dlq_file: /usr/share/data-prepper/log/dlq-file

Maybe I misunderstand what the DLQ is or should be used for; I thought it would be suitable for my use case. I also don't know whether it is expected behaviour for Data Prepper to write complete traces into its own logs.

I'm using the newest release of Data Prepper (2.8.0).

I hope someone can shed some light on my problem and help me fix it.

If you need any additional information that I did not provide or forgot to provide, just ask and I will try my best to give you that information.

Cheers!

dlvenable commented 3 months ago

@dontknowany , do you have any logs from Data Prepper itself that may help us understand what the cause might be? Also, do you see whether the file is being created by Data Prepper at all? Is it empty, or is there no file?

dontknowany commented 3 months ago

@dlvenable Sadly, there seem to be no useful logs written by Data Prepper besides the normal startup logs, which show nothing out of the ordinary. DP is not complaining about an invalid pipeline config, nor is it crashing. DP starts up as normal and does create the file named "dlq-file" under the specified path, but the file stays empty and DP keeps throwing all "invalid" traces into its own logs like before.

kkondaka commented 3 months ago

@dontknowany Data Prepper doesn't put invalid traces into the DLQ. That is a feature we are working on, but for now events are sent to the DLQ only when they fail to be written to the sink.
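
To illustrate the distinction with a sketch (reusing the sink from your pipeline above, so the hosts, credentials, and file path are just the values from this issue):

  sink:
  - opensearch:
      hosts: [ "https://api.of.our.opensearch.instance.com" ]
      username: dataprepper-user
      password: xxxxxxxxxxxxxxx
      index_type: trace-analytics-raw
      # Only documents that the sink fails to write to OpenSearch (for example
      # items rejected during a bulk request) are appended here. Events dropped
      # or logged before they reach the sink never appear in this file.
      dlq_file: /usr/share/data-prepper/log/dlq-file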

dontknowany commented 3 months ago

@kkondaka Thank you for your answer :) Would be nice to have such a feature in the future. Glad to hear that you guys are working on that :)

Another question: why is Data Prepper putting traces into its logs then? Is it just for "documentation's" sake, so you can see that something is happening? Are the traces in Data Prepper's logs wrong or invalid in any shape or form, or is it expected behaviour that they are logged that way? :)

KarstenSchnitter commented 3 months ago

Maybe it would help to get some metrics from Data Prepper. They usually indicate buffer utilization quite nicely, and we might get an idea of where your data is stuck. It would also be good to know the status of the circuit breakers.
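
In case the metrics are not exposed yet, a minimal sketch of how that is typically enabled in data-prepper-config.yaml (assuming the default core API port of 4900 and that a Prometheus meter registry works for you; adjust to your actual setup):

ssl: false
# Publishes Data Prepper's internal metrics (buffer utilization, sink latency,
# and so on) at /metrics/prometheus and /metrics/sys on the core API port.
metricRegistries: [Prometheus]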

dontknowany commented 3 months ago

@KarstenSchnitter Regarding metrics: we have a complete Grafana dashboard for various Data Prepper metrics: buffer usage, heap usage, CPU usage, records written to/read from the buffer, latency, records written to OpenSearch, and so on. You name it, we've got it, so to speak.

The interesting thing is that nothing in the metrics indicates an issue. There are no records that failed to be written into the buffer, nor any errors from OpenSearch about failed bulk ingestion requests.

Regarding the circuit breaker: we have had it in place for only a month or so. We never pushed Data Prepper to its limits, which is why we hadn't configured it before. Although our load has not changed, I configured it now just in case.

KarstenSchnitter commented 3 months ago

@dontknowany How do the write rates of the collector-pipeline compare to the raw-pipeline and to the messages indexed in OpenSearch? How do they compare to the actual rate of sent messages? Can you provide an example of a failing event? If the mapping of the event fails, it will not be written to the buffer at all, if I remember correctly.

dontknowany commented 3 months ago

@KarstenSchnitter Looking at our metrics, on average about 400 messages per second come into our stack through our 3 collector pipelines/instances. They all get pushed to Kafka, which we use for buffering, and after that a single OTel ingestor instance does some light filtering and downsampling to 25%. So in the end roughly 100 messages per second are ingested into OpenSearch via Data Prepper. I don't see an issue with that; it looks fine to me.

Regarding an example of a failed event: I can give you an example of a trace that gets logged by Data Prepper. The only thing I noticed about the trace below is that the trace_id and the span_id look a bit off. Could that be the reason why the trace is getting logged? I also can't find the trace in OpenSearch (which makes sense if the trace is invalid).

  spans {
  trace_id: "\371QB\034\177\027\005\034\020\024\274\207\317\"\266\272"
  span_id: "\204\350\271s\355\016\032\006"
  name: "Accounts/user-settings"
  kind: SPAN_KIND_SERVER
  start_time_unix_nano: 1723022929638597800
  end_time_unix_nano: 1723022929642042400
  attributes {
    key: "net.host.name"
    value {
      string_value: "application-frontend.prod.some.app.in.somecompany.com"
    }
  }
  attributes {
    key: "http.method"
    value {
      string_value: "GET"
    }
  }
  attributes {
    key: "http.scheme"
    value {
      string_value: "http"
    }
  }
  attributes {
    key: "http.target"
    value {
      string_value: "/accounts/user-settings"
    }
  }
  attributes {
    key: "http.url"
    value {
      string_value: "http://application-frontend.prod.some.app.in.somecompany.com/accounts/user-settings"
    }
  }
  attributes {
    key: "http.flavor"
    value {
      string_value: "1.1"
    }
  }
  attributes {
    key: "http.user_agent"
    value {
      string_value: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edge/127.0.0.0"
    }
  }
  attributes {
    key: "http.route"
    value {
      string_value: "Accounts/user-settings"
    }
  }
  attributes {
    key: "http.status_code"
    value {
      int_value: 200
    }
  }
  status {
  }
}

I don't know why this trace is getting logged by Data Prepper. It's disrupting our log messages: it floods the logs, and the important entries get buried between those big trace dumps. It also generates a lot of log documents in OpenSearch, since every line of the trace above becomes one log message in OpenSearch :/

KarstenSchnitter commented 3 months ago

Indeed, the trace and span IDs look weird. According to the OTLP protobuf definition, they are supposed to be a 16-byte and an 8-byte array, respectively. Data Prepper maps these fields in the class OTelProtoCodec with the function convertByteStringToString:

import com.google.protobuf.ByteString;
import org.apache.commons.codec.binary.Hex;
// ...

// Encodes the raw OTLP byte array (e.g. the 16-byte trace id) as a lowercase hex string.
public static String convertByteStringToString(ByteString bs) {
    return Hex.encodeHexString(bs.toByteArray());
}

This means the document to be indexed in OpenSearch should contain a 32-character hex string for the trace ID and a 16-character hex string for the span ID. This is clearly not the case in your example, where the values appear to be unmapped.

I am also curious about the empty status field. I cannot find any such field in the class JacksonSpan (https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/JacksonSpan.java) or in the associated interface Span. This makes me wonder how you obtained the log of the event. Is this from Data Prepper or from the OpenTelemetry Collector?

dontknowany commented 2 months ago

The event that I sent was indeed from Data Prepper and not from one of the OpenTelemetry Collectors. Just to clarify our setup, our complete trace pipeline looks like this:

Application --> Otel Collector --> Kafka --> Otel Collector --> DataPrepper --> OpenSearch

KarstenSchnitter commented 2 months ago

Do you by any chance have an event from the OTel Collector that was sent to Data Prepper and resulted in an output like the one you pasted earlier?

dontknowany commented 1 month ago

Sorry for not being more active on this issue. I was on vacation, and there were also other things that needed attention.

Regarding your question: our OTel Collectors are really quiet when it comes to logs/events. They only emit something when there is a real issue, for example Data Prepper being unavailable. We mostly get events from Data Prepper itself and not from our OTel Collectors. However, we do get events from the OTel Collector that we use to read traces out of Kafka and ingest them into Data Prepper.

This is one of the events:

2024-09-25T06:32:15.823Z  error  exporterhelper/queue_sender.go:92  Exporting failed. Dropping data.  {"kind": "exporter", "data_type": "traces", "name": "otlp/data-prepper", "error": "not retryable error: Permanent error: rpc error: code = InvalidArgument desc = name cannot be an empty string", "dropped_items": 16}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
    go.opentelemetry.io/collector/exporter@v0.107.0/exporterhelper/queue_sender.go:92
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
    go.opentelemetry.io/collector/exporter@v0.107.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
    go.opentelemetry.io/collector/exporter@v0.107.0/internal/queue/consumers.go:43

I would assume that if the OTel Collector drops the invalid data, it would not cause any event in Data Prepper, since the dropped items never reach it.

KarstenSchnitter commented 1 month ago

I was wondering if you could use the debug exporter with detailed verbosity to log the traces that are rejected by Data Prepper to the console. This is also described in the Troubleshooting section of the OpenTelemetry Collector configuration documentation.
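
Roughly, that could look like the sketch below in the collector configuration. The otlp/data-prepper exporter name is taken from your error log above, and the kafka receiver is an assumption based on your described setup; keep whatever receivers and processors you already have.

exporters:
  debug:
    verbosity: detailed   # prints the full span content, including all attributes, to the collector's stdout

service:
  pipelines:
    traces:
      receivers: [kafka]                      # assumption: your existing receiver(s)
      exporters: [otlp/data-prepper, debug]   # add debug alongside the existing exporter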

dontknowany commented 1 month ago

Good idea. I will enable that, see if I get any useful information out of those debug logs, and then let you know with some examples.