open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[receiver/jaegerreceiver] Thrift packets drop #34462

Open larsn777 opened 3 months ago

larsn777 commented 3 months ago

Component(s)

receiver/jaeger

Describe the issue you're reporting

When using the Jaeger receiver, the collector can periodically lose data when Thrift packets arrive at a high rate. The user does not even know that data is being lost at the receiver, because no metrics report these drops. Since Jaeger libraries can send Thrift data without delivery confirmation (oneway methods), the user has no way at all to know that data loss is occurring.

One possible solution is to export metrics for the number of processed and dropped Thrift packets. I can prepare and open the corresponding PR a little later.
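
To make the proposal concrete, here is a minimal sketch of the kind of accounting meant above, assuming the receiver hands packets from the UDP read loop to decoding workers through a bounded buffer. The names `packetStats`, `boundedQueue`, `offer` and `drain` are hypothetical and are not taken from the jaegerreceiver code; a real implementation would export the counters through the collector's telemetry instead of printing them.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// packetStats holds hypothetical counters for the receiver's packet buffer.
type packetStats struct {
	received  atomic.Int64 // packets read from the socket
	dropped   atomic.Int64 // packets rejected because the buffer was full
	processed atomic.Int64 // packets handed to the handler
}

// boundedQueue stands in for the buffer between the UDP read loop and
// the Thrift decoding workers (an assumption about the receiver design).
type boundedQueue struct {
	ch    chan []byte
	stats *packetStats
}

func newBoundedQueue(size int) *boundedQueue {
	return &boundedQueue{ch: make(chan []byte, size), stats: &packetStats{}}
}

// offer enqueues a packet without blocking and counts a drop when the
// queue is already full.
func (q *boundedQueue) offer(pkt []byte) bool {
	q.stats.received.Add(1)
	select {
	case q.ch <- pkt:
		return true
	default:
		q.stats.dropped.Add(1)
		return false
	}
}

// drain passes every currently queued packet to handle and counts it
// as processed.
func (q *boundedQueue) drain(handle func([]byte)) {
	for {
		select {
		case pkt := <-q.ch:
			handle(pkt)
			q.stats.processed.Add(1)
		default:
			return
		}
	}
}

func main() {
	q := newBoundedQueue(2)
	// Simulate a burst of five packets arriving before any worker runs.
	for i := 0; i < 5; i++ {
		q.offer([]byte{byte(i)})
	}
	q.drain(func([]byte) {})
	fmt.Printf("received=%d dropped=%d processed=%d\n",
		q.stats.received.Load(), q.stats.dropped.Load(), q.stats.processed.Load())
}
```

With a queue of size 2 and a burst of five packets, the counters show three drops that would otherwise be invisible. Note that this only covers drops inside the receiver process; packets lost in the kernel socket buffer or on the network never reach `offer`.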

github-actions[bot] commented 3 months ago

Pinging code owners:

yurishkuro commented 3 months ago

Curious why you are still using UDP exporters, which afaik are only available in jaeger SDKs which are retired.

But speaking of those SDKs, the PR you have is not going to solve the problem because it only tracks packets received but not processed. But the other vector for loss is packets not even making it to the receiver because of overload. Jaeger SDKs had a more reliable mechanism for that type of loss by including the count in the packets, such that the receiver would be able to detect the difference between number of spans sent and received from a client:

jaeger_agent_client_stats_batches_sent_total 0
jaeger_agent_client_stats_connected_clients 0
jaeger_agent_client_stats_spans_dropped_total{cause="full-queue"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="send-failure"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="too-large"} 0

larsn777 commented 3 months ago

Hello

> Curious why you are still using UDP exporters, which afaik are only available in jaeger SDKs which are retired.

The short answer: legacy code. We want to move away from the Jaeger SDK on the client side, but we have more than 3k microservices, so updating the client libraries can take quite a long time.

> Jaeger SDKs had a more reliable mechanism for that type of loss by including the count in the packets, such that the receiver would be able to detect the difference between number of spans sent and received from a client

Yes, I know that client libraries can send statistics about the number of sent batches and errors. However, processing these statistics will not solve every data loss problem:

  1. As far as I understand from the Thrift schema, sending statistics is optional, so some SDK versions may not send these statistics to the receiver.
  2. Even if the receiver starts processing client statistics, we still need packet rejection metrics from the receiver itself. Otherwise, it will most likely be difficult to determine exactly where the data loss occurs.
  3. With the OTelCol agent <-> gateway deployment scheme, agents can be placed on nodes that host a large number of services. To process statistics correctly, we need to be able to attribute them to each individual service with its own SDK instance.

In fact, I already have draft code in which the receiver processes client statistics. If I have some free time, I will try to open a PR for it in the near future.