Open larsn777 opened 3 months ago
Pinging code owners:
receiver/jaeger: @yurishkuro
Curious why you are still using UDP exporters, which afaik are only available in jaeger SDKs which are retired.
But speaking of those SDKs, the PR you have is not going to solve the problem because it only tracks packets that were received but not processed. The other vector for loss is packets not even making it to the receiver because of overload. Jaeger SDKs had a more reliable mechanism for that type of loss by including the count in the packets, such that the receiver would be able to detect the difference between number of spans sent and received from a client:
jaeger_agent_client_stats_batches_sent_total 0
jaeger_agent_client_stats_connected_clients 0
jaeger_agent_client_stats_spans_dropped_total{cause="full-queue"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="send-failure"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="too-large"} 0
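A rough sketch of the sent-versus-received comparison described above, assuming each batch carries a per-client sequence number. The `Batch` and `LossTracker` names and fields are hypothetical, chosen for illustration only; this is not the Jaeger Thrift schema or the agent's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical payload; does not mirror the real Jaeger Thrift schema."""
    client_id: str
    seq_no: int  # 1-based count of batches this client has sent so far


@dataclass
class LossTracker:
    """Tracks, per client, how many batches were sent versus received."""
    last_seq: dict = field(default_factory=dict)
    received: dict = field(default_factory=dict)

    def observe(self, batch: Batch) -> None:
        cid = batch.client_id
        self.received[cid] = self.received.get(cid, 0) + 1
        # Keep the highest sequence number seen; UDP may reorder packets.
        self.last_seq[cid] = max(self.last_seq.get(cid, 0), batch.seq_no)

    def lost_batches(self, client_id: str) -> int:
        # Batches the client reports having sent, minus batches that arrived.
        return self.last_seq.get(client_id, 0) - self.received.get(client_id, 0)


tracker = LossTracker()
for seq in (1, 2, 5):  # batches 3 and 4 never reached the receiver
    tracker.observe(Batch("svc-a", seq))
print(tracker.lost_batches("svc-a"))
```

One limitation of this approach: batches lost at the tail of a stream stay invisible until a later batch from the same client arrives.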
Hello
> Curious why you are still using UDP exporters, which afaik are only available in jaeger SDKs which are retired.
The short answer: legacy code. We want to move away from the Jaeger SDK on the client side, but we have more than 3k microservices, so updating the client libraries can take quite a long time.
> Jaeger SDKs had a more reliable mechanism for that type of loss by including the count in the packets, such that the receiver would be able to detect the difference between number of spans sent and received from a client
Yes, I know that client libraries can send statistics about the number of sent batches and errors. However, processing these statistics will not solve all problems with data loss:
In fact, I already have draft code in which the receiver processes client statistics. If I have some free time, I will try to open a PR for it in the near future.
Component(s)
receiver/jaeger
Describe the issue you're reporting
When using the jaeger receiver, we may periodically lose data on the collector due to a high incoming rate of Thrift packets. In this case, users do not even know that they are losing data on the receiver, since there are no metrics reporting these drops. Given that Jaeger libraries can send Thrift data without delivery confirmation (oneway methods), users have no way at all to know that data loss is occurring.
One possible solution is to export metrics for the number of processed and dropped Thrift packets. I can prepare and open the corresponding PR a little later.
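To illustrate the kind of accounting meant here, below is a minimal sketch assuming the receiver places incoming packets on a bounded queue and drops them when the queue is full. `PacketBuffer` and the counter names are hypothetical and are not taken from the collector's code.

```python
import queue
from collections import Counter


class PacketBuffer:
    """Bounded packet queue that counts processed and dropped packets."""

    def __init__(self, max_size: int) -> None:
        self._queue = queue.Queue(maxsize=max_size)
        self.stats = Counter()  # keys: "processed", "dropped"

    def offer(self, packet: bytes) -> bool:
        """Enqueue a packet; count a drop if the queue is full."""
        try:
            self._queue.put_nowait(packet)
        except queue.Full:
            self.stats["dropped"] += 1
            return False
        return True

    def drain(self, handler) -> None:
        """Pass every queued packet to handler, counting each as processed."""
        while True:
            try:
                packet = self._queue.get_nowait()
            except queue.Empty:
                return
            handler(packet)
            self.stats["processed"] += 1


buf = PacketBuffer(max_size=2)
for _ in range(5):  # burst of 5 packets into a queue that holds 2
    buf.offer(b"thrift-bytes")
buf.drain(lambda packet: None)
print(dict(buf.stats))
```

Exposing the two counters as metrics would make drops at this stage visible, though it would not cover packets lost before they reach the receiver.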