As far as I recall, this is on purpose and is also the case in other tracers. But the senders should time out between the flushing cycles. This is what's missing right now.
If the queue reaches its maximum size, it is no longer possible to enqueue the append command. This happens when the connection between the microservice and the server is down, or the central server itself is down.
I checked again with the implementation of the reference client in Go and this is intentional (see https://github.com/jaegertracing/jaeger-client-go/blob/master/reporter.go#L283). With the queue size you define how many items should be cached at most. All others added later on will be marked as dropped. Flushing is also not triggered by the queue size limit being reached (see the sketch below).
The only thing that might be missing is the timeout in the senders. But the RemoteReporter itself is implemented as expected. The timeout will also make sure the queue gets emptied again and lets new items be added.
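To illustrate the intended drop-instead-of-block behavior, here is a minimal sketch (not the actual client code; the capacity value and the dropped counter are stand-ins for the reporter's maxQueueSize and metrics):

```csharp
using System.Collections.Concurrent;
using System.Threading;

class BoundedQueueSketch
{
    // Bounded command queue, sized like the reporter's maxQueueSize.
    private readonly BlockingCollection<object> _commandQueue =
        new BlockingCollection<object>(boundedCapacity: 100);

    private long _droppedCount;

    public bool TryAppend(object command)
    {
        // TryAdd without a timeout never blocks: when the queue is full,
        // the command is counted as dropped instead of stalling the caller.
        if (_commandQueue.TryAdd(command))
            return true;

        Interlocked.Increment(ref _droppedCount); // stand-in for a metrics counter
        return false;
    }
}
```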
@yurishkuro Can you tell me how the Go client avoids the senders' flushing getting stuck? I assume you have a timeout for the transport's flushing?
@EngRajabi The problem is not that our queue runs full when the server is not available; the problem is that the sender's flushing, triggered by the reporter, is blocking indefinitely. Once the senders time out properly (most probably using a CancellationToken when calling Flush on the sender), the reporter's flush will also return, freeing up the queue. So there is no need to make the RemoteReporter more complicated by adding timers and other things. It behaves as specified as it is right now.
It's basically https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Senders/Jaeger.Senders.Thrift/ThriftSender.cs#L108-L115 and https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Senders/Jaeger.Senders.Grpc/GrpcSender.cs#L130-L145 which are blocking. Currently, https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Jaeger.Core/Reporters/RemoteReporter.cs#L178-L179 starts the flushing with CancellationToken.None. This should be the only change necessary to have flushing time out at all. I just want to make sure we do it similarly to how the other clients implement/configure this timeout. https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Jaeger.Core/Reporters/RemoteReporter.cs#L122-L125 will then mark those spans as dropped. So everything is in place already besides the timeout.
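Something along these lines is what I have in mind (a sketch only, assuming the ISender.FlushAsync(CancellationToken) signature from Jaeger.Core; the 10-second value and where the timeout gets configured are open questions until we know how the other clients do it):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Jaeger.Senders;

static class FlushTimeoutSketch
{
    // Instead of sender.FlushAsync(CancellationToken.None), give the flush
    // a deadline so an unreachable collector cannot block the queue loop forever.
    // The 10-second value is a placeholder until we align with the other clients.
    public static async Task<int> FlushWithTimeoutAsync(ISender sender)
    {
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10)))
        {
            // On timeout the token is cancelled; assuming the sender's transport
            // observes it, FlushAsync throws and the reporter marks the batch as dropped.
            return await sender.FlushAsync(cts.Token).ConfigureAwait(false);
        }
    }
}
```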
Any https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Jaeger.Core/Reporters/RemoteReporter.cs#L96-L98 can be skipped as AppendCommand and FlushCommand should not take indefinitely, which ensures that there will be enough space to enqueue another FlushCommand, especially since AppendCommand is nearly immediate and only appending, so it does not really clog up the queue. This is referring to https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Senders/Jaeger.Senders.Thrift/ThriftSender.cs#L48-L56 where, in the unlikely case that there would never be enough space for a FlushCommand, the AppendCommand will also trigger a Flush internally when the buffer is full (see the sketch below).
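For reference, the append-triggers-flush idea looks roughly like this (a simplified sketch of what the linked ThriftSender lines do, not the real implementation; the real sender checks the serialized byte size, not a span count):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class BufferingSenderSketch
{
    private readonly List<object> _spanBuffer = new List<object>();
    private const int MaxBufferSize = 100; // stand-in for the byte-size check

    public async Task<int> AppendAsync(object span, CancellationToken cancellationToken)
    {
        _spanBuffer.Add(span);

        // Appending is nearly immediate; only when the buffer is full does
        // the append itself pay the cost of a flush over the network.
        if (_spanBuffer.Count >= MaxBufferSize)
            return await FlushAsync(cancellationToken).ConfigureAwait(false);

        return 0; // nothing sent yet
    }

    public Task<int> FlushAsync(CancellationToken cancellationToken)
    {
        int flushed = _spanBuffer.Count;
        _spanBuffer.Clear();
        // Real senders serialize and transmit the batch here.
        return Task.FromResult(flushed);
    }
}
```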
I hope this clarifies why I'm against the changes in your PR. So please, let's wait for @yurishkuro to reply on what the standard timeout behavior is in other clients before trying to fix this any other way.
@EngRajabi I just did some tests and am not really able to see your problem. If the server is not available, the flushing should already fail as you can see here (with gRPC sender):
[09:25:26 ERR Jaeger.Reporters.RemoteReporter] QueueProcessor error
System.AggregateException: One or more errors occurred. (Failed to flush spans.)
---> Jaeger.Exceptions.SenderException: Failed to flush spans.
---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="failed to connect to all addresses", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1605687926.735000000","description":"Failed to pick subchannel","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\ext\filters\client_channel\client_channel.cc","file_line":4134,"referenced_errors":[{"created":"@1605687926.691000000","description":"failed to connect to all addresses","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\ext\filters\client_channel\lb_policy\pick_first\pick_first.cc","file_line":398,"grpc_status":14}]}")
at Jaeger.Senders.Grpc.GrpcSender.FlushAsync(CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at Jaeger.Senders.Grpc.GrpcSender.FlushAsync(CancellationToken cancellationToken)
at Jaeger.Reporters.RemoteReporter.FlushCommand.ExecuteAsync()
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at Jaeger.Reporters.RemoteReporter.ProcessQueueLoop()
The same should be valid for the Thrift sender. Can you try to add more information on what your configuration is and when it happens? I understood that you cannot add more spans if your queue is already full and they are not able to be flushed since the server is not reachable. But this is the whole point of the maxQueueSize and that's on purpose. Those spans should get dropped after a couple of seconds when the sending timeout kicks in, which frees up the queue again, appending and sending out the next batch. Therefore, the loop is freed and can process the next AppendCommand and FlushCommand being added later on (see the sketch below). So there is no blocking of the queue.
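To make that intended flow concrete, the queue loop is conceptually this (a sketch only, ignoring metrics and shutdown handling; command types are simplified to delegates):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class QueueLoopSketch
{
    private readonly BlockingCollection<Func<Task>> _commandQueue =
        new BlockingCollection<Func<Task>>(boundedCapacity: 100);

    // Conceptual version of ProcessQueueLoop: every command completes
    // (success, failure, or timeout), so the loop always comes back to
    // pick up the next AppendCommand/FlushCommand.
    public async Task ProcessQueueLoopAsync()
    {
        foreach (var command in _commandQueue.GetConsumingEnumerable())
        {
            try
            {
                await command().ConfigureAwait(false);
            }
            catch (Exception ex)
            {
                // A failed or timed-out flush only drops that batch;
                // the loop itself keeps running and frees queue space.
                Console.Error.WriteLine($"QueueProcessor error: {ex.Message}");
            }
        }
    }
}
```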
This is expected when the endpoint is configured but not reachable. Usually you have a Jaeger Agent configured as a sidecar with UDP on the same host, which then pushes the data to the collector at its own rate. If you want to push data directly to the collector, either use the HttpSender from Thrift or the GrpcSender; a rough example follows below.
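For example, picking the sender explicitly could look like this (a sketch using this repo's builder APIs as I remember them; host, port, endpoint, and queue size values are placeholders):

```csharp
using Jaeger;
using Jaeger.Reporters;
using Jaeger.Samplers;
using Jaeger.Senders.Thrift;

// Default setup: UDP to a Jaeger Agent sidecar on the same host.
var udpSender = new UdpSender("localhost", 6831, 0); // 0 = default max packet size

// Alternative: push directly to the collector over HTTP instead
// (or use the GrpcSender from Jaeger.Senders.Grpc the same way).
// var httpSender = new HttpSender("http://jaeger-collector:14268/api/traces");

var reporter = new RemoteReporter.Builder()
    .WithSender(udpSender)
    .WithMaxQueueSize(100) // the queue this issue is about
    .Build();

var tracer = new Tracer.Builder("my-service")
    .WithReporter(reporter)
    .WithSampler(new ConstSampler(true))
    .Build();
```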
Problem: if the maximum number of queued items is reached, you can no longer enqueue the append command. It is possible that in some cases we reach the maximum queue size and can no longer enqueue updates. This situation may occur when, for example, the server is not available.