jaegertracing / jaeger-client-csharp

🛑 This library is DEPRECATED!
https://jaegertracing.io/
Apache License 2.0

buffer block size #193

Closed. EngRajabi closed this issue 3 years ago.

EngRajabi commented 4 years ago

Problem: once the maximum number of buffer blocks is filled, the update (flush) command can no longer be run.

If the buffer reaches its maximum number of blocks, it is no longer possible to issue the update command. In some cases we hit the maximum number of buffered items and can no longer trigger an update at all. This can happen, for example, when the server is not available.

Falco20019 commented 4 years ago

As far as I recall, this is on purpose and is also the case in other tracers. But the senders should time out between the flushing cycles; that is what's missing right now.

EngRajabi commented 4 years ago

If the buffer reaches its maximum, it is no longer possible to send the update command. This happens when the connection between the microservice and the server is down, or when the central server itself is down.

Falco20019 commented 4 years ago

I checked again against the implementation of the reference client in Go, and this is intentional (see https://github.com/jaegertracing/jaeger-client-go/blob/master/reporter.go#L283). The queue size defines the maximum number of items to cache; all others added later on are marked as dropped. Flushing is also not triggered by the queue size limit being reached.

The only thing that might be missing is a timeout in the senders; the RemoteReporter itself is implemented as expected. The timeout would also make sure the queue gets emptied again and lets new items be added.
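
To illustrate the drop-on-full behavior described above, here is a minimal C# sketch (not the actual RemoteReporter code; all names are illustrative):

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Minimal sketch of a bounded command queue that drops items once full,
// mirroring the maxQueueSize behavior described above. Illustrative only.
public class BoundedCommandQueue
{
    private readonly BlockingCollection<object> _queue;
    private long _droppedCount;

    public BoundedCommandQueue(int maxQueueSize)
    {
        // The queue is bounded at construction time and never grows.
        _queue = new BlockingCollection<object>(maxQueueSize);
    }

    public void Append(object command)
    {
        // TryAdd returns false immediately when the queue is full, so the
        // caller never blocks; the item is simply counted as dropped.
        if (!_queue.TryAdd(command))
        {
            Interlocked.Increment(ref _droppedCount);
        }
    }

    public long DroppedCount => Interlocked.Read(ref _droppedCount);
}
```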

@yurishkuro Can you tell me how the Go client avoids the senders' flushing getting stuck? I assume you have a timeout for the transport's flushing?

Falco20019 commented 4 years ago

@EngRajabi The problem is not that our queue runs full when the server is not available; the problem is that the sender's flushing, triggered by the reporter, blocks indefinitely. Once the senders time out properly (most probably via a CancellationToken when calling Flush on the sender), the reporter's flush will also return, freeing up the queue. So there is no need to make the RemoteReporter more complicated by adding timers and other things; it behaves as specified as it is right now.

It's basically https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Senders/Jaeger.Senders.Thrift/ThriftSender.cs#L108-L115 and https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Senders/Jaeger.Senders.Grpc/GrpcSender.cs#L130-L145 which are blocking. Currently, https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Jaeger.Core/Reporters/RemoteReporter.cs#L178-L179 starts the flushing with CancellationToken.None. This should be the only change necessary to have flushing time out at all; I just want to make sure we do it similarly to how the other clients implement/configure this timeout. https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Jaeger.Core/Reporters/RemoteReporter.cs#L122-L125 will then mark those spans as dropped. So everything is in place already besides the timeout.
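
As a rough illustration of that change, here is a self-contained C# sketch of a flush command that uses a timeout-backed CancellationToken instead of CancellationToken.None. The interface stub and the flushTimeout parameter are assumptions for the example, not the actual Jaeger.Core types:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stub mirroring the sender's FlushAsync(CancellationToken)
// referenced in the links above; the real interface lives in Jaeger.Senders.
public interface ISpanSender
{
    Task<int> FlushAsync(CancellationToken cancellationToken);
}

public class TimedFlushCommand
{
    private readonly ISpanSender _sender;
    private readonly TimeSpan _flushTimeout;

    public TimedFlushCommand(ISpanSender sender, TimeSpan flushTimeout)
    {
        _sender = sender;
        _flushTimeout = flushTimeout;
    }

    public async Task ExecuteAsync()
    {
        // Instead of CancellationToken.None, use a token that cancels after
        // the configured timeout. If the collector is unreachable, the flush
        // aborts, the queue loop continues, and space frees up for new
        // AppendCommand/FlushCommand items.
        using (var cts = new CancellationTokenSource(_flushTimeout))
        {
            await _sender.FlushAsync(cts.Token).ConfigureAwait(false);
        }
    }
}
```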

Any flush skipped by https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Jaeger.Core/Reporters/RemoteReporter.cs#L96-L98 can safely be skipped, as AppendCommand and FlushCommand should not take indefinitely, which ensures that there will be enough space to enqueue another FlushCommand later on, especially since AppendCommand is nearly immediate and only appends, so it does not really clog up the queue. This refers to https://github.com/jaegertracing/jaeger-client-csharp/blob/e3b70cdff25dc6406883b6b0681daa10e28441de/src/Senders/Jaeger.Senders.Thrift/ThriftSender.cs#L48-L56: in the unlikely case that there would never be enough space for a FlushCommand, the AppendCommand will also trigger a flush internally when the buffer is full.
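
To make that last point concrete, here is a minimal C# sketch of an append method that flushes internally when its buffer would overflow. It is illustrative only, not the real ThriftSender; the field names and byte-size bookkeeping are assumptions:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative buffering sender: appending a span flushes the buffer first
// whenever the new span would not fit, so no explicit FlushCommand is
// strictly required to make room.
public abstract class BufferingSenderSketch
{
    private readonly List<byte[]> _buffer = new List<byte[]>();
    private readonly int _maxBufferBytes;
    private int _bufferBytes;

    protected BufferingSenderSketch(int maxBufferBytes)
    {
        _maxBufferBytes = maxBufferBytes;
    }

    // Sends the buffered spans; in the real senders this is the call that
    // needs the timeout discussed in this thread.
    protected abstract Task<int> FlushAsync(CancellationToken cancellationToken);

    public async Task<int> AppendAsync(byte[] serializedSpan, CancellationToken cancellationToken)
    {
        int flushed = 0;
        if (_bufferBytes + serializedSpan.Length > _maxBufferBytes)
        {
            // Buffer would overflow: flush what is buffered before appending.
            flushed = await FlushAsync(cancellationToken).ConfigureAwait(false);
            _buffer.Clear();
            _bufferBytes = 0;
        }

        _buffer.Add(serializedSpan);
        _bufferBytes += serializedSpan.Length;
        return flushed;
    }
}
```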

I hope this clarifies why I'm against the changes in your PR. So please let's wait for @yurishkuro to reply on what the standard timeout behavior is in other clients before we try to fix this any other way.

Falco20019 commented 4 years ago

@EngRajabi I just did some tests and am not really able to reproduce your problem. If the server is not available, the flushing should already fail, as you can see here (with the gRPC sender):

[09:25:26 ERR Jaeger.Reporters.RemoteReporter] QueueProcessor error
System.AggregateException: One or more errors occurred. (Failed to flush spans.)
 ---> Jaeger.Exceptions.SenderException: Failed to flush spans.
 ---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="failed to connect to all addresses", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1605687926.735000000","description":"Failed to pick subchannel","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\ext\filters\client_channel\client_channel.cc","file_line":4134,"referenced_errors":[{"created":"@1605687926.691000000","description":"failed to connect to all addresses","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\ext\filters\client_channel\lb_policy\pick_first\pick_first.cc","file_line":398,"grpc_status":14}]}")
   at Jaeger.Senders.Grpc.GrpcSender.FlushAsync(CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Jaeger.Senders.Grpc.GrpcSender.FlushAsync(CancellationToken cancellationToken)
   at Jaeger.Reporters.RemoteReporter.FlushCommand.ExecuteAsync()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at Jaeger.Reporters.RemoteReporter.ProcessQueueLoop()

The same should hold for the Thrift sender. Can you add more information on what your configuration is and when it happens? I understand that you cannot add more spans if your queue is already full and the spans cannot be flushed because the server is not reachable. But this is the whole point of maxQueueSize, and it is on purpose. Those spans should get dropped after a couple of seconds once the sending timeout kicks in, which frees up the queue again so the next batch can be appended and sent out. The loop is then freed and can process the next AppendCommand and FlushCommand added later on, so the queue is not blocked.

EngRajabi commented 4 years ago

[Screenshot attachment: Screenshot_20201118-202615_Chrome]

Falco20019 commented 4 years ago

This is expected when the endpoint is configured but not reachable. Usually you have a Jaeger agent configured as a sidecar with UDP on the same host, which then pushes the data to the collector at its own rate. If you want to push data directly to the collector, use either the HttpSender from Thrift or the GrpcSender.
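
For reference, here is a hedged C# sketch of the two setups mentioned above, based on my understanding of the client's Configuration API; host names, ports, and the service name are example values:

```csharp
using Jaeger;
using Microsoft.Extensions.Logging.Abstractions;
using OpenTracing;

public static class TracerSetupExample
{
    public static ITracer BuildTracer()
    {
        var loggerFactory = NullLoggerFactory.Instance;

        // Option 1: report to a local jaeger-agent sidecar over UDP
        // (6831 is the agent's default compact Thrift port).
        var agentSender = new Configuration.SenderConfiguration(loggerFactory)
            .WithAgentHost("localhost")
            .WithAgentPort(6831);

        // Option 2: push directly to the collector over HTTP (Thrift HttpSender).
        // Using the GrpcSender instead would require the Jaeger.Senders.Grpc
        // package and registering its sender factory (omitted here).
        var httpSender = new Configuration.SenderConfiguration(loggerFactory)
            .WithEndpoint("http://jaeger-collector:14268/api/traces");

        return new Configuration("my-service", loggerFactory)
            .WithReporter(new Configuration.ReporterConfiguration(loggerFactory)
                .WithSender(agentSender)) // or httpSender
            .GetTracer();
    }
}
```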