Closed nickbp closed 2 years ago
After fixing the panic, I'm now seeing an issue where the worker.CLose()
call blocks indefinitely and eventually jaeger-clickhouse stops logging any progress. I'm assuming that this is a deadlock somewhere in that CLose()
logic, which eventually halts all progress. I guessed that it might be due to a double lock on the mutex and tried fixing that in 935ff78
. But with that fix I'm still seeing the deadlock behavior, where the expected Existing batch finished
message is never produced.
At this point I'm assuming the indefinite block on CLose()
is due to an issue in the worker's internal sync/messaging handling. I'm thinking an easy way of resolving the situation would be to just discard new batches when the limit is being exceeded (while logging a warning/increasing a counter), rather than blocking on existing batches to finish. This then avoids the need for this special handling.
I simplified the max_span_count
handling, and am no longer seeing the earlier stalled behavior when the max was reached:
I've been running this for a few days now in a test environment where clickhouse is intentionally still backed up, and it's been holding stable.
Panic seen in
ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0
withlog-level=debug
:Also adds metric counter and logging to surface when things are hitting backpressure.
Signed-off-by: Nick Parker nparker@gitlab.com
Which problem is this PR solving?
Short description of the changes