jaegertracing / jaeger-clickhouse

Jaeger ClickHouse storage plugin implementation
Apache License 2.0

Fix panic when max_span_count is reached, add counter metric #104

Closed · nickbp closed this 2 years ago

nickbp commented 2 years ago

Panic seen in ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0 with log-level=debug:

panic: undefined type *clickhousespanstore.WriteWorker return from workerHeap

goroutine 20 [running]:
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).CleanWorkers(0xc00020c300, 0xc00008eefc)
    github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:95 +0x199
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).Work(0xc00020c300)
    github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:50 +0x15e
created by github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*SpanWriter).backgroundWriter
    github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/writer.go:89 +0x226
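For context, here is a minimal, self-contained sketch (illustrative only, not the plugin's actual pool.go; the writeWorker and workerHeap names are stand-ins) of how a type switch over the interface{} value returned by heap.Pop can hit a panicking default branch with exactly this kind of message when the popped element's type doesn't match the expected case:

```go
package main

import (
	"container/heap"
	"fmt"
)

// writeWorker stands in for the pool's worker type (hypothetical name).
type writeWorker struct{ id int }

// workerHeap is a minimal heap of workers, loosely mirroring the pool's internal heap.
type workerHeap []*writeWorker

func (h workerHeap) Len() int           { return len(h) }
func (h workerHeap) Less(i, j int) bool { return h[i].id < h[j].id }
func (h workerHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *workerHeap) Push(x interface{}) { *h = append(*h, x.(*writeWorker)) }
func (h *workerHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

func main() {
	h := &workerHeap{}
	heap.Push(h, &writeWorker{id: 1})

	// heap.Pop returns interface{}. If the consumer switches on a value type
	// while pointers were pushed, the default branch fires and panics.
	switch w := heap.Pop(h).(type) {
	case writeWorker: // mismatch: a *writeWorker was pushed
		fmt.Println("popped worker", w.id)
	default:
		panic(fmt.Sprintf("undefined type %T return from workerHeap", w))
	}
}
```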

Also adds a counter metric and logging to surface when writes are hitting backpressure.

Signed-off-by: Nick Parker nparker@gitlab.com


nickbp commented 2 years ago

After fixing the panic, I'm now seeing an issue where the worker.Close() call blocks indefinitely and eventually jaeger-clickhouse stops logging any progress. I'm assuming this is a deadlock somewhere in that Close() logic, which eventually halts all progress. I guessed it might be due to a double lock on the mutex and tried fixing that in 935ff78, but with that fix I'm still seeing the deadlock behavior, where the expected Existing batch finished message is never produced.
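As a sanity check on the double-lock theory: sync.Mutex is not reentrant in Go, so a Close() that takes the lock and then calls a helper that takes it again will block forever. A minimal illustration (hypothetical names, not the actual writer code):

```go
package main

import "sync"

type pool struct {
	mu sync.Mutex
}

// Close locks the mutex and then calls a helper that tries to lock it again.
// sync.Mutex is not reentrant, so the second Lock blocks forever: the classic
// double-lock deadlock suspected above.
func (p *pool) Close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.drain()
}

func (p *pool) drain() {
	p.mu.Lock() // blocks: the lock is already held by Close
	defer p.mu.Unlock()
}

func main() {
	(&pool{}).Close() // fatal error: all goroutines are asleep - deadlock!
}
```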

At this point I'm assuming the indefinite block on Close() is due to an issue in the worker's internal sync/messaging handling. An easy way to resolve this would be to simply discard new batches when the limit is exceeded (while logging a warning and incrementing a counter), rather than blocking until existing batches finish. That avoids the need for this special handling altogether.
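A rough sketch of that idea (hypothetical names, not necessarily how this PR ends up implementing it): track pending spans with an atomic counter, and when accepting a batch would push the total past max_span_count, drop the batch, log a warning, and bump a dropped-spans counter instead of blocking:

```go
package main

import (
	"log"
	"sync/atomic"
)

// batchLimiter sketches the "drop instead of block" idea: when the number of
// spans pending in workers would exceed maxSpanCount, the new batch is
// discarded and a counter is bumped, rather than waiting for older batches.
type batchLimiter struct {
	maxSpanCount int64
	pendingSpans int64 // spans currently queued in workers
	droppedSpans int64 // surfaced via a warning log / counter metric
}

// tryAccept reports whether a batch of batchSize spans may be queued.
func (l *batchLimiter) tryAccept(batchSize int) bool {
	n := int64(batchSize)
	if atomic.AddInt64(&l.pendingSpans, n) > l.maxSpanCount {
		atomic.AddInt64(&l.pendingSpans, -n)
		atomic.AddInt64(&l.droppedSpans, n)
		log.Printf("max_span_count %d exceeded, dropping batch of %d spans (dropped total: %d)",
			l.maxSpanCount, batchSize, atomic.LoadInt64(&l.droppedSpans))
		return false
	}
	return true
}

// release is called by a worker once its batch has been written.
func (l *batchLimiter) release(batchSize int) {
	atomic.AddInt64(&l.pendingSpans, -int64(batchSize))
}

func main() {
	l := &batchLimiter{maxSpanCount: 100}
	if l.tryAccept(80) {
		// hand the batch to a worker ...
		defer l.release(80)
	}
	_ = l.tryAccept(50) // would exceed 100: dropped and logged instead of blocking
}
```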

nickbp commented 2 years ago

I simplified the max_span_count handling, and am no longer seeing the earlier stalled behavior when the max was reached:

nickbp commented 2 years ago

I've been running this for a few days now in a test environment where ClickHouse is intentionally still backed up, and it's been holding stable.