jaegertracing / jaeger-clickhouse

Jaeger ClickHouse storage plugin implementation
Apache License 2.0

Fix panic when max_span_count is reached, add counter metric #104

Closed · nickbp closed this 2 years ago

nickbp commented 2 years ago

Panic seen in ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0 with log-level=debug:

panic: undefined type *clickhousespanstore.WriteWorker return from workerHeap

goroutine 20 [running]:
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).CleanWorkers(0xc00020c300, 0xc00008eefc)
    github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:95 +0x199
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).Work(0xc00020c300)
    github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:50 +0x15e
created by github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*SpanWriter).backgroundWriter
    github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/writer.go:89 +0x226
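For context, here is a minimal, self-contained sketch (illustrative only, not the plugin's actual pool.go; the writeWorker and workerHeap names are stand-ins) of how a type switch over the interface{} value returned by heap.Pop can hit a panicking default branch with exactly this kind of message when the popped element's type doesn't match the expected case:

```go
package main

import (
	"container/heap"
	"fmt"
)

// writeWorker stands in for the pool's worker type (hypothetical name).
type writeWorker struct{ id int }

// workerHeap is a minimal heap of workers, loosely mirroring the pool's internal heap.
type workerHeap []*writeWorker

func (h workerHeap) Len() int           { return len(h) }
func (h workerHeap) Less(i, j int) bool { return h[i].id < h[j].id }
func (h workerHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *workerHeap) Push(x interface{}) { *h = append(*h, x.(*writeWorker)) }
func (h *workerHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

func main() {
	h := &workerHeap{}
	heap.Push(h, &writeWorker{id: 1})

	// heap.Pop returns interface{}. If the consumer switches on a value type
	// while pointers were pushed, the default branch fires and panics.
	switch w := heap.Pop(h).(type) {
	case writeWorker: // mismatch: a *writeWorker was pushed
		fmt.Println("popped worker", w.id)
	default:
		panic(fmt.Sprintf("undefined type %T return from workerHeap", w))
	}
}
```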

Also adds a counter metric and logging to surface when writes are hitting backpressure.

Signed-off-by: Nick Parker nparker@gitlab.com


nickbp commented 2 years ago

After fixing the panic, I'm now seeing an issue where the worker.Close() call blocks indefinitely and eventually jaeger-clickhouse stops logging any progress. I'm assuming this is a deadlock somewhere in that Close() logic, which eventually halts all progress. I guessed it might be due to a double lock on the mutex and tried fixing that in 935ff78, but with that fix I'm still seeing the deadlock behavior, where the expected Existing batch finished message is never produced.
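As a sanity check on the double-lock theory: sync.Mutex is not reentrant in Go, so a Close() that takes the lock and then calls a helper that takes it again will block forever. A minimal illustration (hypothetical names, not the actual writer code):

```go
package main

import "sync"

type pool struct {
	mu sync.Mutex
}

// Close locks the mutex and then calls a helper that tries to lock it again.
// sync.Mutex is not reentrant, so the second Lock blocks forever: the classic
// double-lock deadlock suspected above.
func (p *pool) Close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.drain()
}

func (p *pool) drain() {
	p.mu.Lock() // blocks: the lock is already held by Close
	defer p.mu.Unlock()
}

func main() {
	(&pool{}).Close() // fatal error: all goroutines are asleep - deadlock!
}
```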

At this point I'm assuming the indefinite block on Close() is due to an issue in the worker's internal sync/messaging handling. An easy way to resolve this would be to simply discard new batches when the limit is exceeded (while logging a warning and incrementing a counter), rather than blocking until existing batches finish. That avoids the need for this special handling altogether.
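A rough sketch of that idea (hypothetical names, not necessarily how this PR ends up implementing it): track pending spans with an atomic counter, and when accepting a batch would push the total past max_span_count, drop the batch, log a warning, and bump a dropped-spans counter instead of blocking:

```go
package main

import (
	"log"
	"sync/atomic"
)

// batchLimiter sketches the "drop instead of block" idea: when the number of
// spans pending in workers would exceed maxSpanCount, the new batch is
// discarded and a counter is bumped, rather than waiting for older batches.
type batchLimiter struct {
	maxSpanCount int64
	pendingSpans int64 // spans currently queued in workers
	droppedSpans int64 // surfaced via a warning log / counter metric
}

// tryAccept reports whether a batch of batchSize spans may be queued.
func (l *batchLimiter) tryAccept(batchSize int) bool {
	n := int64(batchSize)
	if atomic.AddInt64(&l.pendingSpans, n) > l.maxSpanCount {
		atomic.AddInt64(&l.pendingSpans, -n)
		atomic.AddInt64(&l.droppedSpans, n)
		log.Printf("max_span_count %d exceeded, dropping batch of %d spans (dropped total: %d)",
			l.maxSpanCount, batchSize, atomic.LoadInt64(&l.droppedSpans))
		return false
	}
	return true
}

// release is called by a worker once its batch has been written.
func (l *batchLimiter) release(batchSize int) {
	atomic.AddInt64(&l.pendingSpans, -int64(batchSize))
}

func main() {
	l := &batchLimiter{maxSpanCount: 100}
	if l.tryAccept(80) {
		// hand the batch to a worker ...
		defer l.release(80)
	}
	_ = l.tryAccept(50) // would exceed 100: dropped and logged instead of blocking
}
```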

nickbp commented 2 years ago

I simplified the max_span_count handling, and am no longer seeing the earlier stalled behavior when the max was reached:

nickbp commented 2 years ago

I've been running this for a few days now in a test environment where ClickHouse is intentionally still backed up, and it's been holding stable.