The InMemoryCollector can crash somehow and then refinery's just not working right

honeycombio / refinery

Refinery is a trace-aware tail-based sampling proxy. It examines whole traces and intelligently applies sampling decisions (whether to keep or discard) to each trace.

The InMemoryCollector can crash somehow and then refinery's just not working right #1348

Open kentquirk opened 3 weeks ago

kentquirk commented 3 weeks ago

I've seen a customer say that this happened to them.

In my case, I've got some code changes that are crashing InMemoryCollector; the following log message came out:

ERRO[0005] IsAlive: subsystem dead due to timeout        subsystem=collector

If this happens, the system is dead and won't come back to life. It will also hang forever, because it crashed while holding a lock.

I've addressed the shutdown problem with #1347 but we should figure out whether there's a lock issue in collector.
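To illustrate the general failure mode of crashing while holding a lock, here is a minimal Go sketch. It is not refinery's collector code: the `guarded` type, its two update methods, and `lockedAfterPanic` are hypothetical names made up for this example. It shows that a `sync.Mutex` released with a manual `Unlock` stays locked if the guarded code panics, while a deferred `Unlock` still runs as the panic unwinds.

```go
package main

import (
	"fmt"
	"sync"
)

// guarded is a hypothetical stand-in for state protected by a mutex.
type guarded struct {
	mu sync.Mutex
	n  int
}

// updateUnsafe unlocks manually, so a panic inside work leaves mu locked.
func (g *guarded) updateUnsafe(work func()) {
	g.mu.Lock()
	work()
	g.n++
	g.mu.Unlock()
}

// updateSafe defers the unlock, so it runs even while a panic unwinds.
func (g *guarded) updateSafe(work func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	work()
	g.n++
}

// lockedAfterPanic calls update with a panicking work function, recovers
// from the panic, and reports whether mu is still held afterwards.
func lockedAfterPanic(update func(func()), mu *sync.Mutex) bool {
	func() {
		defer func() { _ = recover() }()
		update(func() { panic("boom") })
	}()
	if mu.TryLock() { // TryLock requires Go 1.18 or later
		mu.Unlock()
		return false
	}
	return true
}

func main() {
	a, b := &guarded{}, &guarded{}
	fmt.Println("manual unlock, still locked:", lockedAfterPanic(a.updateUnsafe, &a.mu))
	fmt.Println("deferred unlock, still locked:", lockedAfterPanic(b.updateSafe, &b.mu))
}
```

Any goroutine that later calls `Lock` on the mutex left held by `updateUnsafe` would block indefinitely, which is one way a process can hang instead of exiting. Whether this is what happens in the collector is exactly the open question here.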

VinozzZ commented 3 weeks ago

Is it possible that the timeout is actually due to a large amount of span redistribution during a peer membership change? The liveness check is currently set to 3s in the InMemoryCollector. If there's a giant trace, or lots of traces that need to be redistributed to other peers, the redistributeTraces function may get stuck doing the work before the collect() loop can report to the health check again. Maybe we need to give the health check a longer timeout?