Open kentquirk opened 3 weeks ago
Could the timeout actually be caused by a large amount of span redistribution during a peer membership change?
The liveness check is currently set to 3s in the InMemoryCollector. If there's a giant trace, or if lots of traces need to be redistributed to other peers, the redistributeTraces function may get stuck doing that work before the collect() loop can report to the health check again. Maybe we need to give the health check a longer timeout?
I've seen a customer say that this happened to them.
In my case, I've got some code changes that are crashing InMemoryCollector; the following log message came out:
If this happens, the system is dead and won't come back to life. It also will hang forever because it crashed while holding a lock.
I've addressed the shutdown problem with #1347, but we should figure out whether there's a lock issue in the collector.
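For anyone looking into the lock question, this is a small Go sketch of how a panic can leave a mutex held. It is an illustration only, not the collector's code: `store`, `updateUnsafe`, `updateSafe`, and `stillLockedAfterPanic` are hypothetical names, and I'm assuming the panic is recovered somewhere up the stack so the process keeps running. It uses `sync.Mutex.TryLock`, which needs Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical structure guarded by a mutex.
type store struct {
	mu sync.Mutex
	n  int
}

// updateUnsafe unlocks manually, so a panic in fn leaves mu locked.
func (s *store) updateUnsafe(fn func()) {
	s.mu.Lock()
	fn()
	s.n++
	s.mu.Unlock()
}

// updateSafe defers the unlock, so mu is released even if fn panics.
func (s *store) updateSafe(fn func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	fn()
	s.n++
}

// stillLockedAfterPanic calls update with a panicking callback, recovers the
// panic, and reports whether the mutex was left held.
func stillLockedAfterPanic(s *store, update func(func())) bool {
	func() {
		defer func() { _ = recover() }()
		update(func() { panic("simulated crash") })
	}()
	if s.mu.TryLock() {
		s.mu.Unlock()
		return false
	}
	return true
}

func main() {
	a, b := &store{}, &store{}
	fmt.Println(stillLockedAfterPanic(a, a.updateUnsafe)) // true
	fmt.Println(stillLockedAfterPanic(b, b.updateSafe))   // false
}
```

If the collector has any path that takes a lock without a deferred unlock, a later Lock call on the same mutex (during shutdown, for example) would block forever in the first case.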