Closed esatterwhite closed 3 years ago
Hi all! Any idea when this fix will be merged?
I'm still seeing a situation where it stops flushing after some time. I think an error is being swallowed in a thread. Digging.
@matt-march @jakedipity I think I tracked down the deadlock. There was an error case that would result in a lock not being released and eventually all threads would be locked.
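For anyone following along, here is a minimal sketch of the failure mode described above, not the library's actual code. The names `lock`, `flush_unsafe`, and `flush_safe` are hypothetical; the point is only that an exception raised between `acquire()` and `release()` leaves the lock held, while a `with` block releases it on every exit path.

```python
import threading

# Hypothetical names for illustration; not taken from the library.
lock = threading.Lock()


def flush_unsafe(send):
    """Acquire/release by hand: an exception in send() skips release()."""
    lock.acquire()
    send()            # if this raises, the lock stays held forever
    lock.release()


def flush_safe(send):
    """The context manager releases the lock even when send() raises."""
    with lock:
        send()
```

With the unsafe variant, the next thread that calls `lock.acquire()` blocks indefinitely, which matches the "eventually all threads would be locked" symptom.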
I was able to log successfully to the dev environment for several hours. Previously, it would lock up after a few minutes. Please take a good look.
Also, because the flush loop is really just a one-shot timeout that is manually restarted, I am pretty sure there was a situation where it would try to schedule some work on a thread, and enough time was spent in a context switch with an error that the timer would fire and go unnoticed, so none of the code paths would know to restart it.
Also resulting in "nothing happening".
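A rough sketch of the timer issue, assuming the loop is built on something like `threading.Timer`. The `FlushLoop` class and its attributes are made up for illustration. Rescheduling in a `finally` block means a failing flush cannot leave the loop without a pending timer.

```python
import threading


class FlushLoop:
    """Hypothetical one-shot timer loop that always re-arms itself."""

    def __init__(self, flush, interval=0.25):
        self._flush = flush          # callable that does the actual send
        self._interval = interval    # seconds between flush attempts
        self._timer = None
        self._stopped = False
        self.errors = []             # errors are recorded, not swallowed

    def start(self):
        self._stopped = False
        self._schedule()

    def stop(self):
        self._stopped = True
        if self._timer is not None:
            self._timer.cancel()

    def _schedule(self):
        if self._stopped:
            return
        # threading.Timer fires once, so every tick must create a new one.
        self._timer = threading.Timer(self._interval, self._tick)
        self._timer.daemon = True
        self._timer.start()

    def _tick(self):
        try:
            self._flush()
        except Exception as exc:
            self.errors.append(exc)
        finally:
            # Re-arm on every path, including the error path.
            self._schedule()
```

This sketch ignores the race between `stop()` and a tick that is already running; a real implementation would need to guard that.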
In the case that work cannot be scheduled on the worker pool, logs are pushed onto the `secondary` list, and the check before a send doesn't account for any logs that may be pending in the `secondary` list. This can happen in the case of a runtime error when attempting to submit work to one of the thread pools, effectively leaving logs unsent.
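As a rough illustration of the intended behavior (a sketch with hypothetical names such as `BufferedSender`, `buf`, and `has_pending`, not the actual patch), the pre-send check can look at both lists, and a failed submit can park the batch on `secondary` so the next flush retries it. It relies on `ThreadPoolExecutor.submit` raising `RuntimeError` once the pool has been shut down.

```python
from concurrent.futures import ThreadPoolExecutor


class BufferedSender:
    """Hypothetical buffer with a secondary list for unscheduled batches."""

    def __init__(self, pool, send):
        self.pool = pool        # worker pool used to run send()
        self.send = send        # callable that ships a batch of logs
        self.buf = []           # primary buffer of pending log lines
        self.secondary = []     # lines whose submit to the pool failed

    def has_pending(self):
        # Check both lists; looking at buf alone strands secondary lines.
        return bool(self.buf or self.secondary)

    def flush(self):
        if not self.has_pending():
            return None
        # Older, previously failed lines go first to preserve ordering.
        batch = self.secondary + self.buf
        self.buf = []
        self.secondary = []
        try:
            return self.pool.submit(self.send, batch)
        except RuntimeError:
            # The pool refused the work; keep the batch for the next flush.
            self.secondary = batch
            return None
```

The sketch is single-threaded for brevity; in a real logger, access to both lists would need to happen under a lock.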
fixes: #74