Closed GuptaManan100 closed 1 week ago
I looked at the other engines that use the schema engine, and it turns out the health streamer has a similar deadlock -
Operations for close -
1. Acquire the mu mutex from messager engine.
2. Call UnregisterNotifier, which acquires the notifierMu mutex.

The order of operations during a broadcast call from the schema engine is as follows -
1. Acquire the notifierMu mutex.
2. Call the reload method.
3. Acquire the mu mutex lock.

I'm glad you found this. There have been some deadlock issues in the past that were hard to diagnose. I wonder if this was the underlying cause, and the other changes mitigated the issue without solving the root concern. We haven't run into this internally that I'm aware of.
I'm curious, were you doing something related to messaging that caused you to find it, or was this a side effect of other work? I'm always curious how much usage messaging is getting.
I don't know a lot of details, but I was investigating a failure that caused DemotePrimary to be indefinitely stuck, and then I found that we had this deadlock! The messager was stuck in Close waiting for the mutex, and then I realized that an outstanding Broadcast call was holding it.
Overview of the Issue
It was noticed that there is a deadlock in the messager engine code. When we Close the messager engine, the order of operations is as follows -
1. Acquire the mu mutex from messager engine.
2. Call UnregisterNotifier, which acquires the notifierMu mutex.

The order of operations during a broadcast call from the schema engine is as follows -
1. Acquire the notifierMu mutex.
2. Call the schemaChanged method.
3. Acquire the mu mutex lock.

From the order of operations it is clear that we can reach a deadlock if two goroutines, each running one of the sequences above, both acquire their first lock. Each will then fail to acquire its second lock and will wait indefinitely.
This can cause DemotePrimary to block, as the messager engine Close() is a synchronous call in that flow.
Reproduction Steps
This is very hard to reproduce in an e2e fashion, but can be observed manually by looking at the code, and trying to call schemaChanged and Close in parallel.
Binary Version
Operating System and Environment details
Log Fragments
No response