vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Deadlock in messager engine #17229

Closed GuptaManan100 closed 1 week ago

GuptaManan100 commented 1 week ago

Overview of the Issue

It was noticed that there is a deadlock in the messager engine code. When we Close the messager engine, the order of operations is as follows -

  1. Acquire mu mutex from messager engine.
  2. Call UnregisterNotifier, which acquires the notifierMu mutex.

The order of operations during a broadcast call from the schema engine is as follows -

  1. We acquire the notifierMu mutex.
  2. Go over all the notifier functions and call them. In the case of the messager engine, this is the schemaChanged method.
  3. This method acquires the mu mutex.

From these two orders of operations it is clear that we can reach a deadlock: if two goroutines, each running one of the sequences above, acquire their respective first locks, they will both fail to acquire the second lock and wait indefinitely.

This can cause DemotePrimary to block, since the messager engine's Close() is a synchronous call in that flow.
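
To make the inversion concrete, here is a minimal, self-contained Go sketch of the same two lock orderings. It uses placeholder mutexes and stand-in functions rather than the actual Vitess types, with sleeps added only to widen the race window for the demo:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Placeholder locks standing in for the messager engine's mu and the
// schema engine's notifierMu. In Vitess these live on different structs;
// everything here is an illustrative stand-in, not the real code.
var (
	mu         sync.Mutex
	notifierMu sync.Mutex
)

// closeEngine mirrors the Close path: take mu, then call
// UnregisterNotifier, which takes notifierMu.
func closeEngine() {
	mu.Lock()
	defer mu.Unlock()
	time.Sleep(100 * time.Millisecond) // widen the race window for the demo
	notifierMu.Lock()                  // UnregisterNotifier
	defer notifierMu.Unlock()
}

// broadcast mirrors the schema engine's broadcast path: take notifierMu,
// then call the registered notifier (schemaChanged), which takes mu.
func broadcast() {
	notifierMu.Lock()
	defer notifierMu.Unlock()
	time.Sleep(100 * time.Millisecond) // widen the race window for the demo
	mu.Lock()                          // schemaChanged
	defer mu.Unlock()
}

func main() {
	done := make(chan struct{})
	go func() {
		var wg sync.WaitGroup
		wg.Add(2)
		go func() { defer wg.Done(); closeEngine() }()
		go func() { defer wg.Done(); broadcast() }()
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("no deadlock this run; the interleaving is timing-dependent")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: each goroutine holds one lock and waits for the other")
	}
}
```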

Reproduction Steps

This is very hard to reproduce in an e2e fashion, but it can be observed manually by looking at the code and trying to call schemaChanged and Close in parallel.

Binary Version

main

Operating System and Environment details

-

Log Fragments

No response

GuptaManan100 commented 1 week ago

I looked at the other engines that use the schema engine, and it turns out the health streamer has a similar deadlock -

The order of operations for Close is as follows -

  1. Acquire the mu mutex from the health streamer.
  2. Call UnregisterNotifier, which acquires the notifierMu mutex.

The order of operations during a broadcast call from the schema engine is as follows -

  1. We acquire the notifierMu mutex.
  2. Go over all the notifier functions and call them. In the case of the health streamer, this is the reload method.
  3. This method acquires the mu mutex.
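
For reference, one generic way to break a lock-order inversion like this is to avoid holding notifierMu across the notifier callbacks. The sketch below uses illustrative names only; it is not the Vitess code and not necessarily the fix that will land here:

```go
package main

import "sync"

// Illustrative sketch only: a broadcaster that never holds notifierMu
// while invoking notifiers, so a notifier that takes its engine's mu
// cannot form a cycle with a Close that holds mu and then unregisters.
type broadcaster struct {
	notifierMu sync.Mutex
	notifiers  []func()
}

func (b *broadcaster) broadcast() {
	// Snapshot the callbacks under notifierMu...
	b.notifierMu.Lock()
	snapshot := make([]func(), len(b.notifiers))
	copy(snapshot, b.notifiers)
	b.notifierMu.Unlock()

	// ...and call them only after releasing it, so the
	// notifierMu -> mu ordering from the bug never happens here.
	for _, f := range snapshot {
		f()
	}
}

func main() {
	b := &broadcaster{notifiers: []func(){func() {}}}
	b.broadcast()
}
```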

derekperkins commented 1 week ago

I'm glad you found this. There have been some deadlock issues in the past that were hard to diagnose. I wonder if this was the underlying cause, and the other changes mitigated the issue without solving the root concern. We haven't run into this internally that I'm aware of.

I'm curious, were you doing something related to messaging that caused you to find it, or was this a side effect of other work? I'm always curious how much usage messaging is getting.

GuptaManan100 commented 1 week ago

I don't know a lot of details, but I was investigating a failure that caused DemotePrimary to be indefinitely stuck, and that is how I found this deadlock! The messager was stuck in Close waiting for the mutex, and then I realized that an outstanding Broadcast call was holding it.