lightningdevkit / rust-lightning

A highly modular Bitcoin Lightning library written in Rust. It's rust-lightning, not Rusty's Lightning!

Deadlock when using async monitor persistence #2000

Closed danielgranhao closed 1 year ago

danielgranhao commented 1 year ago

I think my team and I stumbled upon a deadlock bug in LDK. It goes like this:

  1. We call ChainMonitor::channel_monitor_updated()
  2. ChannelManager::get_and_clear_pending_msg_events() eventually gets called, takes a read lock on total_consistency_lock and calls process_pending_monitor_events()
  3. One of the pending monitor events is MonitorEvent::Completed, so ChannelManager::channel_monitor_updated() is called, which also takes a read lock on total_consistency_lock

If, between the two read locks in steps 2 and 3, another concurrent task tries to take a write lock, a deadlock can occur, depending on the OS's lock queuing policy. On my machine (macOS) I never experienced this, but on Linux machines we get random hangs. The writer is likely the BackgroundProcessor calling persist_manager, which takes a write lock on total_consistency_lock inside ChannelManager's write() method.
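To make the failure mode concrete, here is a minimal standalone sketch (not LDK code; all names here are made up) of the read-write-read interleaving from steps 2 and 3. On a writer-preferring RwLock implementation, such as Rust's std lock on Linux, it can hang; on macOS it typically completes:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for total_consistency_lock.
    let lock = Arc::new(RwLock::new(()));

    let reader = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let _first = lock.read().unwrap(); // step 2: first read lock
            thread::sleep(Duration::from_millis(100)); // give the writer time to queue
            // Step 3: second read lock on the same thread. If the pending
            // writer blocks new readers, this never returns.
            let _second = lock.read().unwrap();
            println!("reader done");
        })
    };

    let writer = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            // Stand-in for persist_manager taking the write lock.
            let _w = lock.write().unwrap();
            println!("writer done");
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
}
```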

andrei-21 commented 1 year ago

Stack traces of the threads are attached. Thread 22 deadlocks on total_consistency_lock and blocks the other threads. gdb.txt

TheBlueMatt commented 1 year ago

Ah! Damn, yea, std's RwLock behavior is platform-dependent. It looks like in this case a pending writer blocks new readers, which would cause an issue. Will think on whether we should explicitly exempt same-thread readers in our own wrapper or ban duplicate reads.
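For reference, here is a rough sketch of the second option ("ban duplicate reads"): a wrapper with a debug-only thread-local read-depth counter that panics on a recursive read in tests instead of hanging in production. This is purely illustrative, not LDK's actual lock wrapper, and for simplicity the counter is shared across all locks of this type rather than keyed per instance:

```rust
use std::cell::Cell;
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard};

thread_local! {
    // Read-lock depth for the current thread (shared across all
    // CheckedRwLock instances in this simplified sketch).
    static READ_DEPTH: Cell<usize> = Cell::new(0);
}

pub struct CheckedRwLock<T> {
    inner: RwLock<T>,
}

pub struct CheckedReadGuard<'a, T> {
    guard: RwLockReadGuard<'a, T>,
}

impl<T> CheckedRwLock<T> {
    pub fn new(t: T) -> Self {
        Self { inner: RwLock::new(t) }
    }

    pub fn read(&self) -> CheckedReadGuard<'_, T> {
        READ_DEPTH.with(|depth| {
            // A second read on the same thread can deadlock if a writer is
            // queued, so catch it loudly in debug builds.
            debug_assert_eq!(depth.get(), 0, "recursive read would risk deadlock");
            depth.set(depth.get() + 1);
        });
        CheckedReadGuard { guard: self.inner.read().unwrap() }
    }
}

impl<T> Drop for CheckedReadGuard<'_, T> {
    fn drop(&mut self) {
        READ_DEPTH.with(|depth| depth.set(depth.get() - 1));
    }
}

impl<T> Deref for CheckedReadGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.guard
    }
}
```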

andrei-21 commented 1 year ago

Or PersistenceNotifierGuard can drop _read_guard explicitly before calling persistence_notifier.notify().
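Roughly like this (a hypothetical sketch with a trivial stand-in for the notifier, not LDK's actual type): the guard holds its read guard in an Option and releases it in Drop before notifying.

```rust
use std::sync::{Condvar, Mutex, RwLockReadGuard};

// Minimal stand-in for the persistence notifier.
struct Notifier {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl Notifier {
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_all();
    }
}

struct PersistenceNotifierGuardSketch<'a> {
    read_guard: Option<RwLockReadGuard<'a, ()>>,
    notifier: &'a Notifier,
}

impl Drop for PersistenceNotifierGuardSketch<'_> {
    fn drop(&mut self) {
        // Release the read lock first so notify() never runs under it.
        self.read_guard.take();
        self.notifier.notify();
    }
}
```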

TheBlueMatt commented 1 year ago

That wouldn't address this issue. The issue here is that, on one platform, a thread requesting the write lock on total_consistency_lock will block additional threads from taking a read lock. Because it's not a reentrant lock, the following sequence blocks indefinitely: thread (a) holds the read lock, thread (b) tries to acquire the write lock, then thread (a) tries to take a second read lock. (b) can't wake because (a) is holding a read lock, but (a) can't make further progress because there's a thread waiting on the write lock.

andrei-21 commented 1 year ago

Totally agree with what you said, but I mean something else. PersistenceNotifierGuard calls persistence_notifier.notify() while holding the read lock via _read_guard. In this instance, channel_monitor_updated() constructs a PersistenceNotifierGuard (acquiring the read lock) and passes self.persistence_notifier, which also wants to acquire the read lock, so the deadlock happens. The problem with PersistenceNotifierGuard is that it calls a function it knows nothing about while holding the read lock, which is a design issue in my opinion. Using a reentrant lock would solve the deadlock, but would not solve the design issue.

TheBlueMatt commented 1 year ago

PersistenceNotifierGuard is an RAII lock; this is (at least relatively) clear from the name - Guard - so it's ultimately up to the caller to avoid deadlocks, as with any RAII lock. PersistenceNotifierGuard doesn't call any methods it doesn't know anything about: notify is super well-defined and will not block, so calling it while holding a lock is totally fine (and also accepted practice), and it's not an issue at all.

andrei-21 commented 1 year ago

PersistenceNotifierGuard::optionally_notify() takes persist_check, which can be any function. Calling an unknown function while holding the lock is a dangerous idea; that is what I mean. And this design issue manifests as a bug when get_and_clear_pending_msg_events() passes a big lambda containing process_pending_monitor_events(), which tries to acquire a read lock on total_consistency_lock.
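A stripped-down sketch of the pattern being criticized (simplified signatures, not LDK's actual API): the constructor takes the read lock and then runs the caller-supplied persist_check while holding it.

```rust
use std::sync::{RwLock, RwLockReadGuard};

struct NotifierGuardSketch<'a> {
    _read_guard: RwLockReadGuard<'a, ()>,
}

fn optionally_notify<'a, F: FnOnce() -> bool>(
    lock: &'a RwLock<()>,
    persist_check: F,
) -> NotifierGuardSketch<'a> {
    let read_guard = lock.read().unwrap();
    // persist_check runs with the read lock held. If it (or anything it
    // calls) takes lock.read() again while a writer is queued, we get the
    // deadlock described in this issue.
    if persist_check() {
        // ...wake the background persister here...
    }
    NotifierGuardSketch { _read_guard: read_guard }
}
```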

TheBlueMatt commented 1 year ago

The design there is a bit weird, but it's not an "unknown function" by any means - the function, as used everywhere, is just an inline closure that exists so that we can take an RAII lock but decide during the body, rather than before we lock, whether to notify event handlers. If you look at its callsites, it all looks (and acts) exactly like a simple RAII lock, even if the (internal-only) API itself could theoretically be used in some other way.
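In other words, a typical callsite of the sketch above would read like a plain RAII lock acquisition, with the closure only deciding whether to notify:

```rust
fn callsite_example(lock: &std::sync::RwLock<()>) {
    // Reads like an ordinary scoped lock; the inline closure just decides,
    // during the body, whether waiters should be woken.
    let _guard = optionally_notify(lock, || {
        true // e.g. "something changed, persist and notify"
    });
    // ...do work under the read lock...
} // _guard drops here, releasing the lock
```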