danielgranhao closed this issue 1 year ago
Stacktraces of threads. Thread 22 gets deadlocked on `total_consistency_lock` and blocks other threads.

gdb.txt
Ah! Damn, yea, std's `RwLock` behavior is platform-dependent. It looks like in this case a pending writer blocks new readers, which would cause an issue. Will think on whether we should explicitly exempt same-thread readers in our own wrapper or ban duplicate reads.
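For the banning-duplicate-reads option, a minimal sketch of what a debug wrapper could look like (`DebugRwLock`, `DebugReadGuard`, and `READ_DEPTH` are hypothetical names here, and a real wrapper would track the depth per lock instance rather than with one thread-local counter):

```rust
use std::cell::Cell;
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard};

thread_local! {
    // Number of read guards of *any* DebugRwLock held by this thread.
    // Simplified: a real wrapper would track per lock instance.
    static READ_DEPTH: Cell<usize> = Cell::new(0);
}

pub struct DebugRwLock<T>(RwLock<T>);

pub struct DebugReadGuard<'a, T>(RwLockReadGuard<'a, T>);

impl<T> DebugRwLock<T> {
    pub fn new(t: T) -> Self { DebugRwLock(RwLock::new(t)) }

    pub fn read(&self) -> DebugReadGuard<'_, T> {
        READ_DEPTH.with(|d| {
            // A second same-thread read can deadlock behind a pending
            // writer on some platforms, so fail fast in debug builds.
            debug_assert_eq!(d.get(), 0, "duplicate read lock on one thread");
            d.set(d.get() + 1);
        });
        DebugReadGuard(self.0.read().unwrap())
    }
}

impl<T> Deref for DebugReadGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T { &self.0 }
}

impl<T> Drop for DebugReadGuard<'_, T> {
    fn drop(&mut self) {
        READ_DEPTH.with(|d| d.set(d.get() - 1));
    }
}
```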
Or `PersistenceNotifierGuard` can drop `_read_guard` explicitly before calling `persistence_notifier.notify()`.
That wouldn't address this issue. The issue here is that, on one platform, a thread which is requesting the write lock for the `total_consistency_lock` will block additional threads from taking a read lock. Because it's not a reentrant lock, if thread (a) is holding the read lock, thread (b) then tries to acquire the write lock, and thread (a) then tries to take a second read lock, that second read blocks indefinitely. (b) can't wake because (a) is holding a read lock, but (a) can't make further progress because there's a thread waiting on the write lock.
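A minimal, self-contained repro of that interleaving with `std::sync::RwLock` - whether it actually hangs depends on the platform's queuing policy, which is exactly the point:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(()));

    let a = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let _first = lock.read().unwrap(); // (a) takes the first read lock
            thread::sleep(Duration::from_millis(100)); // let (b)'s writer queue up
            // On platforms where a pending writer blocks new readers, this
            // second read deadlocks: (b) waits on (a)'s first read guard,
            // and (a) waits behind (b)'s queued write request.
            let _second = lock.read().unwrap();
            println!("no deadlock on this platform");
        })
    };

    let b = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            let _write = lock.write().unwrap(); // (b) requests the write lock
        })
    };

    a.join().unwrap();
    b.join().unwrap();
}
```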
Totally agree with what you said, but I mean another thing. `PersistenceNotifierGuard` calls `persistence_notifier.notify()` while holding the read lock via `_read_guard`. In this instance `channel_monitor_updated()` constructs a `PersistenceNotifierGuard` (acquires the read lock) and passes `self.persistence_notifier`, which also wants to acquire the read lock. So this deadlock happens.
The problem with `PersistenceNotifierGuard` is that it calls a function it does not know about while holding the read lock, which is a design issue in my opinion. Using a reentrant lock would solve the deadlock, but would not solve the design issue.
`PersistenceNotifierGuard` is an RAII lock; this is (at least relatively) clear from the name - `Guard` - so it's ultimately up to the caller to avoid deadlocks, as with any RAII lock. `PersistenceNotifierGuard` doesn't call any methods which it doesn't know anything about: `notify` is super well-defined and will not block, so doing it while holding a lock is totally fine (and also accepted practice). It's also not an issue at all.
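For illustration, a stripped-down sketch of the pattern (hypothetical simplified types, not the actual implementation): the guard pins the read lock for its scope and fires a non-blocking `notify()` from `drop()`:

```rust
use std::sync::{Condvar, Mutex, RwLock, RwLockReadGuard};

// Hypothetical stand-in for the notifier: setting a flag and waking
// waiters never blocks, so it is safe to call under the read lock.
struct Notifier {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl Notifier {
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_all();
    }
}

// RAII pattern: hold the consistency read lock for the guard's lifetime,
// notify the persister when the guard is dropped.
struct PersistenceNotifierGuard<'a> {
    notifier: &'a Notifier,
    _read_guard: RwLockReadGuard<'a, ()>,
}

impl<'a> PersistenceNotifierGuard<'a> {
    fn notify_on_drop(lock: &'a RwLock<()>, notifier: &'a Notifier) -> Self {
        PersistenceNotifierGuard { notifier, _read_guard: lock.read().unwrap() }
    }
}

impl Drop for PersistenceNotifierGuard<'_> {
    fn drop(&mut self) {
        // Non-blocking, so calling it while _read_guard is still held is fine.
        self.notifier.notify();
    }
}
```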
`PersistenceNotifierGuard::optionally_notify()` takes a `persist_check` which can be any function. Calling an unknown function while holding the lock is a dangerous idea, that is what I mean. And this design issue manifests itself as a bug when `get_and_clear_pending_msg_events()` passes a big lambda that runs `process_pending_monitor_events()`, which tries to acquire a lock on `total_consistency_lock`.
The design there is a bit weird, but it's not an "unknown function" by any means - the function, as used everywhere, is just an inline closure that exists so that we can take an RAII lock but decide during the body whether to notify event handlers, rather than before we lock. If you look at its callsites it all looks (and acts) exactly like a simple RAII lock, even if the (internal-only) API itself could theoretically be used some other way.
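Concretely, the shape is roughly this (a simplified stand-in reusing the hypothetical `Notifier` and guard layout from the sketch above, not the actual code): the closure runs under the read lock purely to decide whether `notify()` fires on drop, so any closure that re-takes `total_consistency_lock` recreates the reader/writer/reader hazard described earlier.

```rust
use std::sync::{RwLock, RwLockReadGuard};

// All types below are simplified stand-ins for illustration only.
enum NotifyOption { DoPersist, SkipPersist }

struct PersistenceNotifierGuard<'a> {
    notifier: &'a Notifier,
    should_persist: bool,
    _read_guard: RwLockReadGuard<'a, ()>,
}

impl<'a> PersistenceNotifierGuard<'a> {
    fn optionally_notify<F: FnOnce() -> NotifyOption>(
        lock: &'a RwLock<()>, notifier: &'a Notifier, persist_check: F,
    ) -> Self {
        let read_guard = lock.read().unwrap();
        // persist_check runs while the read lock is held. If it (or anything
        // it calls, e.g. process_pending_monitor_events) takes another read
        // of the same lock, a queued writer can wedge both threads.
        let should_persist = matches!(persist_check(), NotifyOption::DoPersist);
        PersistenceNotifierGuard { notifier, should_persist, _read_guard: read_guard }
    }
}

impl Drop for PersistenceNotifierGuard<'_> {
    fn drop(&mut self) {
        if self.should_persist { self.notifier.notify(); }
    }
}
```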
I think my team and I stumbled upon a deadlock bug in LDK. It goes like this:

1. `ChainMonitor::channel_monitor_updated()` is called
2. `ChannelManager::get_and_clear_pending_msg_events()` eventually gets called, takes a read lock on `total_consistency_lock`, and calls `process_pending_monitor_events()`
3. There is a `MonitorEvent::Completed`, so `ChannelManager::channel_monitor_updated()` is called, which also takes a read lock on `total_consistency_lock`

If between the 2 read locks in steps 2. and 3. another concurrent task tries to get a write lock, a deadlock can occur, depending on the queuing policy of the OS. On my machine (macOS) I never experienced this, but on Linux machines we get random hangs. It's likely the `BackgroundProcessor` calling `persist_manager`, which takes a write lock on `total_consistency_lock` inside the `write()` method of `ChannelManager`.
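Here's a self-contained sketch of the interleaving using plain `std` types - the names mirror the call chain above, but everything is a simplified stand-in, not LDK's actual code:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

struct ChannelManager {
    total_consistency_lock: RwLock<()>,
}

impl ChannelManager {
    // Step 2: takes the outer read lock, then processes monitor events.
    fn get_and_clear_pending_msg_events(&self) {
        let _read = self.total_consistency_lock.read().unwrap();
        thread::sleep(Duration::from_millis(100)); // window where a writer can queue
        self.channel_monitor_updated();
    }

    // Step 3: takes a second read lock on the same thread.
    fn channel_monitor_updated(&self) {
        let _read = self.total_consistency_lock.read().unwrap();
    }

    // Stand-in for persist_manager -> ChannelManager::write(): write lock.
    fn write(&self) {
        let _write = self.total_consistency_lock.write().unwrap();
    }
}

fn main() {
    let cm = Arc::new(ChannelManager { total_consistency_lock: RwLock::new(()) });

    let background = {
        let cm = Arc::clone(&cm);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            cm.write(); // queues a writer between the two read locks
        })
    };

    cm.get_and_clear_pending_msg_events(); // may hang here, depending on the OS
    background.join().unwrap();
    println!("no deadlock on this platform");
}
```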