There was a version of `tx_busy_handler()` in stacks-common also that I had to update
Rebased and fixed a conflict, please re-approve
> …which thread is holding the transaction at the time the caller tries (and fails) to run `tx_begin_immediate()`? To do this, we'd need to somehow get the lock-holder to print a stack trace
It almost certainly would with this PR. The first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs
> the first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs
As mentioned in the call, it's entirely possible that the thread acquires one lock and then spins while trying to acquire the second lock (or just spins in general, without regards to any other locks), which would not result in its stack trace being printed.
> it's entirely possible that the thread acquires one lock and then spins
I see what you mean now; if you don't set the busy handler, rusqlite defaults to returning an `SQLITE_BUSY` error instead. I grepped the code and we don't check for `SQLITE_BUSY`, so it seems unlikely we spin on this anywhere, but I guess it's possible we spin on all errors somewhere, and I don't know how I'd find that to fix it (I guess I'd have to check all the rusqlite functions which can acquire a lock)
How about I reduce the scope of this PR to just upgrading the logging to `warn!` and `error!`?
Also let me know what you think about using a re-entrant mutex as described in #5112
Gah, it appears GitHub ate my original comment. I don't know if it hit your inboxes or not. Lemme try again.
There are several places in the code where multiple transactions have to be initiated in order to carry out a task. For example, processing a Stacks block requires holding open a transaction to both the Stacks chain state as well as the sortition DB. The problem is that different threads may acquire transactions in conflicting orders, leading to livelocks and/or deadlocks, depending on how a failure to acquire one of the requisite locks is handled.
Basically, the way I'd like to get this fixed involves two tasks:
1. We enumerate all of the legal sequences of transactions that may be required, and require threads to use one of these sequences. We can enforce correct lock order at compile-time by providing a layer of indirection between threads and databases such that threads initiate multi-DB transactions via a permitted sequence.
2. To facilitate building the above and catching errors, we provide a way to track and publish the backtraces of every caller that tries to open a transaction. Then, if a deadlock or livelock occurs, we can easily determine who the responsible threads are, and which code bodies were involved.
Basically, we need to provide a singleton "database manager" that can ensure that whenever a thread acquires a particular set of transactions, they do so in the right order, and do so atomically (so, all transactions begin, or all abort). The database manager system would log a backtrace of each caller whenever it tries to begin a set of transactions, so if it gets stuck, we can figure out who is blocking it.
How do we build this in a way that minimizes the refactoring burden? I think we'd treat the singleton database manager state as a global within the code, and provide a static API for initiating these transaction sequences and recording (to the global) the callers' backtraces. Specifically, we'd need the following:
```rust
pub enum TransactionSequence { /* ... */ }
```
This would be the set of valid transaction sequences, which we convince ourselves cannot conflict with one another. Fortunately, there aren't that many of them.
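For illustration, the enum's variants might look something like this (the sequence names below are hypothetical placeholders, not an actual enumeration of the codebase's multi-DB tasks):

```rust
/// Hypothetical example of what the permitted sequences could look like.
/// The real variants would come from auditing every multi-DB task in the code.
pub enum TransactionSequence {
    /// Stacks chain state only
    ChainState,
    /// Sortition DB only
    SortitionDB,
    /// Block processing: chain state first, then the sortition DB
    ChainStateThenSortition,
}
```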
```rust
pub struct TransactionSet {
    pub(crate) txs: HashMap<TransactionSequence, DBTx>,
    /* ... */
}
```
This would represent a set of open transactions. It would have a `Drop` implementation that rolled back each `DBTx<'_>` in `::txs`, and would expose restartable `commit()` and `rollback()` methods:
```rust
impl TransactionSet {
    pub(crate) fn commit(self) -> Result<(), (Self, Error)> {
        /* ... */
    }

    pub(crate) fn rollback(self) -> Result<(), (Self, Error)> {
        /* ... */
    }
}
```
The `commit()` function would commit and consume each inner `DBTx<'_>`. On failure, it would return the relevant `Error`, as well as itself, so the caller can try to commit again until all transactions succeed, or the caller decides to panic.
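To make the retry semantics concrete, a caller under the proposed API might look roughly like the following sketch (the sequence variant, retry limit, and logging here are illustrative assumptions, not part of the proposal):

```rust
// Hypothetical caller: retry the commit a few times, then give up.
let mut tx_set = DBManager::tx_begin_multi(TransactionSequence::ChainStateThenSortition)?;
/* ... do work against the open transactions ... */
let mut attempts = 0;
loop {
    match tx_set.commit() {
        Ok(()) => break,
        Err((returned, e)) if attempts < 3 => {
            // `commit()` hands the set back to us, so we can simply try again
            warn!("Failed to commit transaction set ({:?}), retrying", e);
            tx_set = returned;
            attempts += 1;
        }
        Err((_, e)) => panic!("Unable to commit transaction set: {:?}", e),
    }
}
```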
The `rollback()` function would similarly roll back the transactions in a restartable way. The `Drop` implementation for `TransactionSet` would probably just call `.rollback().unwrap()`, since there's likely no way to recover from a partially-successful rollback without the caller's context.
```rust
pub mod DBManager {
    /* This is a sqlite DB connection; see below */
    static State: std::sync::Mutex<Option<DBConn>> = std::sync::Mutex::new(None);

    pub fn tx_begin_multi(sequence: TransactionSequence) -> Result<TransactionSet, Error> {
        /* ... */
    }
}
```
The `DBManager` acts as our global singleton. The function `tx_begin_multi()` takes one of the valid transaction sequence types as input, and instantiates a `TransactionSet` if and only if it was able to open transactions on all of the underlying databases in the right order. Otherwise, no transactions are opened, and an `Error` is returned.
The `DBManager` would write a backtrace of each caller who calls `tx_begin_multi()` to `State` (instantiating it lazily if need be). The `TransactionSet::commit()` and `TransactionSet::rollback()` functions would also manipulate `State` to drop the caller's record. So, the contents of `State` would represent the backtraces of every process that has acquired a `TransactionSet`, which we can compare against backtraces from callers who are trying and failing to begin transactions.
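A rough sketch of what recording the backtrace could look like inside `tx_begin_multi()`, assuming `State` lazily opens a small bookkeeping DB with a `tx_backtraces` table (the table schema, the `open_backtrace_db()` helper, and the error conversion are all assumptions for illustration):

```rust
use std::backtrace::Backtrace;

pub fn tx_begin_multi(sequence: TransactionSequence) -> Result<TransactionSet, Error> {
    let mut state = State.lock().unwrap();
    // Hypothetical helper that opens (or creates) the bookkeeping DB and its table
    let conn = state.get_or_insert_with(open_backtrace_db);

    // Record who is trying to open this sequence, and from where
    let bt = Backtrace::force_capture();
    conn.execute(
        "REPLACE INTO tx_backtraces (thread, sequence, backtrace) VALUES (?1, ?2, ?3)",
        rusqlite::params![
            format!("{:?}", std::thread::current().name()),
            format!("{:?}", sequence),
            format!("{}", bt),
        ],
    )?; // assumes Error: From<rusqlite::Error>

    // Then open each underlying transaction in the order dictated by `sequence`;
    // on any failure, roll back whatever was opened and return Err (elided)
    /* ... */
    todo!()
}
```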
We would remove all direct calls to `tx_begin_immediate()` in the code, and instead have them call `DBManager::tx_begin_multi()`.
I've opted to use a Sqlite DB to implement `State`, for three reasons:

- Even if `ptrace(2)` is unavailable (e.g. most containers), we can still diagnose deadlocks
- `State` can instead point to an in-memory DB so we won't incur additional I/O penalties for opening transactions.
- `State` […]
This approach addresses @obycode's original intent by capturing transaction acquisition orders within the type system, and it helps us find regressions and deadlocks more readily than only logging failures to open transactions (since more often than not, we also want to know which process is preventing the caller from opening them).
If, for the moment, we're just interested in improving the debugging process here, we could do something like the following:
In `db.rs::tx_begin_immediate_sqlite`, before returning `Ok(tx)`, do something like:

```rust
pub fn tx_begin_immediate_sqlite<'a>(conn: &'a mut Connection) -> Result<DBTx<'a>, sqlite_error> {
    conn.busy_handler(Some(tx_busy_handler))?;
    let tx = Transaction::new(conn, TransactionBehavior::Immediate)?;
    let mut locks_table = LOCKS_TABLE.lock().unwrap();
    // the debug format for `Connection` includes the path, so deref the Transaction to a Connection
    locks_table.insert(format!("{:?}", tx.deref()), format!("{:?}", std::thread::current().name()));
    Ok(tx)
}
```
And then when the busy handler reaches enough wait time, dump the stack and the locks table. That would at least tell us what thread last held the lock in question. This is kind of hacky (it doesn't clear the entries in the table when they close their transactions, it relies on debug formats), but it would require very little refactoring (no refactoring, really: the only lines changed are the ones above, plus an initializer for the static table), and should make the debugging process much easier.
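For reference, the static table and the dumping side could look roughly like the sketch below (the retry threshold, sleep interval, and use of `LazyLock` are illustrative choices, and the codebase's `warn!` macro is assumed to be in scope; rusqlite calls the busy handler with the number of attempts so far and keeps waiting while it returns `true`):

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// Maps the debug-format of a Connection (which includes the DB path) to the
// name of the thread that last opened an immediate transaction on it.
static LOCKS_TABLE: LazyLock<Mutex<HashMap<String, String>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

// Callback passed to rusqlite via `Connection::busy_handler()`
pub fn tx_busy_handler(run_count: i32) -> bool {
    // Hypothetical threshold: after enough retries, assume something is stuck
    // and dump everything we know about who last held each DB lock.
    if run_count > 10 {
        warn!("Still waiting for DB lock after {} attempts", run_count);
        warn!("Backtrace of waiting thread:\n{}", std::backtrace::Backtrace::force_capture());
        for (db, thread) in LOCKS_TABLE.lock().unwrap().iter() {
            warn!("DB {} last locked by thread {}", db, thread);
        }
    }
    // Sleep briefly and tell SQLite to keep retrying
    std::thread::sleep(std::time::Duration::from_millis(100));
    true
}
```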
@kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:

- Index the `HashMap` by thread id and/or thread name, so we'll only have the most recent stack trace for each thread
- Implement `drop()` for `DBTx` so that it clears the entry when it's dropped

> @kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:
While it would make the data structure somewhat clearer if entries are cleared, the most recent holder of a db's lock is sufficient to debug a deadlock. If Thread A is waiting on DB 1's lock, presumably whichever thread is the last to obtain DB 1's lock is still holding it (otherwise, Thread A would be able to get the lock).
> - Index the `HashMap` by thread id and/or thread name, so we'll only have the most recent stack trace for each thread
Threads can hold multiple transactions (indeed, that's the problem), so we'd want the table to map from dbs to threads.
> - Implement `drop()` for `DBTx` so that it clears the entry when it's dropped
While that could be a relatively straightforward refactor, I don't think it's necessary for the reasoning above: the last holder of each lock is really all the information we would need to debug deadlocks. This refactor isn't just a matter of implementing `drop()` for `DBTx`: `DBTx` isn't a struct, it's just a type alias, and a foreign type, so this would actually require creating a new struct, probably implementing `Deref`, etc. to simplify the refactor, and then making sure that everywhere in the codebase that currently expects a `rusqlite::Transaction` is ready to handle the new type. This would ultimately end up being somewhat similar to the larger refactor Jude proposed above.
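For illustration only, the kind of newtype being described here might look roughly like this (the names are made up; it also glosses over the fact that `Transaction::commit()` takes `self`, so a wrapper with `Drop` would need forwarding methods):

```rust
use std::ops::{Deref, DerefMut};

/// Hypothetical wrapper that could replace the `DBTx<'a>` type alias
pub struct TrackedTx<'a> {
    tx: rusqlite::Transaction<'a>,
    /// Key under which this transaction was recorded in LOCKS_TABLE
    lock_key: String,
}

impl<'a> Deref for TrackedTx<'a> {
    type Target = rusqlite::Transaction<'a>;
    fn deref(&self) -> &Self::Target {
        &self.tx
    }
}

impl<'a> DerefMut for TrackedTx<'a> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.tx
    }
}

impl<'a> Drop for TrackedTx<'a> {
    fn drop(&mut self) {
        // Clear this transaction's entry when it goes out of scope
        LOCKS_TABLE.lock().unwrap().remove(&self.lock_key);
    }
}
```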
@kantai okay, after looking into the debug format for `tx.deref()` this makes sense to me. I will add something like this. I'd like to get some quick logging improvements in that we can use now, because @jcnelson's suggestion is going to take some time to implement
@jcnelson I appreciate the detailed reply. I like this idea, but I think this would take a while to implement and test adequately, so I'll try to get some simple logging changes merged in the next couple days, so we can be sure to have something for the Nakamoto release
A couple questions/comments on this proposal:

- We could require in `tx_begin_multi()` that if a process currently holds any locks, the function panics. This would force threads to acquire all locks at once. Otherwise, it could still be used like `tx_begin_immediate()` is now
- How would `TransactionSequence` work? The only way I'm aware of to guarantee deadlocks can't happen is to force locks to always be acquired in the same order. For example, if you have locks A, B, and C, a thread can lock A, or B, or A->C, but not B->A. So would this enum just establish a locking order, and error/panic if that order is violated? (See the sketch below.)
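A minimal sketch of that ordering rule, using made-up lock names and a per-thread record of held locks; this only illustrates the "A, or B, or A->C, but never B->A" invariant, and is not a proposed implementation:

```rust
use std::cell::RefCell;

/// Hypothetical global lock order (lower ranks must be taken first)
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum LockId {
    A,
    B,
    C,
}

thread_local! {
    // Locks currently held by this thread, in acquisition order
    static HELD: RefCell<Vec<LockId>> = RefCell::new(Vec::new());
}

/// Panics if acquiring `next` would violate the global order (e.g. B -> A)
fn check_lock_order(next: LockId) {
    HELD.with(|held| {
        let mut held = held.borrow_mut();
        if let Some(&last) = held.last() {
            assert!(
                next > last,
                "lock order violation: {:?} acquired while holding {:?}",
                next,
                last
            );
        }
        held.push(next);
    });
}
```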
@jcnelson I've taken @kantai's suggestion and updated `tx_busy_handler()` to print out the `HashMap` of all locks held and panic when a deadlock is detected. Let me know if this is acceptable or what changes I should make (return an error instead of panicking? Continue the deadlock and just log an error?)
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Description
This PR makes two changes to `tx_busy_handler()`, which is passed as a callback to rusqlite and called if waiting too long for a DB lock:

- Previously, the log messages in `tx_busy_handler()` were debug level; now we have warn/error level messages if we are waiting too long
- It now prints the table of locks held and generates an error when a deadlock is detected

Note that this PR doesn't make deadlocks less likely to occur, it just makes them easier to find, and generates an error when they are detected, which may be recoverable or may crash the application (depending on how the caller handles the error)
Applicable issues
#5112
Additional info (benefits, drawbacks, caveats)
I could have added precise timing using `Instant::now()`, but that would require wrapping `tx_busy_handler()` in a closure, which creates lifetime issues because it's a callback. This could be addressed by adding the `Box`ed callback to the `DBTx` type, but this would require changing a lot of code, which I didn't think was worth it

Checklist
- Required documentation changes (e.g., `docs/rpc/openapi.yaml` and `rpc-endpoints.md` for v2 endpoints, `event-dispatcher.md` for new events)
- New clarity functions have corresponding PR in the `clarity-benchmarking` repo
- New integration test(s) added to `bitcoin-tests.yml`