There was a version of `tx_busy_handler()` in stacks-common also that I had to update
Rebased and fixed a conflict, please re-approve
> …which thread is holding the transaction at the time the caller tries (and fails) to run `tx_begin_immediate()`? To do this, we'd need to somehow get the lock-holder to print a stack trace
It almost certainly would with this PR. The first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs
> the first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs
As mentioned in the call, it's entirely possible that the thread acquires one lock and then spins while trying to acquire the second lock (or just spins in general, without regards to any other locks), which would not result in its stack trace being printed.
> it's entirely possible that the thread acquires one lock and then spins
I see what you mean now; if you don't set the busy handler, rusqlite defaults to returning an `SQLITE_BUSY` error instead. I grepped the code and we don't check for `SQLITE_BUSY`, so it seems unlikely we spin on this anywhere, but I guess it's possible we spin on all errors somewhere, and I don't know how I'd find that to fix it (I guess I'd have to check all the rusqlite functions which can acquire a lock)
How about I reduce the scope of this PR to just upgrading the logging to `warn!` and `error!`?
Also let me know what you think about using a re-entrant mutex as described in #5112
Gah, it appears GitHub ate my original comment. I don't know if it hit your inboxes or not. Lemme try again.
There are several places in the code where multiple transactions have to be initiated in order to carry out a task. For example, processing a Stacks block requires holding open a transaction to both the Stacks chain state as well as the sortition DB. The problem is that different threads may acquire transactions in conflicting orders, leading to livelocks and/or deadlocks, depending on how a failure to acquire one of the requisite locks is handled.
Basically, the way I'd like to get this fixed involves two tasks:
1. We enumerate all of the legal sequences of transactions that may be required, and require threads to use one of these sequences. We can enforce correct lock order at compile-time by providing a layer of indirection between threads and databases such that threads initiate multi-DB transactions via a permitted sequence.
2. To facilitate building the above and catching errors, we provide a way to track and publish the backtraces of every caller that tries to open a transaction. Then, if a deadlock or livelock occurs, we can easily determine who the responsible threads are, and which code bodies were involved.
Basically, we need to provide a singleton "database manager" that can ensure that whenever a thread acquires a particular set of transactions, they do so in the right order, and do so atomically (so, all transactions begin, or all abort). The database manager system would log a backtrace of each caller whenever it tries to begin a set of transactions, so if it gets stuck, we can figure out who is blocking it.
How do we build this in a way that minimizes the refactoring burden? I think we'd treat the singleton database manager state as a global within the code, and provide a static API for initiating these transaction sequences and recording (to the global) the callers' backtraces. Specifically, we'd need the following:
```rust
pub enum TransactionSequence { /* ... */ }
```
This would be the set of valid transaction sequences, which we convince ourselves cannot conflict with one another. Fortunately, there aren't that many of them.
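For illustration, the enum's variants might look something like this (the sequence names below are hypothetical placeholders, not an actual enumeration of the codebase's multi-DB tasks):

```rust
/// Hypothetical example of what the permitted sequences could look like.
/// The real variants would come from auditing every multi-DB task in the code.
pub enum TransactionSequence {
    /// Stacks chain state only
    ChainState,
    /// Sortition DB only
    SortitionDB,
    /// Block processing: chain state first, then the sortition DB
    ChainStateThenSortition,
}
```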
```rust
pub struct TransactionSet {
    pub(crate) txs: HashMap<TransactionSequence, DBTx>,
    /* ... */
}
```
This would represent a set of open transactions. It would have a `Drop` implementation that rolled back each `DBTx<'_>` in `::txs`, and would expose restartable `commit()` and `rollback()` methods:
```rust
impl TransactionSet {
    pub(crate) fn commit(self) -> Result<(), (Self, Error)> {
        /* ... */
    }

    pub(crate) fn rollback(self) -> Result<(), (Self, Error)> {
        /* ... */
    }
}
```
The `commit()` function would commit and consume each inner `DBTx<'_>`. On failure, it would return the relevant `Error`, as well as itself, so the caller can try to commit again until all transactions succeed, or the caller decides to panic.
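To make the retry semantics concrete, a caller under the proposed API might look roughly like the following sketch (the sequence variant, retry limit, and logging here are illustrative assumptions, not part of the proposal):

```rust
// Hypothetical caller: retry the commit a few times, then give up.
let mut tx_set = DBManager::tx_begin_multi(TransactionSequence::ChainStateThenSortition)?;
/* ... do work against the open transactions ... */
let mut attempts = 0;
loop {
    match tx_set.commit() {
        Ok(()) => break,
        Err((returned, e)) if attempts < 3 => {
            // `commit()` hands the set back to us, so we can simply try again
            warn!("Failed to commit transaction set ({:?}), retrying", e);
            tx_set = returned;
            attempts += 1;
        }
        Err((_, e)) => panic!("Unable to commit transaction set: {:?}", e),
    }
}
```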
The `rollback()` function would similarly roll back the transactions in a restartable way. The `Drop` implementation for `TransactionSet` would probably just call `.rollback().unwrap()`, since there's likely no way to recover from a partially-successful rollback without the caller's context.
```rust
pub mod DBManager {
    /* This is a sqlite DB connection; see below */
    static State: std::sync::Mutex<Option<DBConn>> = std::sync::Mutex::new(None);

    pub fn tx_begin_multi(sequence: TransactionSequence) -> Result<TransactionSet, Error> {
        /* ... */
    }
}
```
The `DBManager` acts as our global singleton. The function `tx_begin_multi()` takes one of the valid transaction sequence types as input, and instantiates a `TransactionSet` if and only if it was able to open transactions on all of the underlying databases in the right order. Otherwise, no transactions are opened, and an `Error` is returned.
The `DBManager` would write a backtrace of each caller who calls `tx_begin_multi()` to `State` (instantiating it lazily if need be). The `TransactionSet::commit()` and `TransactionSet::rollback()` functions would also manipulate `State` to drop the caller's record. So, the contents of `State` would represent the backtraces of every process that has acquired a `TransactionSet`, which we can compare against backtraces from callers who are trying and failing to begin transactions.
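A rough sketch of what recording the backtrace could look like inside `tx_begin_multi()`, assuming `State` lazily opens a small bookkeeping DB with a `tx_backtraces` table (the table schema, the `open_backtrace_db()` helper, and the error conversion are all assumptions for illustration):

```rust
use std::backtrace::Backtrace;

pub fn tx_begin_multi(sequence: TransactionSequence) -> Result<TransactionSet, Error> {
    let mut state = State.lock().unwrap();
    // Hypothetical helper that opens (or creates) the bookkeeping DB and its table
    let conn = state.get_or_insert_with(open_backtrace_db);

    // Record who is trying to open this sequence, and from where
    let bt = Backtrace::force_capture();
    conn.execute(
        "REPLACE INTO tx_backtraces (thread, sequence, backtrace) VALUES (?1, ?2, ?3)",
        rusqlite::params![
            format!("{:?}", std::thread::current().name()),
            format!("{:?}", sequence),
            format!("{}", bt),
        ],
    )?; // assumes Error: From<rusqlite::Error>

    // Then open each underlying transaction in the order dictated by `sequence`;
    // on any failure, roll back whatever was opened and return Err (elided)
    /* ... */
    todo!()
}
```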
We would remove all direct calls to `tx_begin_immediate()` in the code, and instead have them call `DBManager::tx_begin_multi()`.
I've opted to use a Sqlite DB to implement `State`, for three reasons:

- Even if `ptrace(2)` is unavailable (e.g. most containers), we can still diagnose deadlocks
- `State` can instead point to an in-memory DB so we won't incur additional I/O penalties for opening transactions.
- `State` […]
This approach addresses @obycode's original intent by capturing transaction acquisition orders within the type system, and it helps us find regressions and deadlocks more readily than only logging failures to open transactions (since more often than not, we also want to know which process is preventing the caller from opening them).
If, for the moment, we're just interested in improving the debugging process here, we could do something like the following:
In `db.rs::tx_begin_immediate_sqlite`, before returning `Ok(tx)`, do something like:

```rust
pub fn tx_begin_immediate_sqlite<'a>(conn: &'a mut Connection) -> Result<DBTx<'a>, sqlite_error> {
    conn.busy_handler(Some(tx_busy_handler))?;
    let tx = Transaction::new(conn, TransactionBehavior::Immediate)?;
    let mut locks_table = LOCKS_TABLE.lock().unwrap();
    // the debug format for `Connection` includes the path, so deref the Transaction to a Connection
    locks_table.insert(format!("{:?}", tx.deref()), format!("{:?}", std::thread::current().name()));
    Ok(tx)
}
```
And then when the busy handler reaches enough wait time, dump the stack and the locks table. That would at least tell us what thread last held the lock in question. This is kind of hacky (it doesn't clear the entries in the table when they close their transactions, it relies on debug formats), but it would require very little refactoring (no refactoring, really: the only lines changed are the ones above, plus an initializer for the static table), and should make the debugging process much easier.
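For reference, the static table and the dumping side could look roughly like the sketch below (the retry threshold, sleep interval, and use of `LazyLock` are illustrative choices, and the codebase's `warn!` macro is assumed to be in scope; rusqlite calls the busy handler with the number of attempts so far and keeps waiting while it returns `true`):

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// Maps the debug-format of a Connection (which includes the DB path) to the
// name of the thread that last opened an immediate transaction on it.
static LOCKS_TABLE: LazyLock<Mutex<HashMap<String, String>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

// Callback passed to rusqlite via `Connection::busy_handler()`
pub fn tx_busy_handler(run_count: i32) -> bool {
    // Hypothetical threshold: after enough retries, assume something is stuck
    // and dump everything we know about who last held each DB lock.
    if run_count > 10 {
        warn!("Still waiting for DB lock after {} attempts", run_count);
        warn!("Backtrace of waiting thread:\n{}", std::backtrace::Backtrace::force_capture());
        for (db, thread) in LOCKS_TABLE.lock().unwrap().iter() {
            warn!("DB {} last locked by thread {}", db, thread);
        }
    }
    // Sleep briefly and tell SQLite to keep retrying
    std::thread::sleep(std::time::Duration::from_millis(100));
    true
}
```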
@kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:

- Index the `HashMap` by thread id and/or thread name, so we'll only have the most recent stack trace for each thread
- Implement `drop()` for `DBTx` so that it clears the entry when it's dropped

> @kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:
While it would make the data structure somewhat clearer if entries are cleared, the most recent holder of a db's lock is sufficient to debug a deadlock. If Thread A is waiting on DB 1's lock, presumably whichever thread is the last to obtain DB 1's lock is still holding it (otherwise, Thread A would be able to get the lock).
> - Index the `HashMap` by thread id and/or thread name, so we'll only have the most recent stack trace for each thread
Threads can hold multiple transactions (indeed, that's the problem), so we'd want the table to map from dbs to threads.
> - Implement `drop()` for `DBTx` so that it clears the entry when it's dropped
While that could be a relatively straightforward refactor, I don't think it's necessary for the reasoning above: the last holder of each lock is really all the information we would need to debug deadlocks. This refactor isn't just a matter of implementing `drop()` for `DBTx`: `DBTx` isn't a struct, it's just a type alias, and a foreign type, so this would actually require creating a new struct, probably implementing `Deref`, etc. to simplify the refactor, and then making sure that everywhere in the codebase that currently expects a `rusqlite::Transaction` is ready to handle the new type. This would ultimately end up being somewhat similar to the larger refactor Jude proposed above.
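For illustration only, the kind of newtype being described here might look roughly like this (the names are made up; it also glosses over the fact that `Transaction::commit()` takes `self`, so a wrapper with `Drop` would need forwarding methods):

```rust
use std::ops::{Deref, DerefMut};

/// Hypothetical wrapper that could replace the `DBTx<'a>` type alias
pub struct TrackedTx<'a> {
    tx: rusqlite::Transaction<'a>,
    /// Key under which this transaction was recorded in LOCKS_TABLE
    lock_key: String,
}

impl<'a> Deref for TrackedTx<'a> {
    type Target = rusqlite::Transaction<'a>;
    fn deref(&self) -> &Self::Target {
        &self.tx
    }
}

impl<'a> DerefMut for TrackedTx<'a> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.tx
    }
}

impl<'a> Drop for TrackedTx<'a> {
    fn drop(&mut self) {
        // Clear this transaction's entry when it goes out of scope
        LOCKS_TABLE.lock().unwrap().remove(&self.lock_key);
    }
}
```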
@kantai okay, after looking into the debug format for `tx.deref()` this makes sense to me. I will add something like this. I'd like to get some quick logging improvements in that we can use now, because @jcnelson's suggestion is going to take some time to implement
@jcnelson I appreciate the detailed reply. I like this idea, but I think this would take a while to implement and test adequately, so I'll try to get some simple logging changes merged in the next couple days, so we can be sure to have something for the Nakamoto release
A couple questions/comments on this proposal:

- We could require in `tx_begin_multi()` that if a process currently holds any locks, the function panics. This would force threads to acquire all locks at once. Otherwise, it could still be used like `tx_begin_immediate()` is now
- How would `TransactionSequence` work? The only way I'm aware of to guarantee deadlocks can't happen is to force locks to always be acquired in the same order. For example, if you have locks A, B, and C, a thread can lock A, or B, or A->C, but not B->A. So would this enum just establish a locking order, and error/panic if that order is violated? (See the sketch below.)
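A minimal sketch of that ordering rule, using made-up lock names and a per-thread record of held locks; this only illustrates the "A, or B, or A->C, but never B->A" invariant, and is not a proposed implementation:

```rust
use std::cell::RefCell;

/// Hypothetical global lock order (lower ranks must be taken first)
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum LockId {
    A,
    B,
    C,
}

thread_local! {
    // Locks currently held by this thread, in acquisition order
    static HELD: RefCell<Vec<LockId>> = RefCell::new(Vec::new());
}

/// Panics if acquiring `next` would violate the global order (e.g. B -> A)
fn check_lock_order(next: LockId) {
    HELD.with(|held| {
        let mut held = held.borrow_mut();
        if let Some(&last) = held.last() {
            assert!(
                next > last,
                "lock order violation: {:?} acquired while holding {:?}",
                next,
                last
            );
        }
        held.push(next);
    });
}
```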
@jcnelson I've taken @kantai's suggestion and updated `tx_busy_handler()` to print out the `HashMap` of all locks held and panic when a deadlock is detected. Let me know if this is acceptable or what changes I should make (return an error instead of panicking? Continue the deadlock and just log an error?)
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Description
This PR makes two changes to `tx_busy_handler()`, which is passed as a callback to rusqlite and called if waiting too long for a DB lock:

- Previously, the log messages in `tx_busy_handler()` were debug level; now we have warn/error level messages if we are waiting too long
- It now prints the table of locks held and generates an error when a deadlock is detected

Note that this PR doesn't make deadlocks less likely to occur, it just makes them easier to find, and generates an error when they are detected, which may be recoverable or may crash the application (depending on how the caller handles the error)
Applicable issues
#5112
Additional info (benefits, drawbacks, caveats)
I could have added precise timing using `Instant::now()`, but that would require wrapping `tx_busy_handler()` in a closure, which creates lifetime issues because it's a callback. This could be addressed by adding the `Box`ed callback to the `DBTx` type, but this would require changing a lot of code, which I didn't think was worth it

Checklist
- Required documentation changes (e.g., `docs/rpc/openapi.yaml` and `rpc-endpoints.md` for v2 endpoints, `event-dispatcher.md` for new events)
- New clarity functions have corresponding PR in the `clarity-benchmarking` repo
- New integration test(s) added to `bitcoin-tests.yml`