Transaction locking fails prematurely due to cost tracker constraints

jstarry commented 9 months ago

Problem

When selecting transactions from a batch for execution in banking stage, there are two preliminary steps: qos cost reservation and global lock acquisition. In the first step, we iterate over a batch of transactions and reserve qos service cost capacity until the capacity is depleted. Any transactions towards the end of the batch list that cannot fit will be marked for retry and skipped for the current batch. Then, in the next step, we again iterate over the batch of transactions and attempt to acquire locks for any un-skipped transactions. If any locks fail, there is leftover cost capacity that could have be used for the previously skipped transactions.

Credit: @crispheaney

Proposed Solution

Combine the loops for qos cost reservation and global lock acquisition and use the following process:

Iterate to next transaction
Reserve transaction cost in qos service
Attempt to acquire locks
If lock acquisition fails, unreserve transaction cost in qos service
Back to top

cc @taozhu-chicago @apfitzge

jstarry commented 9 months ago

I think this solution is better than what I proposed in https://github.com/solana-labs/solana/issues/34807 and this issue can replace that one if we agree this is a better approach.

tao-stones commented 9 months ago

agree this is better approach.

apfitzge commented 9 months ago

I like this approach, but we should be a bit careful with the locks on shared resources: cost_tracker, account_locks

we weant to minimize the time we hold these locks
we need to make sure we are always grabbing these locks in a "safe" order so that we cannot dead-lock (i.e. don't allow thread 1 to take cost_tracker then account_locks, but thread 2 takes account_locks then cost_tracker).

In pursuit of 1, any transaction cost calculation should be done outside of the locks - I'm fairly certain this is already the case. And to be clear, I think we should also not grab/release locks per tx, and do it for the batch like we do now.

tao-stones commented 9 months ago

2. we need to make sure we are always grabbing these locks in a "safe" order so that we cannot dead-lock (i.e. don't allow thread 1 to take cost_tracker then account_locks, but thread 2 takes account_locks then cost_tracker).

Even I don't think we take these two locks together today, I'm not so confident we can keep stable global order for them. Rather like to avoid taking multiple locks.

tao-stones commented 9 months ago

Been back-n-forth on several tries, I found I largely agree with @apfitzge points above. To not to grab two locks at a time, and do it in batch instead of tx-by-tx. I'm leaning to original solution: reserve CU for batch -> acquire locks for batch -> remove CU for transactions failed get accounts lock, and do it in batch. (first two steps are as-is, just adding 3rd step immediately after locking). What are your opinions?

jstarry commented 9 months ago

Yeah taking both locks is probably too risky.. How about we first reserve as much as cost as we need in qos and then keep track of how much of that cost we've used so far as we do the transaction account locking loop? After locking, we release any reserved cost back to the qos cost tracker for other workers.

apfitzge commented 9 months ago

Yeah taking both locks is probably too risky.. How about we first reserve as much as cost as we need in qos and then keep track of how much of that cost we've used so far as we do the transaction account locking loop? After locking, we release any reserved cost back to the qos cost tracker for other workers.

This still seems to suffer from the issue though, within the same thread.

Let's say we're close to our limits and we receive a batch [Tx1, Tx2]. Taking both costs would put us over the block-limits, but either can fit. If Tx1 cannot take locks, then we should ideally still reserve & execute Tx2.

If we (reserve qos, take locks, free qos) then neither tx will get executed in this case.

apfitzge commented 9 months ago

It seems we also may want different solutions for old vs new scheduler. Old scheduler has a high probability that locking will fail, but new scheduler have near-zero lock-failures.

So it'd make sense for new scheduler to just lock and then reserve in qos. That could be problematic in the old scheduler where we succeed in locking, which prevents other threads from locking, then cannot reserve in qos.

If we're hitting block-limits this still seems better, because qos getting maxed-out on block-limits means all other threads are blocked. So it seems in either case we want to have the order: (lock, qos reserve, unlock)

wdyt?

tao-stones commented 9 months ago

Yeah taking both locks is probably too risky.. How about we first reserve as much as cost as we need in qos and then keep track of how much of that cost we've used so far as we do the transaction account locking loop? After locking, we release any reserved cost back to the qos cost tracker for other workers.

This still seems to suffer from the issue though, within the same thread.

Let's say we're close to our limits and we receive a batch [Tx1, Tx2]. Taking both costs would put us over the block-limits, but either can fit. If Tx1 cannot take locks, then we should ideally still reserve & execute Tx2.

If we (reserve qos, take locks, free qos) then neither tx will get executed in this case.

Yea, it does not help on this particular batch, but released cost_tracker space helps other threads to book their txs.

tao-stones commented 9 months ago

So it seems in either case we want to have the order: (lock, qos reserve, unlock)

Yea, I initially considered this (without thinking about new scheduler behavior). I felt better not to messing around accounts_lock too much, but that might not be a good argument anyway.

solana-labs / solana

Transaction locking fails prematurely due to cost tracker constraints #34825

Problem

Proposed Solution