Open jstarry opened 9 months ago
I think this solution is better than what I proposed in https://github.com/solana-labs/solana/issues/34807 and this issue can replace that one if we agree this is a better approach.
agree this is better approach.
I like this approach, but we should be a bit careful with the locks on shared resources: cost_tracker, account_locks
In pursuit of 1, any transaction cost calculation should be done outside of the locks - I'm fairly certain this is already the case. And to be clear, I think we should also not grab/release locks per tx, and do it for the batch like we do now.
2. we need to make sure we are always grabbing these locks in a "safe" order so that we cannot dead-lock (i.e. don't allow thread 1 to take cost_tracker then account_locks, but thread 2 takes account_locks then cost_tracker).
Even I don't think we take these two locks together today, I'm not so confident we can keep stable global order for them. Rather like to avoid taking multiple locks.
Been back-n-forth on several tries, I found I largely agree with @apfitzge points above. To not to grab two locks at a time, and do it in batch instead of tx-by-tx. I'm leaning to original solution: reserve CU for batch -> acquire locks for batch -> remove CU for transactions failed get accounts lock, and do it in batch. (first two steps are as-is, just adding 3rd step immediately after locking). What are your opinions?
Yeah taking both locks is probably too risky.. How about we first reserve as much as cost as we need in qos and then keep track of how much of that cost we've used so far as we do the transaction account locking loop? After locking, we release any reserved cost back to the qos cost tracker for other workers.
Yeah taking both locks is probably too risky.. How about we first reserve as much as cost as we need in qos and then keep track of how much of that cost we've used so far as we do the transaction account locking loop? After locking, we release any reserved cost back to the qos cost tracker for other workers.
This still seems to suffer from the issue though, within the same thread.
Let's say we're close to our limits and we receive a batch [Tx1, Tx2]. Taking both costs would put us over the block-limits, but either can fit. If Tx1 cannot take locks, then we should ideally still reserve & execute Tx2.
If we (reserve qos, take locks, free qos) then neither tx will get executed in this case.
It seems we also may want different solutions for old vs new scheduler. Old scheduler has a high probability that locking will fail, but new scheduler have near-zero lock-failures.
So it'd make sense for new scheduler to just lock and then reserve in qos. That could be problematic in the old scheduler where we succeed in locking, which prevents other threads from locking, then cannot reserve in qos.
If we're hitting block-limits this still seems better, because qos getting maxed-out on block-limits means all other threads are blocked. So it seems in either case we want to have the order: (lock, qos reserve, unlock)
wdyt?
Yeah taking both locks is probably too risky.. How about we first reserve as much as cost as we need in qos and then keep track of how much of that cost we've used so far as we do the transaction account locking loop? After locking, we release any reserved cost back to the qos cost tracker for other workers.
This still seems to suffer from the issue though, within the same thread.
Let's say we're close to our limits and we receive a batch [Tx1, Tx2]. Taking both costs would put us over the block-limits, but either can fit. If Tx1 cannot take locks, then we should ideally still reserve & execute Tx2.
If we (reserve qos, take locks, free qos) then neither tx will get executed in this case.
Yea, it does not help on this particular batch, but released cost_tracker space helps other threads to book their txs.
So it seems in either case we want to have the order: (lock, qos reserve, unlock)
Yea, I initially considered this (without thinking about new scheduler behavior). I felt better not to messing around accounts_lock too much, but that might not be a good argument anyway.
Problem
When selecting transactions from a batch for execution in banking stage, there are two preliminary steps: qos cost reservation and global lock acquisition. In the first step, we iterate over a batch of transactions and reserve qos service cost capacity until the capacity is depleted. Any transactions towards the end of the batch list that cannot fit will be marked for retry and skipped for the current batch. Then, in the next step, we again iterate over the batch of transactions and attempt to acquire locks for any un-skipped transactions. If any locks fail, there is leftover cost capacity that could have be used for the previously skipped transactions.
Credit: @crispheaney
Proposed Solution
Combine the loops for qos cost reservation and global lock acquisition and use the following process:
cc @taozhu-chicago @apfitzge