stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org

Parallel Soroban application prototype #4366

Open dmkozh opened 1 week ago

dmkozh commented 1 week ago

This is the tracking/discussion issue for the work on applying the Soroban phase in parallel using multithreading. The purpose of the prototype is to come up with a Core build that can be used for basic benchmarking of parallel application, while allowing most of the work to be reused if the prototype looks good.

The scope of the prototype can be limited to the following:

Notably, no changes to the transaction queue or overlay are necessary initially, even though we might need to revisit these for the actual release. If the prototype works as expected, the work will be presented as a CAP.

The initial implementation plan is as follows:

MonsieurNicolas commented 1 week ago

A couple of questions on the XDR structure.

In

typedef TransactionEnvelope DependentTxCluster<>;
typedef DependentTxCluster TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

Why do we need each thread to have a "DependentTxCluster"? I would expect that layer (consensus and execution) to only see the actual schedule (how we got there from a clustering point of view is an implementation detail). So each stage would be made up of threads that themselves have an array of transactions.

We'd end up with

typedef TransactionEnvelope TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

The second question is: instead of

typedef TransactionEnvelope DependentTxCluster<>;
typedef DependentTxCluster TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

struct ParallelTxsComponent
{
  int64* baseFee;
  ParallelTxExecutionStage executionStages<>;
};

did you consider

typedef TransactionEnvelope DependentTxCluster<>;
typedef DependentTxCluster TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

struct ParallelTxsComponent
{
  int64* baseFee;
  ParallelTxExecutionStage executionStage;
};

and just allow for multiple transaction phases of type "1", "Parallel tx" in a generalized transaction set. That way we won't have to do anything special about base fees that could be different between stages.

dmkozh commented 1 week ago

Why do we need each thread to have a "DependentTxCluster"? I would expect at that layer (consensus and execution), to only see the actual schedule (how we got there from a clustering point of view is an implementation detail).

The idea here is not to expose the implementation details, but to allow for more efficient replay of these tx sets given more than numThreads cores (maybe even during apply, in case this ever has any benefit for validator performance). I realize that this might be debatable as it can be treated as 'premature optimization', but OTOH it would be pretty annoying to go back and re-compute data deps in case we ever want to do that with old tx sets. So I would rather call this 'future-proofing'. But I'm not too settled on this; if you think that an additional few hundred bytes per tx set is not worth the potential catchup speedup, I can get behind that as well.
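
To make the replay idea concrete, here is a minimal sketch (all names and types are hypothetical, not stellar-core APIs): since transactions in different clusters have no data dependencies, a replayer with more cores than numThreads could treat every cluster in a stage as an independent unit of sequential work instead of replaying thread-by-thread.

#include <atomic>
#include <thread>
#include <vector>

struct Tx
{
    // stand-in for a TransactionEnvelope
};
using Cluster = std::vector<Tx>;         // DependentTxCluster
using ExecThread = std::vector<Cluster>; // TxExecutionThread
using Stage = std::vector<ExecThread>;   // ParallelTxExecutionStage

void applyTx(Tx const&) { /* placeholder for the actual apply logic */ }

// Replay a single stage with an arbitrary number of workers: clusters stay
// sequential internally, but each cluster becomes its own work unit rather
// than being pinned to the thread it was nominated in.
void replayStage(Stage const& stage, size_t numWorkers)
{
    std::vector<Cluster const*> units;
    for (auto const& thread : stage)
        for (auto const& cluster : thread)
            units.push_back(&cluster);

    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (size_t i = 0; i < numWorkers; ++i)
        workers.emplace_back([&] {
            for (size_t j = next++; j < units.size(); j = next++)
                for (auto const& tx : *units[j])
                    applyTx(tx); // in-order within a cluster
        });
    for (auto& w : workers)
        w.join();
}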

and just allow for multiple transaction phases of type "1"

I did consider this among the few options we have. While this looks more elegant XDR-wise, it has some issues:

dmkozh commented 6 days ago

but OTOH it would be pretty annoying to go back and re-compute data deps

I've thought a bit more about this and I guess it won't be too hard (specifically, not computationally hard) to infer the clusters before the apply stage if we ever want to. As long as we don't observe TTL changes within a thread, we should be fine. I'll work on removing the clusters, which will simplify the structure a bit.
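
For reference, one way such inference could look (purely illustrative; the footprint representation, key type, and conflict rule are assumptions here, not the actual planned implementation) is a union-find over the per-transaction footprints within a thread, uniting transactions whenever they touch the same ledger key and at least one of them writes it:

#include <cstddef>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical footprint: ledger keys are just strings for illustration.
struct Footprint
{
    std::vector<std::string> readOnly;
    std::vector<std::string> readWrite;
};

struct DisjointSet
{
    std::vector<size_t> parent;
    explicit DisjointSet(size_t n) : parent(n)
    {
        std::iota(parent.begin(), parent.end(), 0);
    }
    size_t find(size_t x)
    {
        return parent[x] == x ? x : parent[x] = find(parent[x]);
    }
    void unite(size_t a, size_t b)
    {
        parent[find(a)] = find(b);
    }
};

// Returns a representative cluster id per transaction: two txs end up in the
// same cluster iff they are transitively connected through read/write or
// write/write conflicts on some key.
std::vector<size_t> inferClusters(std::vector<Footprint> const& txs)
{
    DisjointSet ds(txs.size());
    std::map<std::string, size_t> lastWriter;
    std::map<std::string, std::vector<size_t>> readers;

    for (size_t i = 0; i < txs.size(); ++i)
    {
        for (auto const& k : txs[i].readWrite)
        {
            if (auto it = lastWriter.find(k); it != lastWriter.end())
                ds.unite(i, it->second);
            for (auto r : readers[k])
                ds.unite(i, r);
            lastWriter[k] = i;
        }
        for (auto const& k : txs[i].readOnly)
        {
            if (auto it = lastWriter.find(k); it != lastWriter.end())
                ds.unite(i, it->second);
            readers[k].push_back(i);
        }
    }

    std::vector<size_t> clusterIds(txs.size());
    for (size_t i = 0; i < txs.size(); ++i)
        clusterIds[i] = ds.find(i);
    return clusterIds;
}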

MonsieurNicolas commented 6 days ago

We don't really want to set different base fees per stage

I am pretty sure we will need this fairly rapidly in follow-up versions. As soon as the number of transactions that cause a lot of contention is high, you'll have to place them in a step with low or no parallelization: at that point you'll have to decide between increasing the size of the non-parallel step (excluding a lot of transactions that can execute in parallel) or excluding other anti-parallel transactions. In both cases, you can't just apply the same base fee to parallel and anti-parallel transactions: anti-parallel transactions have to be more expensive in those types of situations than the "loss of opportunity" in parallel ones. Intuitively, I think the base fee for anti-parallel transactions should be such that steps end up with the same total inclusion fee (if they have the same max CPU instructions per thread). On top of that, we may have situations where we'll want CAP-0042-like behavior between groups (because of trading or other high-level scenarios), and steps will be the natural place to make this happen (this is probably further out).
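
To put a toy number on that intuition (just an illustration, not a proposed mechanism): if the parallel and anti-parallel steps should collect the same total inclusion fee, the anti-parallel base fee ends up scaled by how many fewer transactions that step can hold.

#include <cstdint>
#include <iostream>

// Base fee a low-parallelism step would need so that its total inclusion fee
// matches that of a fully parallel step (toy model: uniform base fee per tx).
int64_t matchingBaseFee(int64_t parallelTxCount, int64_t parallelBaseFee,
                        int64_t antiParallelTxCount)
{
    return parallelTxCount * parallelBaseFee / antiParallelTxCount;
}

int main()
{
    // e.g. a parallel step with 8 threads x 50 txs at a base fee of 100,
    // vs. a contended step that only fits 50 txs on a single thread:
    std::cout << matchingBaseFee(8 * 50, 100, 50) << "\n"; // prints 800
    return 0;
}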

dmkozh commented 6 days ago

As soon as the number of transactions that cause a lot of contention is high,

That's not a given; I actually think that it's pretty unlikely that we'll ever have high contention that would also be detrimental to parallelization. Specifically, the transactions have to be transitively conflicting in order to be truly detrimental to parallelism, in which case there is really no good way to pinpoint 'anti-parallel' transactions.

With the current approach, if several transactions want to write the same entry, they will in the worst case take up one thread, split across different stages. That's why I'm suggesting that if we wanted to do more granular surge pricing, we'd need to do that per cluster of dependent transactions, rather than per stage. But I'm really hesitant to design around that until we have good evidence that it would solve a real issue.

It's also pretty hard to evaluate the 'lost opportunity'. There is technically nothing wrong with a single thread being occupied by dependent transactions only; even if they were independent, we still wouldn't be able to run them in parallel if we don't have enough threads. Dependent transactions do decrease the efficiency of utilizing the ledger space, but they do that in a pretty arbitrary fashion depending on overall tx set composition (i.e. they decrease the efficiency of bin packing to some degree).

I think clusters of dependent transactions are very similar to plain large transactions: we will usually have more efficient utilization with smaller transactions than with larger ones. A cluster of dependent transactions can basically be treated as a single large transaction that needs the sum of its instructions, so if the sum is below the per-transaction limit, this is really not any different from just adding a large transaction (and we don't charge proportional inclusion fees ATM). If the sum is above the limit then there is more impact, but it's hard to evaluate, and it's definitely not a step function that becomes non-zero only after reaching the per-tx instruction limit.
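
For illustration (a hypothetical sketch, not the actual tx set building code), the bin-packing view would treat each cluster as one item whose size is the sum of its transactions' instructions and pack items into threads, e.g. first-fit decreasing:

#include <algorithm>
#include <cstdint>
#include <vector>

struct ClusterItem
{
    int64_t totalInstructions; // sum over the cluster's transactions
};

// First-fit decreasing: larger clusters behave exactly like single large
// transactions and are harder to place, which is where the packing
// inefficiency comes from. Clusters that don't fit are left for another stage.
std::vector<int64_t> packClusters(std::vector<ClusterItem> clusters,
                                  size_t numThreads,
                                  int64_t maxInstructionsPerThread)
{
    std::sort(clusters.begin(), clusters.end(),
              [](ClusterItem const& a, ClusterItem const& b) {
                  return a.totalInstructions > b.totalInstructions;
              });
    std::vector<int64_t> load(numThreads, 0);
    for (auto const& c : clusters)
    {
        for (auto& l : load)
        {
            if (l + c.totalInstructions <= maxInstructionsPerThread)
            {
                l += c.totalInstructions;
                break;
            }
        }
    }
    return load; // per-thread instruction totals after packing
}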

dmkozh commented 6 days ago

What I think we could try is reducing the fees for the transactions that didn't participate in any cluster, in case we didn't include a transaction only because it happened to belong to a too-large dependency cluster. That still wouldn't be a per-stage fee though. I'll think more about how exactly that could be implemented, but for the prototype it would really make no difference, so that could wait.