stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org

Parallel Soroban application prototype #4366

Open dmkozh opened 1 week ago

dmkozh commented 1 week ago

This is the tracking/discussion issue for the work on applying the Soroban phase in parallel using multithreading. The purpose of the prototype is to come up with a Core build that can be used for basic benchmarking of parallel application, while allowing most of the work to be reused if the prototype looks good.

The scope of the prototype can be limited to the following:

Notably, no changes to the transaction queue or overlay are necessary initially, even though we might need to revisit these for the actual release. If the prototype works as expected, the work will be presented as a CAP.

The initial implementation plan is as follows:

MonsieurNicolas commented 1 week ago

A couple of questions on the XDR structure.

In

typedef TransactionEnvelope DependentTxCluster<>;
typedef DependentTxCluster TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

Why do we need each thread to have a "DependentTxCluster"? I would expect that layer (consensus and execution) to only see the actual schedule (how we got there from a clustering point of view is an implementation detail). So each stage would be made up of threads that themselves have an array of transactions.

We'd end up with

typedef TransactionEnvelope TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

The second question is: instead of

typedef TransactionEnvelope DependentTxCluster<>;
typedef DependentTxCluster TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

struct ParallelTxsComponent
{
  int64* baseFee;
  ParallelTxExecutionStage executionStages<>;
};

did you consider

typedef TransactionEnvelope DependentTxCluster<>;
typedef DependentTxCluster TxExecutionThread<>;
typedef TxExecutionThread ParallelTxExecutionStage<>;

struct ParallelTxsComponent
{
  int64* baseFee;
  ParallelTxExecutionStage executionStage;
};

and just allow for multiple transaction phases of type "1", "Parallel tx" in a generalized transaction set. That way we won't have to do anything special about base fees that could be different between stages.

dmkozh commented 1 week ago

Why do we need each thread to have a "DependentTxCluster"? I would expect at that layer (consensus and execution), to only see the actual schedule (how we got there from a clustering point of view is an implementation detail).

The idea here is not to expose the implementation details, but to allow for more efficient replay of these tx sets given more than numThreads cores (maybe even during apply, in case this ever has any benefit for validator performance). I realize that this might be debatable as it can be treated as 'premature optimization', but OTOH it would be pretty annoying to go back and re-compute data deps in case we ever want to do that with old tx sets. So I would rather call this 'future-proofing'. But I'm not too settled on this; if you think that an additional few hundred bytes per tx set is not worth the potential catchup speedup, I can get behind that as well.
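
To make the replay idea concrete, here is a minimal sketch (all names and types are hypothetical, not stellar-core APIs): since transactions in different clusters have no data dependencies, a replayer with more cores than numThreads could treat every cluster in a stage as an independent unit of sequential work instead of replaying thread-by-thread.

#include <atomic>
#include <thread>
#include <vector>

struct Tx
{
    // stand-in for a TransactionEnvelope
};
using Cluster = std::vector<Tx>;         // DependentTxCluster
using ExecThread = std::vector<Cluster>; // TxExecutionThread
using Stage = std::vector<ExecThread>;   // ParallelTxExecutionStage

void applyTx(Tx const&) { /* placeholder for the actual apply logic */ }

// Replay a single stage with an arbitrary number of workers: clusters stay
// sequential internally, but each cluster becomes its own work unit rather
// than being pinned to the thread it was nominated in.
void replayStage(Stage const& stage, size_t numWorkers)
{
    std::vector<Cluster const*> units;
    for (auto const& thread : stage)
        for (auto const& cluster : thread)
            units.push_back(&cluster);

    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (size_t i = 0; i < numWorkers; ++i)
        workers.emplace_back([&] {
            for (size_t j = next++; j < units.size(); j = next++)
                for (auto const& tx : *units[j])
                    applyTx(tx); // in-order within a cluster
        });
    for (auto& w : workers)
        w.join();
}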

and just allow for multiple transaction phases of type "1"

I did consider this among the few options we have. While this looks more elegant XDR-wise, it has some issues:

dmkozh commented 6 days ago

but OTOH it would be pretty annoying to go back and re-compute data deps

I've thought a bit more about this and I guess it won't be too hard (specifically, not computationally hard) to infer the clusters before the apply stage if we ever want to. As long as we don't observe TTL changes within a thread, we should be fine. I'll work on removing the clusters, which will simplify the structure a bit.
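
For reference, one way such inference could look (purely illustrative; the footprint representation, key type, and conflict rule are assumptions here, not the actual planned implementation) is a union-find over the per-transaction footprints within a thread, uniting transactions whenever they touch the same ledger key and at least one of them writes it:

#include <cstddef>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical footprint: ledger keys are just strings for illustration.
struct Footprint
{
    std::vector<std::string> readOnly;
    std::vector<std::string> readWrite;
};

struct DisjointSet
{
    std::vector<size_t> parent;
    explicit DisjointSet(size_t n) : parent(n)
    {
        std::iota(parent.begin(), parent.end(), 0);
    }
    size_t find(size_t x)
    {
        return parent[x] == x ? x : parent[x] = find(parent[x]);
    }
    void unite(size_t a, size_t b)
    {
        parent[find(a)] = find(b);
    }
};

// Returns a representative cluster id per transaction: two txs end up in the
// same cluster iff they are transitively connected through read/write or
// write/write conflicts on some key.
std::vector<size_t> inferClusters(std::vector<Footprint> const& txs)
{
    DisjointSet ds(txs.size());
    std::map<std::string, size_t> lastWriter;
    std::map<std::string, std::vector<size_t>> readers;

    for (size_t i = 0; i < txs.size(); ++i)
    {
        for (auto const& k : txs[i].readWrite)
        {
            if (auto it = lastWriter.find(k); it != lastWriter.end())
                ds.unite(i, it->second);
            for (auto r : readers[k])
                ds.unite(i, r);
            lastWriter[k] = i;
        }
        for (auto const& k : txs[i].readOnly)
        {
            if (auto it = lastWriter.find(k); it != lastWriter.end())
                ds.unite(i, it->second);
            readers[k].push_back(i);
        }
    }

    std::vector<size_t> clusterIds(txs.size());
    for (size_t i = 0; i < txs.size(); ++i)
        clusterIds[i] = ds.find(i);
    return clusterIds;
}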

MonsieurNicolas commented 6 days ago

We don't really want to set different base fees per stage

I am pretty sure we will need this fairly rapidly in follow-up versions. As soon as the number of transactions that cause a lot of contention is high, you'll have to place them in a step with low or no parallelization: at that point you'll have to decide between increasing the size of the non-parallel step (excluding a lot of transactions that can execute in parallel) or excluding other anti-parallel transactions. In both cases, you can't just apply the same base fee to parallel and anti-parallel transactions: anti-parallel transactions have to be more expensive in those types of situations than the "loss of opportunity" in parallel ones. Intuitively, I think the base fee for anti-parallel transactions should be such that steps end up with the same total inclusion fee (if they have the same max CPU instructions per thread). On top of that, we may have situations where we'll want CAP-0042-like behavior between groups (because of trading or other high-level scenarios), and steps will be the natural place to make this happen (this is probably further out).
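
To put a toy number on that intuition (just an illustration, not a proposed mechanism): if the parallel and anti-parallel steps should collect the same total inclusion fee, the anti-parallel base fee ends up scaled by how many fewer transactions that step can hold.

#include <cstdint>
#include <iostream>

// Base fee a low-parallelism step would need so that its total inclusion fee
// matches that of a fully parallel step (toy model: uniform base fee per tx).
int64_t matchingBaseFee(int64_t parallelTxCount, int64_t parallelBaseFee,
                        int64_t antiParallelTxCount)
{
    return parallelTxCount * parallelBaseFee / antiParallelTxCount;
}

int main()
{
    // e.g. a parallel step with 8 threads x 50 txs at a base fee of 100,
    // vs. a contended step that only fits 50 txs on a single thread:
    std::cout << matchingBaseFee(8 * 50, 100, 50) << "\n"; // prints 800
    return 0;
}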

dmkozh commented 6 days ago

As soon as the number of transactions that cause a lot of contention is high,

That's not a given; I actually think that it's pretty unlikely that we'll ever have high contention that would also be detrimental to parallelization. Specifically, the transactions have to be transitively conflicting in order to be truly detrimental to parallelism, in which case there is really no good way to pinpoint 'anti-parallel' transactions.

With the current approach, if several transactions want to write the same entry, they will in the worst case take up one thread, split across different stages. That's why I'm suggesting that if we wanted to do more granular surge pricing, we'd need to do that per cluster of dependent transactions, rather than per stage. But I'm really hesitant to design around that until we have good evidence that it would solve a real issue.

It's also pretty hard to evaluate the 'lost opportunity'. There is technically nothing wrong with a single thread being occupied by dependent transactions only; even if they were independent, we still wouldn't be able to run them in parallel if we don't have enough threads. Dependent transactions do decrease the efficiency of utilizing the ledger space, but they do that in a pretty arbitrary fashion depending on overall tx set composition (i.e. they decrease the efficiency of bin packing to some degree).

I think clusters of dependent transactions are very similar to plain large transactions: we will usually have more efficient utilization with smaller transactions than with larger ones. A cluster of dependent transactions can basically be treated as a single large transaction that needs the sum of its instructions, so if the sum is below the per-transaction limit, this is really not any different from just adding a large transaction (and we don't charge proportional inclusion fees ATM). If the sum is above the limit then there is more impact, but it's hard to evaluate, and it's definitely not a step function that becomes non-zero only after reaching the per-tx instruction limit.
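
For illustration (a hypothetical sketch, not the actual tx set building code), the bin-packing view would treat each cluster as one item whose size is the sum of its transactions' instructions and pack items into threads, e.g. first-fit decreasing:

#include <algorithm>
#include <cstdint>
#include <vector>

struct ClusterItem
{
    int64_t totalInstructions; // sum over the cluster's transactions
};

// First-fit decreasing: larger clusters behave exactly like single large
// transactions and are harder to place, which is where the packing
// inefficiency comes from. Clusters that don't fit are left for another stage.
std::vector<int64_t> packClusters(std::vector<ClusterItem> clusters,
                                  size_t numThreads,
                                  int64_t maxInstructionsPerThread)
{
    std::sort(clusters.begin(), clusters.end(),
              [](ClusterItem const& a, ClusterItem const& b) {
                  return a.totalInstructions > b.totalInstructions;
              });
    std::vector<int64_t> load(numThreads, 0);
    for (auto const& c : clusters)
    {
        for (auto& l : load)
        {
            if (l + c.totalInstructions <= maxInstructionsPerThread)
            {
                l += c.totalInstructions;
                break;
            }
        }
    }
    return load; // per-thread instruction totals after packing
}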

dmkozh commented 6 days ago

What I think we could try is reducing the fees for the transactions that didn't participate in any cluster, in case we didn't include a transaction only because it happened to belong to a too-large dependency cluster. That still wouldn't be a per-stage fee though. I'll think more about how exactly that could be implemented, but for the prototype it would really make no difference, so that could wait.