wadagso-gertjaap opened this issue 2 years ago
I started on a few of the possible next steps in this branch: specifically, switching to an in-memory RAFT store, returning only block heights from the RAFT state machine and fetching the blocks from the controller, and moving their publishing to a separate thread.
The difference between these two commits is astonishing. The only difference is that one doesn't actually broadcast the blocks over the network layer, while the other calls m_atomizer_network.broadcast(...).
https://github.com/mit-dci/opencbdc-tx/commit/85a1bf87c613a4d36da914d9d2daeda079e415ec (350k TX/s peak)
https://github.com/mit-dci/opencbdc-tx/commit/55a49ac4ba55afd7438ec9c4ace5cfb6e4fa9dd1 (200k TX/s peak)
This seems to indicate there is some bottleneck in the network stack. I don't understand why enabling this broadcast should slow the whole system down, especially since it runs on a separate thread.
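For context on why the broadcast call might not be free for the caller even with a dedicated sending thread, here is a minimal, hypothetical sketch of a mutex-guarded send queue. This is not the project's actual network code; the class and function names are made up. Even with socket writes on their own thread, the producer still pays for copying each serialized block and for taking the queue lock on every broadcast, and queued buffers accumulate if the sender can't keep up.

```cpp
// Hypothetical, simplified sketch of a send queue shared by the commit path
// (producer) and a dedicated sender thread (consumer). Names are illustrative.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

class send_queue {
  public:
    // Called by the producer (e.g. after a block is committed).
    void broadcast(std::vector<uint8_t> serialized_block) {
        {
            std::lock_guard<std::mutex> lk(m_mut);
            m_queue.push(std::move(serialized_block)); // brief contention with the sender
        }
        m_cv.notify_one();
    }

    // Runs on the dedicated sender thread.
    void run_sender() {
        for(;;) {
            std::unique_lock<std::mutex> lk(m_mut);
            m_cv.wait(lk, [&] { return !m_queue.empty() || m_done; });
            if(m_done && m_queue.empty()) {
                return;
            }
            auto buf = std::move(m_queue.front());
            m_queue.pop();
            lk.unlock();          // write to peers without holding the lock
            write_to_peers(buf);  // placeholder for the actual network write
        }
    }

    void shutdown() {
        {
            std::lock_guard<std::mutex> lk(m_mut);
            m_done = true;
        }
        m_cv.notify_one();
    }

  private:
    static void write_to_peers(const std::vector<uint8_t>& /* buf */) {}

    std::mutex m_mut;
    std::condition_variable m_cv;
    std::queue<std::vector<uint8_t>> m_queue;
    bool m_done{false};
};
```

Under this (assumed) design the socket writes are indeed off the hot path, but the cost of serializing/copying the block and the growth of the queue under load are still borne by the system as a whole, which may be part of what the benchmark is showing.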
Question
In the Phase 1 paper of Project Hamilton we concluded that "The atomizer itself is the limiting factor for the overall system as all transaction notifications have to be routed via the atomizer Raft cluster." Where exactly does the bottleneck of the Atomizer lie, and can it be alleviated?
Benefit
By knowing where the exact bottleneck of the Atomizer lies, we can look at potential solutions to make the Atomizer perform better, or make informed recommendations about the infrastructure on which the Atomizer performs best. For instance: if the problem lies in bandwidth between Atomizer RAFT cluster participants, we can recommend dedicated, high-bandwidth, low-latency network connections between them. If the problem lies in RAFT log storage, we can recommend using solid-state storage. And so on.
Proposed Solution
We made a local benchmarking setup that's isolated in a separate branch. The documentation in that branch describes how to run the benchmark here.
Our initial approach was to start with just logging the event of receiving the transaction, without processing it further. This led to a peak of about 350k TX/s. We then began moving this point of discarding the transaction further along the atomizer's processing path. Transaction processing follows this code path:
1. Transaction notifications are received by cbdc::atomizer::controller::server_handler
2. They are passed to cbdc::atomizer::atomizer_raft::tx_notify, which adds them to m_complete_txs here
3. cbdc::atomizer::atomizer_raft::send_complete_txs will make an aggregate_tx_notify_request containing multiple tx_notify structs and send them to the raft state machine
4. The tx_notify structs are then received and processed by cbdc::atomizer::state_machine::commit
5. cbdc::atomizer::atomizer::insert_complete is called to verify that the attestation is within the STXO Cache Depth (thus not expired) and that none of the inputs are in the STXO cache (spent); a minimal sketch of this check follows the list
6. The transaction is then added to m_complete_txs, here
7. A block is made from the m_complete_txs vector; that happens in cbdc::atomizer::atomizer::make_block here, called from the cbdc::atomizer::state_machine::commit function here
8. The result is passed to cbdc::atomizer::controller::raft_result_handler here, where it's distributed to the watchtowers and shards
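To make step 5 concrete, here is a simplified, hypothetical sketch of that kind of STXO-cache check. The types and names are illustrative, not the actual cbdc::atomizer::atomizer implementation: the attestation must fall within the cache depth of the current best height, and none of the transaction's inputs may already appear in the spent-output cache.

```cpp
// Simplified, hypothetical sketch of the STXO-cache check from step 5.
// Not the actual opencbdc-tx implementation; names and types are illustrative.
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

struct tx {
    std::vector<std::string> inputs;  // identifiers of the outputs being spent
};

class stxo_cache {
  public:
    explicit stxo_cache(uint64_t depth) : m_depth(depth) {}

    // Returns std::nullopt on success, or an error description on failure.
    std::optional<std::string> insert_complete(uint64_t attestation_height,
                                               const tx& t) {
        // 1. The attestation must be within the cache depth of the best height;
        //    otherwise we can no longer prove the inputs were unspent (expired).
        if(attestation_height + m_depth < m_best_height) {
            return "attestation expired";
        }
        // 2. None of the inputs may already be in the spent-output cache.
        for(const auto& in : t.inputs) {
            if(m_spent.count(in) > 0) {
                return "input already spent";
            }
        }
        // Mark the inputs as spent and accept the transaction for the next block.
        for(const auto& in : t.inputs) {
            m_spent.insert(in);
        }
        return std::nullopt;
    }

    void advance_height() { m_best_height++; }

  private:
    uint64_t m_depth;
    uint64_t m_best_height{0};
    // Flattened cache for illustration; the real cache tracks spent outputs per
    // recent block and prunes them as new blocks are made.
    std::unordered_set<std::string> m_spent;
};
```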
We have been moving the point where the transaction gets discarded further down this code path. The point we are currently at is where the block and error data are returned from the RAFT state machine (which is after point 7; what we return from cbdc::atomizer::state_machine::commit is passed to the callback function of point 8).
Starting from the baseline test, commit 9fa80fb, we disabled block transactions and errors being returned from the raft state machine commit (01e578c). This elevates the peak throughput of the Atomizer from 170k TX/s to 250k TX/s, but errors are no longer reported to the watchtowers and blocks only contain the height.
The assumption after this analysis is that there is a bottleneck in either the RAFT serialization (here) or storage (here), or in the callback functions processing the return value (here) and broadcasting it to the shards and watchtowers.
Further analysis is needed to pinpoint the exact problem; then we can work on a solution that resolves it.
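If RAFT serialization is a suspect, one way to begin isolating its cost from the network and storage costs is a standalone timing loop. The sketch below is illustrative only: the payload layout (fixed-size dummy records) is made up and is not the project's wire format.

```cpp
// Hypothetical microbenchmark: how long does it take to flatten a block of N
// dummy transactions into a byte buffer? The "serialization" here is a
// stand-in (appending fixed-size dummy records), not the project's wire format.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t txs_per_block = 100000;
    constexpr std::size_t tx_size = 128;  // rough stand-in for a serialized tx

    std::vector<uint8_t> dummy_tx(tx_size, 0xab);
    std::vector<uint8_t> buf;
    buf.reserve(txs_per_block * tx_size);

    auto start = std::chrono::steady_clock::now();
    for(std::size_t i = 0; i < txs_per_block; i++) {
        buf.insert(buf.end(), dummy_tx.begin(), dummy_tx.end());
    }
    auto end = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                  .count();
    std::cout << "flattened " << txs_per_block << " txs (" << buf.size()
              << " bytes) in " << us << " us\n";
    return 0;
}
```

Comparing numbers like these against the observed per-block budget at 170k-250k TX/s would show whether serialization alone can plausibly account for the gap.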
Possible next steps
Things that can be tried to further pinpoint the issue:
01e578c shows the difference between full blocks and empty blocks being sent over the network: the same number of block objects is still broadcast (with the corresponding locks on the send queue), they just translate to much smaller buffers per object. Given that network traffic happens in a fully separate sending thread, it would be good to understand why this impacts overall system performance (one way to instrument this is sketched below).
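One option for checking whether the broadcast call itself is where the commit path stalls is to wrap it with a scoped timer and aggregate how long the caller is blocked. This is a sketch only; timed_broadcast and the stats counters are made up, not existing opencbdc-tx APIs.

```cpp
// Hypothetical instrumentation: measure how long the calling thread is blocked
// inside a broadcast-like call. The wrapped call and the counters are
// placeholders, not existing opencbdc-tx APIs.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>

// Running totals, printed periodically or at shutdown.
static std::atomic<uint64_t> g_broadcast_ns{0};
static std::atomic<uint64_t> g_broadcast_calls{0};

template <typename F>
void timed_broadcast(F&& do_broadcast) {
    auto start = std::chrono::steady_clock::now();
    do_broadcast();  // e.g. a lambda wrapping the real broadcast call
    auto end = std::chrono::steady_clock::now();
    g_broadcast_ns += static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
            .count());
    g_broadcast_calls++;
}

void report() {
    auto calls = g_broadcast_calls.load();
    if(calls == 0) {
        return;
    }
    std::cout << "broadcast: " << calls << " calls, avg "
              << (g_broadcast_ns.load() / calls) << " ns blocked in caller\n";
}
```

If the caller-side blocking time stays small while throughput still drops when broadcasting is enabled, that would point away from the enqueue itself and toward serialization, memory pressure, or the sender thread competing for shared resources.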
Possible Difficulties
No response
Prior Work
No response