wadagso-gertjaap opened this issue 2 years ago
I started on a few of the possible next steps in this branch: specifically, switching to an in-memory RAFT store, returning only block heights from the RAFT state machine and fetching the blocks from the controller, and moving their publishing to a separate thread.
The difference between these two commits is astonishing. The only difference is that one doesn't actually broadcast the blocks over the network layer, while the other calls m_atomizer_network.broadcast(...).
https://github.com/mit-dci/opencbdc-tx/commit/85a1bf87c613a4d36da914d9d2daeda079e415ec (350k TX/s peak)
https://github.com/mit-dci/opencbdc-tx/commit/55a49ac4ba55afd7438ec9c4ace5cfb6e4fa9dd1 (200k TX/s peak)
This seems to indicate there is some bottleneck in the network stack. I don't understand why enabling this broadcast should slow the whole system down, especially since it runs on a separate thread.
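For context on why the broadcast call might not be free for the caller even with a dedicated sending thread, here is a minimal, hypothetical sketch of a mutex-guarded send queue. This is not the project's actual network code; the class and function names are made up. Even with socket writes on their own thread, the producer still pays for copying each serialized block and for taking the queue lock on every broadcast, and queued buffers accumulate if the sender can't keep up.

```cpp
// Hypothetical, simplified sketch of a send queue shared by the commit path
// (producer) and a dedicated sender thread (consumer). Names are illustrative.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

class send_queue {
  public:
    // Called by the producer (e.g. after a block is committed).
    void broadcast(std::vector<uint8_t> serialized_block) {
        {
            std::lock_guard<std::mutex> lk(m_mut);
            m_queue.push(std::move(serialized_block)); // brief contention with the sender
        }
        m_cv.notify_one();
    }

    // Runs on the dedicated sender thread.
    void run_sender() {
        for(;;) {
            std::unique_lock<std::mutex> lk(m_mut);
            m_cv.wait(lk, [&] { return !m_queue.empty() || m_done; });
            if(m_done && m_queue.empty()) {
                return;
            }
            auto buf = std::move(m_queue.front());
            m_queue.pop();
            lk.unlock();          // write to peers without holding the lock
            write_to_peers(buf);  // placeholder for the actual network write
        }
    }

    void shutdown() {
        {
            std::lock_guard<std::mutex> lk(m_mut);
            m_done = true;
        }
        m_cv.notify_one();
    }

  private:
    static void write_to_peers(const std::vector<uint8_t>& /* buf */) {}

    std::mutex m_mut;
    std::condition_variable m_cv;
    std::queue<std::vector<uint8_t>> m_queue;
    bool m_done{false};
};
```

Under this (assumed) design the socket writes are indeed off the hot path, but the cost of serializing/copying the block and the growth of the queue under load are still borne by the system as a whole, which may be part of what the benchmark is showing.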
Question
In the Phase 1 paper of Project Hamilton we concluded that "The atomizer itself is the limiting factor for the overall system as all transaction notifications have to be routed via the atomizer Raft cluster." Where exactly does the bottleneck of the Atomizer lie, and can it be alleviated?
Benefit
By knowing where the exact bottleneck of the Atomizer lies, we can look at potential solutions to make the Atomizer perform better, or make informed recommendations about the infrastructure on which the Atomizer performs best. For instance: if the problem lies in bandwidth between Atomizer RAFT cluster participants, we can recommend dedicated, high-bandwidth, low-latency network connections between them. If the problem lies in RAFT log storage, we can recommend using solid-state storage. And so on.
Proposed Solution
We made a local benchmarking setup that's isolated in a separate branch. The documentation in that branch describes how to run the benchmark here.
Our initial approach was to start with just logging the event of receiving the transaction, without processing it further. This led to a peak of about 350k TX/s. We then began moving this point of discarding the transaction further along the atomizer's processing path. Transaction processing follows this code path:
1. Transaction notifications are received by cbdc::atomizer::controller::server_handler
2. They are passed to cbdc::atomizer::atomizer_raft::tx_notify, which adds them to m_complete_txs here
3. cbdc::atomizer::atomizer_raft::send_complete_txs will make an aggregate_tx_notify_request containing multiple tx_notify structs and send them to the raft state machine
4. The tx_notify structs are then received and processed by cbdc::atomizer::state_machine::commit
5. cbdc::atomizer::atomizer::insert_complete is called to verify that the attestation is within the STXO Cache Depth (thus not expired) and that none of the inputs are in the STXO cache (spent); a minimal sketch of this check follows the list
6. The transaction is then added to m_complete_txs, here
7. A block is made from the m_complete_txs vector; that happens in cbdc::atomizer::atomizer::make_block here, called from the cbdc::atomizer::state_machine::commit function here
8. The result is passed to cbdc::atomizer::controller::raft_result_handler here, where it's distributed to the watchtowers and shards
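To make step 5 concrete, here is a simplified, hypothetical sketch of that kind of STXO-cache check. The types and names are illustrative, not the actual cbdc::atomizer::atomizer implementation: the attestation must fall within the cache depth of the current best height, and none of the transaction's inputs may already appear in the spent-output cache.

```cpp
// Simplified, hypothetical sketch of the STXO-cache check from step 5.
// Not the actual opencbdc-tx implementation; names and types are illustrative.
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

struct tx {
    std::vector<std::string> inputs;  // identifiers of the outputs being spent
};

class stxo_cache {
  public:
    explicit stxo_cache(uint64_t depth) : m_depth(depth) {}

    // Returns std::nullopt on success, or an error description on failure.
    std::optional<std::string> insert_complete(uint64_t attestation_height,
                                               const tx& t) {
        // 1. The attestation must be within the cache depth of the best height;
        //    otherwise we can no longer prove the inputs were unspent (expired).
        if(attestation_height + m_depth < m_best_height) {
            return "attestation expired";
        }
        // 2. None of the inputs may already be in the spent-output cache.
        for(const auto& in : t.inputs) {
            if(m_spent.count(in) > 0) {
                return "input already spent";
            }
        }
        // Mark the inputs as spent and accept the transaction for the next block.
        for(const auto& in : t.inputs) {
            m_spent.insert(in);
        }
        return std::nullopt;
    }

    void advance_height() { m_best_height++; }

  private:
    uint64_t m_depth;
    uint64_t m_best_height{0};
    // Flattened cache for illustration; the real cache tracks spent outputs per
    // recent block and prunes them as new blocks are made.
    std::unordered_set<std::string> m_spent;
};
```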
We have been moving the point where the transaction gets discarded further down this code path. The point we are currently at is where the block and error data are returned from the RAFT state machine (which is after point 7; what we return from cbdc::atomizer::state_machine::commit is passed to the callback function of point 8).
Starting from the baseline test, commit 9fa80fb, we disabled block transactions and errors being returned from the raft state machine commit (01e578c). This elevates the peak throughput of the Atomizer from 170k TX/s to 250k TX/s, but errors are no longer reported to the watchtowers and blocks only contain the height.
The assumption after this analysis is that there is a bottleneck in either the RAFT serialization (here) or storage (here), or in the callback functions processing the return value (here) and broadcasting it to the shards and watchtowers.
Further analysis is needed to pinpoint the exact problem; then we can work on a solution that resolves it.
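If RAFT serialization is a suspect, one way to begin isolating its cost from the network and storage costs is a standalone timing loop. The sketch below is illustrative only: the payload layout (fixed-size dummy records) is made up and is not the project's wire format.

```cpp
// Hypothetical microbenchmark: how long does it take to flatten a block of N
// dummy transactions into a byte buffer? The "serialization" here is a
// stand-in (appending fixed-size dummy records), not the project's wire format.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t txs_per_block = 100000;
    constexpr std::size_t tx_size = 128;  // rough stand-in for a serialized tx

    std::vector<uint8_t> dummy_tx(tx_size, 0xab);
    std::vector<uint8_t> buf;
    buf.reserve(txs_per_block * tx_size);

    auto start = std::chrono::steady_clock::now();
    for(std::size_t i = 0; i < txs_per_block; i++) {
        buf.insert(buf.end(), dummy_tx.begin(), dummy_tx.end());
    }
    auto end = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                  .count();
    std::cout << "flattened " << txs_per_block << " txs (" << buf.size()
              << " bytes) in " << us << " us\n";
    return 0;
}
```

Comparing numbers like these against the observed per-block budget at 170k-250k TX/s would show whether serialization alone can plausibly account for the gap.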
Possible next steps
Things that can be tried to further pinpoint the issue:
01e578c shows the difference between full blocks and empty blocks being sent over the network: the same number of block objects is still broadcast (with the corresponding locks on the send queue), they just translate to much smaller buffers per object. Given that network traffic happens in a fully separate sending thread, it would be good to understand why this impacts overall system performance (one way to instrument this is sketched below).
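One option for checking whether the broadcast call itself is where the commit path stalls is to wrap it with a scoped timer and aggregate how long the caller is blocked. This is a sketch only; timed_broadcast and the stats counters are made up, not existing opencbdc-tx APIs.

```cpp
// Hypothetical instrumentation: measure how long the calling thread is blocked
// inside a broadcast-like call. The wrapped call and the counters are
// placeholders, not existing opencbdc-tx APIs.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>

// Running totals, printed periodically or at shutdown.
static std::atomic<uint64_t> g_broadcast_ns{0};
static std::atomic<uint64_t> g_broadcast_calls{0};

template <typename F>
void timed_broadcast(F&& do_broadcast) {
    auto start = std::chrono::steady_clock::now();
    do_broadcast();  // e.g. a lambda wrapping the real broadcast call
    auto end = std::chrono::steady_clock::now();
    g_broadcast_ns += static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
            .count());
    g_broadcast_calls++;
}

void report() {
    auto calls = g_broadcast_calls.load();
    if(calls == 0) {
        return;
    }
    std::cout << "broadcast: " << calls << " calls, avg "
              << (g_broadcast_ns.load() / calls) << " ns blocked in caller\n";
}
```

If the caller-side blocking time stays small while throughput still drops when broadcasting is enabled, that would point away from the enqueue itself and toward serialization, memory pressure, or the sender thread competing for shared resources.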
Possible Difficulties
No response
Prior Work
No response