near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Idea: could we not block chunk apply pipeline on committing the chunk application results to storage? #11118

Open nagisa opened 5 months ago

nagisa commented 5 months ago

Today the transaction runtime is roughly a pipeline that executes its tasks in a certain sequence. Among other things, when applying a chunk it will first execute the transactions and receipts, and then commit all the state changes to the storage before moving on to the next chunk.

In practice, however, we don't necessarily need to block processing and handling of the next block on the state commits for the previous block. Technically we could keep a double-buffering-like scheme for TrieUpdates (TrieUpdates accumulate the state changes in a transaction), where the storage accesses for the current chunk go through TrieUpdate (current_block) => TrieUpdate (previous_block) => Store rather than TrieUpdate (current_block) => Store as they do currently.

That gives us an opportunity to commit TrieUpdate (previous_block) into the Store on a background thread while the transactions for the current_block are being executed. The storage reads still go through TrieUpdate (previous_block), ensuring a consistent view of the storage. Every time processing of a block begins, we rebuild the storage stack to add a new TrieUpdate (next_block) on top of TrieUpdate (current_block) and remove the now-committed-off-thread TrieUpdate (previous_block) (we'd have to block here if the commit is not over yet).


Doing this may make individual reads and writes slightly more expensive, but it will remove a sometimes 200+ms commit from the critical execution path.
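
To make the idea a bit more concrete, here is a minimal sketch of the layering. None of the types below are the real nearcore ones; `Store`, `Layer`, and `LayeredView` are hypothetical, heavily simplified stand-ins for the Store/TrieUpdate stack, just to show how reads could fall through two buffered layers while the previous block's changes are committed off-thread:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Simplified stand-in for the persistent Store (e.g. the RocksDB-backed store).
#[derive(Default)]
struct Store {
    data: Mutex<HashMap<Vec<u8>, Vec<u8>>>,
}

impl Store {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.data.lock().unwrap().get(key).cloned()
    }
    // Stands in for the expensive write-batch commit.
    fn commit(&self, changes: HashMap<Vec<u8>, Vec<u8>>) {
        self.data.lock().unwrap().extend(changes);
    }
}

// One buffered layer of state changes (roughly a TrieUpdate's overlay).
#[derive(Default)]
struct Layer {
    changes: HashMap<Vec<u8>, Vec<u8>>,
}

// Double-buffered view: reads fall through current -> previous -> Store,
// while the previous block's layer is committed on a background thread.
struct LayeredView {
    store: Arc<Store>,
    previous: Arc<Layer>, // layer being committed off-thread
    pending_commit: Option<thread::JoinHandle<()>>,
    current: Layer, // layer for the block in flight
}

impl LayeredView {
    fn new(store: Arc<Store>) -> Self {
        LayeredView {
            store,
            previous: Arc::new(Layer::default()),
            pending_commit: None,
            current: Layer::default(),
        }
    }

    // Reads see: current block's writes, then the previous block's, then the Store.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.current
            .changes
            .get(key)
            .or_else(|| self.previous.changes.get(key))
            .cloned()
            .or_else(|| self.store.get(key))
    }

    fn set(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.current.changes.insert(key, value);
    }

    // Called when processing of the next block begins: wait for the old
    // background commit only if it has not finished yet, then start
    // committing the just-finished layer off-thread and open a fresh layer.
    fn start_next_block(&mut self) {
        if let Some(handle) = self.pending_commit.take() {
            handle.join().expect("background commit panicked");
        }
        let finished = Arc::new(std::mem::take(&mut self.current));
        self.previous = Arc::clone(&finished);
        let store = Arc::clone(&self.store);
        self.pending_commit = Some(thread::spawn(move || {
            // A real implementation would avoid this clone.
            store.commit(finished.changes.clone());
        }));
    }
}
```

The `join` in `start_next_block` is the only place where the main pipeline would block, and only when the previous commit has not finished by the time the next block arrives.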

nagisa commented 5 months ago

Q: is this compatible with memtrie?

bowenwang1996 commented 5 months ago

@nagisa when you say "commit changes", what exactly do you refer to? There are two levels of commitment here. Conceptually, after each chunk is applied, a new state root is computed as a result of the chunk application and the chunk producer "commits" to the new state root. This is needed for stateless validation so that validators can validate the chunk with a state witness against the new state root. However, there is also a commit to the database at the implementation level, which indeed does not need to be blocking. If I understand correctly, you are referring to StoreUpdate::commit here, which may take a relatively long time. I agree that it could be done in the background without blocking the progress of processing new chunks.

nagisa commented 5 months ago

when you say "commit changes", what exactly do you refer to?

The functions called commit. I know this answer is still somewhat handwavey, since we do have StoreUpdate::commit and ChainUpdate::commit, but largely I'm concerned about the Store I/O and nothing else in this issue (for it is the most expensive part by far.)

Conceptually, after each chunk is applied, a new state root is computed as a result of the chunk application and the chunk producer "commit" to the new state root. This is needed for stateless validation so that validators can validate chunk with state witness against the new state root.

Yeah, I believe we compute a new state root as a prerequisite step before a transaction is committed to the store (trie node hashes need to be known to write them out...) It wouldn't be particularly terrible to keep the computation of the new state root hash in the main pipeline, but that raises another thought -- can we move to an overall more asynchronous design?

As in, we currently have N steps that we execute in sequence, but we could for example have conceptual Futures[^1] that resolve when the chunk is applied; when a transaction is applied; when a new state root has been computed; when a store update has been committed; and so on.

That sort of restructuring would allow stateless validation (and other components too) to wait on a specific checkpoint of the computation without entangling itself with the implementation details of the transaction runtime. And by enabling such separation of concerns, we'd have fewer constraints on how we implement the transaction runtime.

[^1]: I'm not saying that we ought to use Futures specifically; we could also use signals, events, callbacks or some other mechanism to inform interested parties of what's happening within the transaction runtime.
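
For illustration only, here is a rough sketch of what such checkpoint notifications could look like, using plain channels instead of Futures; every name here (`RuntimeEvent`, `EventBus`, the specific checkpoints) is made up and not an existing nearcore API:

```rust
use std::sync::mpsc;

// Hypothetical checkpoint events; the real stages and payloads would differ.
#[derive(Debug, Clone)]
enum RuntimeEvent {
    ChunkApplied { shard_id: u64 },
    StateRootComputed { shard_id: u64, state_root: [u8; 32] },
    StoreCommitted { shard_id: u64 },
}

// Toy publisher: interested components (e.g. stateless validation) subscribe
// to the checkpoints they care about instead of being called from inside the
// transaction runtime.
struct EventBus {
    subscribers: Vec<mpsc::Sender<RuntimeEvent>>,
}

impl EventBus {
    fn new() -> Self {
        EventBus { subscribers: Vec::new() }
    }

    fn subscribe(&mut self) -> mpsc::Receiver<RuntimeEvent> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.push(tx);
        rx
    }

    fn publish(&self, event: RuntimeEvent) {
        for tx in &self.subscribers {
            // Ignore subscribers that have gone away.
            let _ = tx.send(event.clone());
        }
    }
}

fn main() {
    let mut bus = EventBus::new();
    let rx = bus.subscribe();

    // The runtime would publish these as each stage completes, possibly from
    // different threads; here we just publish them inline.
    bus.publish(RuntimeEvent::ChunkApplied { shard_id: 2 });
    bus.publish(RuntimeEvent::StateRootComputed { shard_id: 2, state_root: [0u8; 32] });
    bus.publish(RuntimeEvent::StoreCommitted { shard_id: 2 });

    // A consumer waits only for the checkpoint it needs.
    for event in rx.try_iter() {
        if let RuntimeEvent::StateRootComputed { shard_id, .. } = event {
            println!("state root ready for shard {shard_id}, can build the state witness");
        }
    }
}
```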

Ekleog-NEAR commented 2 months ago

cc https://github.com/near/nearcore/issues/11808

Ekleog-NEAR commented 1 month ago

On testnet, over a 1hr30min run, I could only see a single ChainStoreUpdate::commit span that took more than 20ms.

My guess is that the high values we could see here previously were due to lock contention with reads, that disappeared with recent updates.

However, this is AFAICT on a node with memtrie disabled:

  "load_mem_tries_for_shards": [],
  "load_mem_tries_for_tracked_shards": false,

This last point is a bit weird to me, as memtrie would be the thing I'd have expected to reduce the read load.

@nagisa Do you remember where/how you found the 200ms+ commit you mentioned in the first post?

On my end I'll likely try to reproduce the experiment on mainnet tomorrow, now that I have a simple process to get that number out, and consider the next steps then.

nagisa commented 1 month ago

It was a routine occurrence on mainnet, specifically on shards 2 and 3 during higher load (i.e. when chunks were stuffed full of receipts).

The commit time was directly proportional to the amount of changes to storage.


Ekleog-NEAR commented 1 month ago

The results of the 1.5hrs experiment on mainnet with memtrie disabled are roughly:

Soon to come: the same experiment with memtrie enabled, but I'll have to spin up a VM with lots of memory first; an n2d-standard-8 is not enough to load memtrie.

Ekleog-NEAR commented 1 month ago

After ~4.5hrs of mainnet with memtrie enabled (it took longer to get noticeably high results; for some reason rerunning the grafana query returned no new results for a while, but after 4hrs older results appeared :shrug:), I got the following results:

So it seems this project is even more important now that memtrie has landed than it was before. Probably because there was no lock contention between reads and writes to begin with, and it's just that writes take a long time; now that reads are faster, the write time is more noticeable.

Disclaimer: this is based on the slowest-reported traces of the two functions, so it's not necessarily going to be a 25% overall perf improvement. But the p99 should improve by ~25%, which could hopefully allow us to lower gas costs.


With this in mind, I'll now focus on actually moving the commit to a background thread.
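
As a first cut, the commit could be handed off to a dedicated writer thread over a channel, which keeps commits ordered while taking them off the critical path. A minimal sketch, with hypothetical names rather than the actual nearcore types:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical write-batch type; in nearcore this would be the data behind
// a StoreUpdate / ChainStoreUpdate.
struct WriteBatch {
    block_height: u64,
}

// Spawn a dedicated commit thread. Chunk application sends finished write
// batches down the channel and immediately moves on to the next chunk; the
// thread applies them to storage in order, preserving commit ordering.
fn spawn_commit_thread(
    mut apply_to_storage: impl FnMut(WriteBatch) + Send + 'static,
) -> (mpsc::Sender<WriteBatch>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<WriteBatch>();
    let handle = thread::spawn(move || {
        for batch in rx {
            apply_to_storage(batch);
        }
    });
    (tx, handle)
}

fn main() {
    let (tx, handle) = spawn_commit_thread(|batch| {
        // Placeholder for the expensive database write.
        println!("committing changes for block {}", batch.block_height);
    });

    for height in 0..3 {
        // ... apply the chunk for `height` here ...
        tx.send(WriteBatch { block_height: height }).unwrap();
        // The critical path continues without waiting for the commit.
    }

    drop(tx); // close the channel so the commit thread drains and exits
    handle.join().unwrap();
}
```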