Lederstrumpf closed this issue 1 year ago
> - version the API, and then either fail gracefully on a mismatch or retain older versions
This is definitely the right way to do it. In this case we consciously decided to not do it since MMR pallet was not yet stable and we didn't want to carry around old versions of the API which were never used in production.
Since we've made multiple breaking changes to it lately, I am definitely leaning towards resetting the pallet once these changes are all deployed and thus, start fresh on a clean & stable MMR.
With the new MMR client gadget I'm not sure we can reset the pallet (if resetting the pallet means setting `NUM_MMR_LEAVES` to 0). The MMR gadget will take the first block with an MMR and compute leaf numbers based on it. If we later set `NUM_MMR_LEAVES` to 0, I think we'll get canonicalization and pruning errors.
We should then coordinate to reset the pallet in the same runtime upgrade that brings in the new API that the gadget relies on.
The new API was already released. As discussed offline, we need to see if we can support resets in the gadget.
The idea I'm thinking about in order to make the MMR client gadget support pallet resets is to change the way we save MMR nodes in the offchain DB. Currently we save them node by node, keyed by their position in the MMR. The option I'm working on is to save each subtree keyed by the block that introduced it. For example:
For the tree:

```
  2
 / \
1   3
```

we are now saving the data for each node at the following keys:

```
1_[genesis_hash]
2_[block_1_hash]
3_[block_1_hash]
```
I'm trying to save:

```
1_[genesis_hash] -> the subtree containing node 1 (the subtree added by block 1)
2_[block_1_hash] -> the subtree containing nodes 2 and 3 (the subtree added by block 2)
```
And then these get canonicalized by the client gadget.
If we store subtrees by the block number that introduced them, we no longer care about the position of the nodes in the MMR, so when the pallet resets, we can continue to canonicalize by block number. The client gadget won't be impacted by the fact that the pallet got reset at block x.
However the disadvantage would be that when we need to retrieve a node in order to generate a proof, we would retrieve the entire subtree introduced by a block, which could be less efficient.
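The two key schemes can be sketched as follows (a minimal illustration; the function names and exact byte layout are assumptions, not the actual pallet-mmr encoding):

```rust
// Hypothetical key builders for the two offchain-db layouts discussed above.
// The names and byte layout are illustrative, not pallet-mmr's real encoding.

/// Current scheme: one offchain-db entry per MMR node, keyed by the node's
/// position plus a block hash (for fork awareness).
fn node_key(prefix: &[u8], node_pos: u64, block_hash: &[u8; 32]) -> Vec<u8> {
    let mut key = prefix.to_vec();
    key.extend_from_slice(&node_pos.to_le_bytes());
    key.extend_from_slice(block_hash);
    key
}

/// Proposed scheme: one entry per block, keyed by the block number that
/// introduced the subtree; the value would hold *all* nodes appended by
/// that block. Positions no longer appear in the key, so a pallet reset
/// doesn't invalidate the keying.
fn subtree_key(prefix: &[u8], block_num: u32, block_hash: &[u8; 32]) -> Vec<u8> {
    let mut key = prefix.to_vec();
    key.extend_from_slice(&block_num.to_le_bytes());
    key.extend_from_slice(block_hash);
    key
}
```

Since only the block number and hash enter the second key, canonicalization can proceed purely by block number across a reset.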
In the meantime I'm still thinking about whether there could be other simpler or more efficient solutions. The easiest solution would probably be to call `mmr_leaf_count()` from the gadget at every finality notification, but I don't think it would be very efficient.
> I'm trying to save:
> 1_[genesis_hash] -> the subtree containing node 1 (the subtree added by block 1)
> 2_[block_1_hash] -> the subtree containing nodes 2 and 3 (the subtree added by block 2)

> However the disadvantage would be that when we need to retrieve a node in order to generate a proof, we would retrieve the entire subtree introduced by a block, which could be less efficient.
Efficiency is much more important for on-chain activity than off-chain activity. Making the above change would add more complexity to the process of "adding a node to the MMR", which happens multiple times per block during block execution. Currently, each node is directly added to offchain db (single entry per node); in your proposal (entry is a vec of nodes per block), we'd either have to:

1. read, update and rewrite the per-block vec in offchain db on every node added, or
2. collect the nodes onchain and, in `mmr.finalize()`, move the vec from runtime storage to offchain db in a single read-onchain + write-offchain + kill-onchain-temp-item.

Because of the onchain db cache, I think the second option is actually faster than what we currently have, as it does everything in memory and hopefully doesn't even touch the underlying onchain storage db, since by the end of the process nothing ultimately changes in onchain storage; and for offchain db we get a single write instead of one per node.
If offchain db also had(has?) cache, then what we have now is the most efficient.
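The second option's write pattern can be sketched roughly as below; `OffchainDb` and `MmrNodesBuffer` are hypothetical stand-ins for the real onchain/offchain storage APIs, and the point is the single offchain write per block, not the API shape:

```rust
use std::collections::HashMap;

// Hypothetical mock of the offchain database; `writes` counts offchain
// writes so the two options can be compared.
#[derive(Default)]
struct OffchainDb {
    entries: HashMap<Vec<u8>, Vec<Vec<u8>>>,
    writes: usize,
}

// Stands in for a temporary onchain storage item. Thanks to the onchain db
// cache, this can stay in memory for the duration of block execution.
#[derive(Default)]
struct MmrNodesBuffer {
    nodes: Vec<Vec<u8>>,
}

impl MmrNodesBuffer {
    /// Called once per node appended during block execution; no offchain I/O.
    fn push(&mut self, node: Vec<u8>) {
        self.nodes.push(node);
    }

    /// Called once from `mmr.finalize()`: a single offchain write for the
    /// whole block, then the temp item is killed (here: the buffer drained).
    fn finalize(&mut self, db: &mut OffchainDb, key: Vec<u8>) {
        let nodes = std::mem::take(&mut self.nodes);
        db.entries.insert(key, nodes);
        db.writes += 1;
    }
}
```

However many nodes a block appends, offchain storage sees exactly one write, versus one per node today.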
I would actually not change anything here, but simply add a new pallet-mmr runtime API `pallet_genesis_block_num()` and use that.
> The easiest solution would probably be to call `mmr_leaf_count()` from the gadget at every finality notification, but I don't think it would be very efficient.
Or this ^ - which is equivalent when we check that the count only ever goes up. Efficiency-wise, it's no problem for the gadget to call a runtime API per finality notification, or even once per block.
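The "count only ever goes up" check is trivial to express; `pallet_was_reset` is a hypothetical helper name, not an existing gadget function:

```rust
/// Given the previously observed leaf count and the one just fetched via the
/// `mmr_leaf_count()` runtime API at a finality notification, report whether
/// the pallet was reset: absent a reset, the leaf count is monotonically
/// non-decreasing.
fn pallet_was_reset(prev_leaves: u64, current_leaves: u64) -> bool {
    current_leaves < prev_leaves
}
```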
> Because of onchain db cache, I think the second option is actually faster than what we currently have as it does everything in memory, and hopefully doesn't even touch the underlying onchain storage db since by the end of the process, nothing ultimately changes to onchain storage; and for offchain db we get a single write instead of one per node.
> If offchain db also had(has?) cache, then what we have now is the most efficient.
Yes, I'm experimenting with something close to the second option. I did some tests with a PoC version and indeed this is more efficient than what we have now. Also, unexpectedly it's faster when generating proofs. I guess it's because the DB probably does cache the subtrees when first accessed. There are 2 problems however: `MMRStore::append()` always provides an entire subtree. This is true now, but we depend on the underlying MMR library. With future versions of the library this may change, and if that happens we would need to do what you suggested (adding some temp onchain storage as well), which would add some complexity but should still remain more efficient than what we have now.

On the other hand, this should simplify the client gadget a bit, because we wouldn't need to know the first MMR block and compute leaf/node indexes.
> I would actually not change anything here, but simply add a new pallet-mmr runtime API `pallet_genesis_block_num()` and use that.
Yes, this is actually easier to use than `mmr_leaf_count()` if we keep the current overall approach.
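For illustration, here is how the gadget could map block numbers to leaf indices once it knows the pallet's starting block (assuming one leaf per block from that point on; `leaf_index` and its parameters are hypothetical names, not the actual gadget API):

```rust
/// Map a block number to its MMR leaf index, given the block at which the
/// pallet started (or restarted), e.g. as returned by a hypothetical
/// `pallet_genesis_block_num()` runtime API. Blocks before that point
/// have no leaf, hence the Option.
fn leaf_index(pallet_genesis_block: u32, block_num: u32) -> Option<u64> {
    block_num.checked_sub(pallet_genesis_block).map(u64::from)
}
```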
> I did some tests with a PoC version and indeed this is more efficient than what we have now.
"more efficient" on what ops, based on what tests/benchmarks?
> Also, unexpectedly it's faster when generating proofs. I guess it's because the DB probably does cache the subtrees when first accessed.
Yup, you're getting hot cache hits. Just as a note, generating proofs is not "in the hot path", we shouldn't make design decisions for small optimizations there.
Will this approach be compatible with MMBs in the future? Are they structured in such a way that we could also save subtrees for each block?
I suggest we take the easy route now (use `mmr_leaf_count()` or `pallet_genesis_block_num()`), and optimize it when we bring in MMBs, since like you say optimizations here might not be compatible with MMBs later anyway.
> "more efficient" on what ops, based on what tests/benchmarks?
The time needed to generate 10k-100k blocks
> I suggest we take the easy route now (use `mmr_leaf_count()` or `pallet_genesis_block_num()`), and optimize it when we bring in MMBs, since like you say optimizations here might not be compatible with MMBs later anyway.
Agree. I'll try the `pallet_genesis_block_num()` approach. And we can revisit this when switching to MMBs.
I believe this is fixed now
We've been seeing `mmr_generateProof` call failures on rococo in the last weeks.

1. InconsistentStore

The first source of these errors is `mmr_lib::Error::InconsistentStore`. This issue is due to the offchain worker not being triggered on every block.
Solution
This can be mitigated by firing notifications on the initial block sync too, as suggested by @serban300 and done here: https://github.com/Lederstrumpf/substrate/commit/a571c512bb49b6cb469de8fd65d8c9a45e1a0ac4. However, this should already be solved by #12753 - will test once runtime 9330 is active on rococo.
2. CodecError
The other issue is `codec::Error`. The source of these is that the onchain runtime version's API may mismatch what the client was built with. As such, the failures we've seen changed based on which runtime version was current (unless historic chain state is queried). For example, on a v0.9.31 or v0.9.32 release node, `mmr_generateProof` works for blocks with onchain runtime version 9321 or 9310, that is from block 2744973 onwards, but fails with the codec error for earlier blocks. In this case, we changed the API for runtime 9310 to use block numbers in lieu of leaf indices with PR #12345, so moving from u64 → u32 leads to the codec error thrown at https://github.com/paritytech/substrate/blob/b0e994c96cb8f5a178bd8b649d6ec855f225a94c/frame/merkle-mountain-range/rpc/src/lib.rs#L188-L195. Confirmed that this is the root of the codec error, since changing back to leaf inputs in the proof generation API atop v0.9.31 resolves this: https://github.com/Lederstrumpf/substrate/commit/72cdcadb3a65db61c1088d943a232785c1e81d53

Solution
Keeping the MMR API interface stable would avoid this. If we change the API interface, the issue will disappear once the runtime aligned with the changes on the API server becomes active. It will then still fail for historic queries, for which I see the following options:
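As background on the codec error described under 2.: SCALE encodes fixed-width integers as little-endian bytes (u64 → 8 bytes, u32 → 4 bytes), so a client and runtime that disagree on the parameter type read different byte counts. A std-only sketch of that mismatch (mirroring SCALE's fixed-width layout, not using parity-scale-codec itself):

```rust
// SCALE fixed-width integers are plain little-endian bytes.
fn encode_u64(x: u64) -> Vec<u8> {
    x.to_le_bytes().to_vec()
}

/// Decode a u32 the way a strict SCALE decoder would: consume exactly
/// 4 bytes. Returns the value and the number of leftover bytes; a strict
/// decoder treats leftover bytes (or too few bytes) as a codec error.
fn decode_u32(bytes: &[u8]) -> Option<(u32, usize)> {
    let head: [u8; 4] = bytes.get(..4)?.try_into().ok()?;
    Some((u32::from_le_bytes(head), bytes.len() - 4))
}
```

A client built against the old u64 API sends 8 bytes; a runtime expecting a u32 block number consumes only 4 and is left with trailing bytes, while the reverse direction fails outright for lack of bytes - either way, a `codec::Error`.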