[stateless_validation][witness size soft limit] Count nodes needed for key updates in state witness size limit

Longarithm commented 6 months ago

When we change state KV pairs, it happens on e.g. TrieUpdate::set which doesn't access nodes on its own. Nodes required to update trie are read in the end, on TrieUpdate::finalize. While Runtime::apply produces correct state witness, its "online" size is not computed correctly because of that.

One way to compute better size is to always call Trie::get on TrieUpdate::set.

Note that TrieUpdate::finalize still may access new nodes due to branch restructuring. But I think the impact of that it small, only up to 2 nodes per each key removal.

Discussion https://near.zulipchat.com/#narrow/stream/295558-core/topic/Trie.3A.3Aget.20on.20TrieUpdate.3A.3Aset/near/428163849

jancionear commented 6 months ago

To expand a bit on the problem:

When the runtime reads or writes a value in the trie, it calls storage_read/storage_write. This function performs the operation on the TrieUpdate struct and additionally calculates TTN (Touched Trie Nodes) to charge gas for each trie node that was touched. With stateless validation there's one more thing that needs to be done - we need to record all nodes from the pre-state trie that are touched during chunk application to construct the storage proof. So ideally on every read or write we would walk down the Trie, perform the operation, calculate TTN and record all the touched nodes from the original Trie. After every read or write we would charge gas for the TTN and see how much storage proof was recorded to limit state witness size. Any receipt that uses too much gas or produces too large storage proof would be rejected. But this isn't what actually happens in the code - when the runtime performs a write using TrieUpdate, it doesn't actually perform the operations on the Trie. It maintains a hashmap from trie keys to key values. Each write adds an entry to this hashmap. Each following read reads from this hashmap, and only reads the Trie if a key isn't available in the hashmap. The Trie itself is only modified during finalize(), which happens at the end of chunk application, once all receipts in the chunk have already been executed. For reads this isn't that big of a deal - every read that has to be recorded won't be available in the hashmap (which contains only changes written in this chunk), so it'll naturally go through the trie, and all nodes required for the read can be recorded there. With writes the situation is worse - a write can be performed as a single hashmap operation, without ever touching the original trie. To deal with this the runtime performs an additional read for every value that is written (it's done here). Originally this was done to calculate the number of TTN and charge gas accordingly, but now it also serves the purpose of recording the nodes touched during the write. Flat storage reads do a similar thing - they perform an additional Trie read to record the nodes (which makes flat storage pretty useless x.x).

This sounds like a workable solution - we record all nodes on the path to all read and written values, but there's one corner case that isn't covered by this. When there's a branch node with two children and we delete one of the children, the branch node should be merged with the the surviving child, as it doesn't make sense to have a branch node with only one child. In such case we will record the child that is deleted (as we will do a read before writing to it), but we won't record the other child, even though recording the other child is necessary for performing the merge.

This problem was addressed in #10841 by recording all nodes touched during finalize(). finalize() actually performs the operations on the trie, so it will touch the other child while merging it. It fixes the problem of nodes missing in the storage proof, but it doesn't help the runtime, as the recording happens during finalize(), after all receipts have already been executed.

This means that runtime is not aware of those extra "surviving child" nodes, and as a result it's unable to charge gas or enforce storage proof size limit for them.

There are two main questions here: 1) What is the impact of this? Could a malicious contract abuse this problem to bypass the state witness size limit and generate a storage proof so big that it crashes the system? 2) How to fix it?

jancionear commented 6 months ago

Impact

What's the worst that could happen?

A malicious actor could prepare a state which allows to execute the corner case as many times as possible. They could create a bunch of branch nodes with two children - one very small, and the other as big as possible. Then in another contract call they could delete all of the small children without touching the big ones. This would allow them to put the big child in the storage proof while paying only for the branch node and the small child. How bad would that be? I need to read into the exact serialization code, but for now my estimates would be:

Branch node with 2 children: (32 bytes for node hash) + (233 bytes for children hashes) + (14 1 byte for 14 None children) + (1 byte for None value) = 113 bytes
Small leaf node = (32 bytes for node hash) + (1 byte for path) + (32 bytes for value hash) = 65 bytes
Big leaf node = (32 bytes for node hash) + (2000 bytes for path) + (32 byes for value hash) = 2064 bytes

The cost to record the big node (2064 bytes) would be (113 + 65 = 178 bytes). This means that a malicious actor could record 12 times more state than what's accounted for in the storage proof size limit! ( (2064+178)/178 = 12.5 ).

That's pretty bad :/ It means that a malicious actor could grow the size of ChunkStateWitness about 10 times, from 8 MB to 80 MB. There are already problems with distributing an 8 MB witness, so an 80 MB one could fail altogether.

IMO it would be good to fix it. Although with all the improvements in network distribution it might turn out that it isn't all that necessary, we'll see how it goes. Still it's good to consider what we could do to fix it. I spent some time thinking about various ideas, discussing with Robin, and I'll write them below in separate comments.

jancionear commented 6 months ago

Idea 1 - Charge extra to account for the unrecorded children

We could just assume that every time we delete something there's also a big malicious child that'll be recorded and charge 2000 bytes extra. The big problem with that is that it lowers chunk capacity - charging 10x more for an operation means that the number of receipts that fit in a chunk will be a few times lower. There are already problems with congestion, so limiting throughput sounds like a bad idea.

We could add some clever heuristics (only charge once per branch, count children, trace deletions, etc...). That's something to think about, but it'd be an imperfect, complex solution. I'm not a big fan of this idea.

jancionear commented 6 months ago

Idea 2 - Immediately perform the operations on a real trie

A big reason why we needed TrieUpdate is that the Trie used to live on disk, which is very slow to read and write from. Keeping the changes in memory allows for faster access and writing the changes to disk in one big batch during finalize(). But that's in the past now, we've transitioned to in-memory tries, so we could apply the operations directly to this in-memory trie instead of using the hashmap in TrieUpdate. On every read/write we would immediately perform this operation on the memtrie, and immediately record all nodes that are touched during the operation. It'd give the runtime precise information about how much storage proof was generated by each operation.

This solution would be very clean - no strange tricks, we just apply the trie changes like we said we do. A downside is that it could affect performance. Reading and writing to a hashmap is likely faster than jumping between memtrie nodes. Measuring the performance impact would require some benchmarks.

jancionear commented 6 months ago

Idea 3 - Finalize after every receipt

Currently we finalize after every chunk, but we could do it after every receipt instead. All operations during the receipt's execution would be performed on TrieUpdate just like before, but once the receipt is done we would go and apply the update on a real trie (memtrie). After finalizing we know exactly how much storage proof was generated by this receipt, and we can fail it if it went over the hard per-receipt limit. It'd allow to keep the performance benefits of TrieUpdate while still providing precise information about storage proof size.

Postponing node recording could have additional performance benefits - recording the node on every node access means that we have to perform extra work during every access - maybe serialize the node, maybe hash it. This work is repeated every time we access this node, which is wasteful. Recording only during finalize could allow to record every node exactly once, which could be more performant.

jancionear commented 6 months ago

Idea 4 - Modify how the trie works

What if we didn't merge the branch during deletion? We could leave this work to the next person that tries to read the value from the surviving child. With such a modification it'd be enough to record nodes that are on the path to the value that is read/written.

We'd need to think about the implications of such a fundamental change.

Is it okay to force another person to do a lot of work?
- it might be ok, we already do this when we force someone to merge/record the big child
Is it okay to have multiple representations of the same trie?
- For correctness the trie changes have to be deterministic, but the representation might not need to be unique.
- There could be some trouble with memtrie loading/saving
???

It's a very interesting idea, but it'd change how a fundamental piece of the system works, so it could be a lot of work. OTOH it'd prepare us for Verkle trees later ;)

Longarithm commented 6 months ago

It makes sense, thank you very much for the overview! Which one of these ideas would you suggest? I'm still for (1) because:

Users rarely delete data from blockchain. So deletion will unlikely lower throughput. If we want to be careful, we can return back from Trie a boolean flag like "did the latest branch have only 2 children?" and charge extra only in this case, where restructuring would be needed. We can also measure how many storage_remove ops are called on mainnet.
12x increase would happen only in the edge case when each deletion reads exactly one small node and cause read of one extra big node. However, if random deletion causes ~5 new (branch) node reads and one extra big read, then overestimate is 113 * 5 = 565 -> 565 + 2000 = 2565 which is 5x increase. Still bad but not that much.

In the longer term, anything from (2) to (4) looks feasible. Perhaps (4) is simpler and (3) is harder to implement, because it requires to serve state reads from MemTrieUpdate which would work but it wasn't designed with that usecase in mind.

jancionear commented 6 months ago

I'm still for (1) because: Users rarely delete data from blockchain. So deletion will unlikely lower throughput. If we want to be careful, we can return back from Trie a boolean flag like "did the latest branch have only 2 children?" and charge extra only in this case, where restructuring would be needed. We can also measure how many storage_remove ops are called on mainnet.

Hmm that's a good point, if it's true that contracts rarely delete things then charging extra wouldn't hurt normal users that much, while still protecting us from malicious actors. I dislike (1) because it's an unclean workaround that'll add even more complexity to Trie logic, which is already super complicated, but maybe it's the fastest way to go, even if it adds some tech debt. I'll try to gather some data to see how things are.

Which one of these ideas would you suggest?

I really like (2), because it's a clean simple solution, but it might be a big effort to get it done, make sure that there are no performance problems, etc. (3) sounds more feasible to me, while still being a proper solution. I need to dive into the code and figure out what's feasible.

requires to serve state reads from MemTrieUpdate which would work but it wasn't designed with that usecase in mind

We could still serve reads from TrieUpdate, but we'd flush the commited updates to MemTrieChanges after every receipt (???? idk)

Longarithm commented 6 months ago

that'll add even more complexity to Trie logic

Just to clarify, adding 2000 to the size estimation on each storage_remove doesn't look like too much complexity, and it doesn't have to be inside Trie. Or you mean additional logic with "checking if there were 2 children"?

jancionear commented 6 months ago

Just to clarify, adding 2000 to the size estimation on each storage_remove doesn't look like too much complexity, and it doesn't have to be inside Trie. Or you mean additional logic with "checking if there were 2 children"?

Eh IMO it counts as complex, it's one more thing that you need to remember about and be cautious to do properly. It's very nonobvious, and it'll be easy to forget about charging the extra gas somewhere. People who will work with this code in the future won't be aware of it, it's a trap set for them. It's not the end of the world, but I'd prefer a solution which is intuitive, without such gotchas.

Longarithm commented 6 months ago

Ah I see. Actually, I don't think we should necessarily charge more gas. I just mean maintaining u64 variable which estimates state witness size, which is usually size(recorded_storage), and say every time when storage_remove is invoked, we add 2000 there. We also shouldn't worry about undercharging users there. They already have to pay for storage_remove. This operation, as each storage operation in contract, has a base cost, which can account for that. And also we charge using "Gas costs" but in fact chunk space is limited by "Compute costs" which already can be seen as severe undercharging. Additional undercharging in storage_remove doesn't look too serious to me.

walnut-the-cat commented 6 months ago

@robin-near , @tayfunelmas , @shreyan-gupta , @bowenwang1996 , and I talked about ideas in the warroom and we converged into our own preference. We can talk about pros/cons of each idea during stateless validation meeting tomorrow

jancionear commented 6 months ago

For now I'll work towards implementing Solution 1 (charge extra). It's a simple solution that could be good enough (depends on the data gathered from https://github.com/near/nearcore/pull/11000).

From other discussion it seems that we've converged on Idea 3 (finalize after every receipt) as a more proper solution, but implementing it would involve more effort. We can consider implementing the proper solution later, after the MVP release.

walnut-the-cat commented 6 months ago

cc. @robin-near , @tayfunelmas , @shreyan-gupta .

If we go with solution 1, do we plan to implement hard limit separately? I thought benefit of solution 3 was us being able to tackle soft limit corner case and hard limit at the same time? Separate question. I understand that solution 1 will result in less number of receipts in a chunk, but what if one receipt is already too big? (e.g. 50mb) Won't it still cause chunk miss to happen?

shreyan-gupta commented 6 months ago

Btw, quick question, in case we are going with solution 1, would we have to maintain this extra charge implementation forever, even if we upgrade protocol later?

jancionear commented 6 months ago

cc. @robin-near , @tayfunelmas , @shreyan-gupta .

If we go with solution 1, do we plan to implement hard limit separately? I thought benefit of solution 3 was us being able to tackle soft limit corner case and hard limit at the same time? Separate question. I understand that solution 1 will result in less number of receipts in a chunk, but what if one receipt is already too big? (e.g. 50mb) Won't it still cause chunk miss to happen?

Going with (1) doesn't cause any trouble with implementing the hard per-receipt limit. During storage_* we can ask for the adjusted size of the storage proof (with extra charges for removals) and fail the receipt if the adjusted size gets too large. Implementing the hard limit with (3) sounds harder to me, as it would require us to fail a receipt after it has already been executed. It should be easier with (1).

jancionear commented 6 months ago

Btw, quick question, in case we are going with solution 1, would we have to maintain this extra charge implementation forever, even if we upgrade protocol later?

I don't know, is it expected that the current binary can apply any block from the past?

robin-near commented 6 months ago

I don't know, is it expected that the current binary can apply any block from the past?

That's still expected today as far as I understand.

shreyan-gupta commented 6 months ago

So, confirming, solution 1 basically says the following right?

During state witness size calculation, in the recorder, if we have a node delete operation, add/account for an extra 2000 bytes in the calculation
This is effectively tracked by recorder class via trie_update.remove() call
We would need to now call this as state_witness_size_estimate or state_witness_size_upper_bound
Finalize works as expected and calculates the actual state witness and its size

shreyan-gupta commented 6 months ago

Meanwhile for the hard limit, we were having some more discussions here, as I'll try to give a brief summary

Our original though of "finalize after each receipt execution" to get the size of witness (per receipt) doesn't work as it's impossible to rollback touched trie nodes. Once touched, they need to be a part of the state witness else the validation can not happen properly.

What we need to remember here is that chunk validators only work on the state witness partial trie and don't have access to the whole trie like the chunk producer does.

Instead, we need to tap into the runtime and stop execution as soon as we reach the hard limit. This would be consistent across chunk producer and validators.

To be noted, initially we were thinking about this as a hard limit per receipt execution, but we may need to change our mindset and think of it as hard limit for the whole chunk/for all receipts instead as that's the most straightforward way to implement it. (We only keep a running track of state witness size, not for individual receipts).

We would have to then set the limits as something on the lines of hard_limit = soft_limit + max allowed limit per receipt.

shreyan-gupta commented 6 months ago

In terms of implementation, the logic for the hard limit need to go via trie_update calls in runtime/runtime/src/ext.rs

Code flow is

apply in runtime/runtime/src/lib.rs
storage_remove, storage_read etc. in runtime/near-vm-runner/src/logic/logic.rs
storage_get, storage_remove etc. in runtime/runtime/src/ext.rs
trie_update

jancionear commented 6 months ago

Our original though of "finalize after each receipt execution" to get the size of witness (per receipt) doesn't work as it's impossible to rollback touched trie nodes. Once touched, they need to be a part of the state witness else the validation can not happen properly.

Oooh that's a good point. To prove that executing the receipt generated 100MB of storage proof we would have to include the 100 MB storage proof in the witness x.x. You're right, post execution checking doesn't really look viable. That kinda kills idea (3) :c I guess that promotes (2) to the "proper solution" position...

jancionear commented 6 months ago

Btw, quick question, in case we are going with solution 1, would we have to maintain this extra charge implementation forever, even if we upgrade protocol later?

Eh yeah, this sucks :/

But I'm kinda thinking that after moving to stateless validation we might want to throw some of the old trie infrastructure away anyway. There's a lot of things that don't make much sense with stateless validation - accounting cache, flat storage, etc. Maybe in the future we would be able to put all of this legacy trie code in some module and make a clean new implementation based on memtries (something like (2))?

jancionear commented 6 months ago

I'll try to gather some data to see how things are.

I added some metrics to evaluate viability of idea 1 (in https://github.com/near/nearcore/pull/11000), and started a shadow validation node to see how the upper-bound size compares to the recorded size.

Here are the grafana dashboards (starting at "Recorded storage size per receipt"): https://nearinc.grafana.net/d/edbl9ztm5h1q8b/stateless-validation?orgId=1&from=1713182578000&to=1713193378000&var-chain_id=mainnet&var-shard_id=All&var-node_id=ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf

jancionear commented 6 months ago

Here are the grafana dashboards (starting at "Recorded storage size per receipt"): https://nearinc.grafana.net/d/edbl9ztm5h1q8b/stateless-validation?orgId=1&from=1713182578000&to=1713193378000&var-chain_id=mainnet&var-shard_id=All&var-node_id=ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf

Some observation about the data:

99% of receipts generate very little storage proof (< 50KB)
There are some outlier receipts that generate up to 2MB of storage proof
The maximum values of the actual per-receipt recorded storage and the upper bound are fairly similar, both are under 2MB
95% of chunks generate small storage proofs (< 1MB)
There are some outlier chunks that generate up to ~2.5MB of storage proof
The maximum value of per-chunk upper bound is ~2x larger than the actual recorded size. The upper bound size gets as large as ~5MB on outliers
For 95% of chunks the upper bound is a pretty accurate estimate, at most 20% larger than the actual recorded size.
There are some outlier chunks for which the upper bound estimate is 7x larger than the actual recorded size.

Based on this we could try setting the per-receipt limit to 4MB, and the per-chunk soft limit to 8MB. It leaves us with a bit of safety margin to make sure that the existing contracts won't break, while ensuring that the total size of storage proof stays under 12MB.

The truth is that gas costs seem to already limit the size quite effectively. All of the chunk-level storage proofs are under 2MB (on mainnet traffic). The limit doesn't really need to maintain the 2MB size of storage proof, as it's already maintained by gas costs. It only has to protect against malicious traffic aiming to generate a huge storage proof.

walnut-the-cat commented 6 months ago

The maximum values of the actual per-receipt recorded storage and the upper bound are fairly similar, both are under 2MB

How often do this go out of sync?

The maximum value of per-chunk upper bound is ~2x larger than the actual recorded size. The upper bound size gets as large as ~5MB on outliers

I guess this is the other 5% of the cases? (since you mentioned upper bound is accurate for 95% of chunks)

Based on this we could try setting the per-receipt limit to 4MB, and the per-chunk soft limit to 8MB.

Is this pre-compression size?

cc. @saketh-are , @shreyan-gupta

jancionear commented 6 months ago

How often do this go out of sync?

I only collected this data for receipts which generate >100KB of storage proof, the ratio could get really high for small receipts, and small receipts don't matter for hard limit, so I excluded them. I can't say for sure, but I'd estimate that ~5% of receipts have a big difference between the upper bound estimation and the actual value.

I guess this is the other 5% of the cases? (since you mentioned upper bound is accurate for 95% of chunks)

Those really big ones are ~0.2% of all chunks. I guess we could ignore them and lower the soft limit further. The soft limit can be arbitrarily lowered as long as it doesn't affect throughput too much.

Is this pre-compression size?

Yes, although Trie nodes are mostly hashes, which aren't compressible, so it might not matter that much.

jancionear commented 6 months ago

Added one more dashboard to see how many chunks are at most X MB Big:

It looks like we could set the soft size limit to as low as 3MB, and 99% of the chunks would still fit in this limit. For the other 1% we would move the receipts to the delayed queue and execute them in another chunk. That would be a ~1% slowdown in chunk throughput, but it'd give us a solid guarantee that no chunk storage proof is larger than 7MB.

walnut-the-cat commented 6 months ago

There are some outlier receipts that generate up to 2MB of storage proof

Can you also check where these outliers are coming from? If they are from major dapps (e.g HOT), that can be troublesome

jancionear commented 6 months ago

Can you also check where these outliers are coming from? If they are from major dapps (e.g HOT), that can be troubles

I added some logs to print out receipts that generate more than 500KB of storage proof (upper bound). Looking at the past 12 hours there are five accounts which produce such receipts:

aurora - generates ~700KB of storage proof
nft.w3music.near - generates ~700KB of storage proof
claim.sweat - generates ~500KB of storage proof
v2_0_2.perp.spin-fi.near - generates 50KB of storage proof, but the upper bound is ~1MB
a.mitte-orderbook.near - generates ~200KB of storage proof, but upper bound is ~1.3MB

For v2_0_2.perp.spin-fi.near and a.mitte-orderbook.near the upper bound estimate is very inaccurate, up to 20x larger than the normal recorded size, but it's still under 2MB, so we can live with it.

Here are the logs: big_receipts.txt

jancionear commented 6 months ago

I added some logs to print out receipts that generate more than 500KB of storage proof (upper bound).

We've got a new record! sega.segaai.near generated 2.1MB of storage proof (upper bound estimated to 2.3MB)

Apr 17 03:14:59 ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf neard[132426]: 2024-04-17T03:14:59.035123Z  INFO apply_chunk{shard_id=4}: runtime: Receipt with large storage proof! v2 recorded_storage_diff=2118418.0 recorded_storage_upper_bound_diff=2320418.0 receipt_id=4Qz3F5ZN4npjvLUdaoR1fBiHRiqQHpPWxFFg82mbXgy1 predecessor_id=AccountId("sega.segaai.near") receiver_id=AccountId("sega.segaai.near")

jancionear commented 6 months ago

https://github.com/near/nearcore/pull/11069 worked around this issue by implementing Idea (1) (upper bound estimation of the recorded size). The issue is no longer a blocker for stateless validation, but IMO it's still an issue. The upper bound estimation is inaccurate, it'd be much cleaner to fix the recording issue properly. I think the issue should remain open, but we can remove it from the stateless validation roadmap.

near / nearcore