near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

[stateless_validation] Missing main transition state proof for old block #10964

Open staffik opened 3 months ago

staffik commented 3 months ago

This happens when running a stateless validation cluster with shard shuffling and many consecutive missing chunks. It occurred on both statelessnet and adversenet, using the 84.2 statelessnet protocol release. The error looks like this:

panicked at chain/client/src/stateless_validation/state_witness_producer.rs:183:25:
Missing main transition state proof for block B4uA4cB4BEu6dmKCBz3bJMpvZSqyLXF84HWExo7dynJ9 and shard 3

The likely reason is that the state transition data for chunk X was GC-ed after we switched shards, and a long time later we needed this data for chunk X again.
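The suspected failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions: `TransitionStore`, its height-keyed layout, and the `gc_tail` parameter are hypothetical and do not reflect nearcore's actual storage or GC code. It shows how a proof saved for the last chunk of a shard disappears once GC advances past it, so a later lookup fails if the run of missing chunks is long enough.

```rust
use std::collections::HashMap;

/// Hypothetical toy store for state transition data.
/// Keyed by (block height, shard id); NOT nearcore's real layout.
struct TransitionStore {
    proofs: HashMap<(u64, u64), Vec<u8>>,
}

impl TransitionStore {
    fn new() -> Self {
        Self { proofs: HashMap::new() }
    }

    /// Record the state proof produced when a chunk was applied.
    fn save(&mut self, height: u64, shard_id: u64, proof: Vec<u8>) {
        self.proofs.insert((height, shard_id), proof);
    }

    /// Drop all entries strictly below `gc_tail` (assumed GC policy).
    fn gc(&mut self, gc_tail: u64) {
        self.proofs.retain(|&(height, _), _| height >= gc_tail);
    }

    /// Look up the proof for the last height at which the shard had a chunk.
    /// Returns an error instead of panicking when the data is gone.
    fn main_transition(&self, last_chunk_height: u64, shard_id: u64) -> Result<&[u8], String> {
        self.proofs
            .get(&(last_chunk_height, shard_id))
            .map(|proof| proof.as_slice())
            .ok_or_else(|| {
                format!(
                    "Missing main transition state proof for height {last_chunk_height} and shard {shard_id}"
                )
            })
    }
}
```

In this model, if shard 3 last had a chunk at height 100 and GC has since advanced to height 150, `main_transition(100, 3)` returns the error, which corresponds to the panic reported above.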

bowenwang1996 commented 3 months ago

I don't understand. With mainnet's epoch length, why would the state witness for a chunk be needed more than one epoch later? We can assume that a chunk will be produced every epoch.

staffik commented 3 months ago

I think it can only happen if we have a very long run of consecutive missing chunks. That probably won't happen on mainnet, but it did happen on statelessnet.

Longarithm commented 3 months ago

> We can assume that a chunk will be produced every epoch

@bowenwang1996 I don't feel confident relying on this for design decisions. With stateless validation, we are reducing the number of CPs to only a few. So there is some probability of them colluding and just stopping producing chunks for the whole epoch. After that, our assumption would break, and I have no idea what else would break across the whole chain :) We've also seen a specific shard stall in statelessnet due to technical issues, AFAIU. Let's say some shard stalls on mainnet due to some weird bug (e.g. some chunk extra appears to be missing on all nodes). If we don't debug this within a day, we end up with all chunks in the epoch missing again.

I think we need some exact defensive mechanism against it. Like, don't finalise epoch until each shard has at least one chunk in it. But again, it doesn't help with attack above. Or, if we are confident in mainnet validators, I think we still need some workaround for our test chains.

#11039 is also slightly relevant.
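The defensive check proposed in this comment (do not finalise an epoch until each shard has at least one chunk in it) could be sketched roughly as follows. The function name and the input shape are assumptions made for illustration: each block is represented by a per-shard boolean mask, `true` meaning a new chunk for that shard was included in the block. This is not nearcore's epoch manager logic.

```rust
/// Hypothetical check: returns true only if every shard had at least one
/// new chunk somewhere in the epoch.
///
/// `chunk_masks` holds one mask per block in the epoch; `mask[shard]` is
/// true when that block included a new chunk for `shard` (assumed shape).
fn every_shard_has_chunk(num_shards: usize, chunk_masks: &[Vec<bool>]) -> bool {
    let mut seen = vec![false; num_shards];
    for mask in chunk_masks {
        // Ignore any entries beyond `num_shards` to avoid out-of-bounds access.
        for (shard, &has_chunk) in mask.iter().enumerate().take(num_shards) {
            seen[shard] |= has_chunk;
        }
    }
    seen.iter().all(|&s| s)
}
```

Under this rule, an epoch in which one shard never produces a chunk would simply not end. That is why colluding chunk producers could still stall epoch finalisation, which is the objection raised in the same comment.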

walnut-the-cat commented 3 months ago

> So there is some probability of them colluding and just stopping producing chunks for the whole epoch.

What would they gain by doing so? Doesn't this mean they won't get any reward and risk getting kicked out?

bowenwang1996 commented 2 months ago

> Like, don't finalise epoch until each shard has at least one chunk in it. But again, it doesn't help with attack above.

I think we should do that, but the mainnet launch does not have to block on it. Also, why doesn't it help with the attack? Yes, if all chunk producers collude, they can prevent an epoch from ending, but that doesn't benefit them. Rather, they won't get any reward if an epoch doesn't end.