near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Limit size of source_receipt_proofs inside ChunkStateWitness #11295

Open jancionear opened 5 months ago

jancionear commented 5 months ago

There could be a lot of large incoming receipts, which could cause the source_receipt_proofs field inside ChunkStateWitness to be really large. There are known problems with distributing large instances of ChunkStateWitness, so we should analyze how large the incoming receipts can be, and add a size limit to make sure that their size stays reasonable.

Refs: https://github.com/near/nearcore/issues/10259, https://github.com/near/nearcore/issues/11103, https://github.com/near/nearcore/pull/11274

/cc @wacban @Longarithm, I know you already spent some time thinking about incoming receipts.

I think we should at least look into it before stateless validation launch, to estimate how bad things can get.

jancionear commented 5 months ago

One solution would be to allow shards to send out large receipts on every nth block, round robin style.

Something like:

if block_height % num_shards == my_shard_id {
    // allowed to send out 4MB of receipts
} else {
    // allowed to send out 50kB of receipts, the rest of outgoing receipts should wait in a queue
}

This would ensure that incoming receipts from one block height are at most 4MB + (num_shards - 1) * 50kB in size.

There's still a problem when we have missing chunks and the incoming receipts are from multiple block heights, but this could be solved by adding a condition that a chunk is allowed to send out large receipts only when the previous chunk wasn't missing.
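The round-robin allowance plus the missing-chunk condition can be sketched as a small function. This is a hypothetical sketch, not nearcore's API; the names and the 4 MB / 50 kB limits are the illustrative numbers from above.

```rust
// Illustrative limits from the discussion above, not nearcore constants.
const BIG_ALLOWANCE: u64 = 4_000_000; // 4 MB on our designated height
const SMALL_ALLOWANCE: u64 = 50_000; // 50 kB otherwise

/// Bytes of outgoing receipts a shard may send at `block_height`.
/// The big allowance applies only when it's our turn in the round robin
/// AND the previous chunk wasn't missing, so a receiver never has to
/// accumulate large allowances across multiple heights.
fn outgoing_allowance(
    block_height: u64,
    num_shards: u64,
    my_shard_id: u64,
    prev_chunk_missing: bool,
) -> u64 {
    if block_height % num_shards == my_shard_id && !prev_chunk_missing {
        BIG_ALLOWANCE
    } else {
        SMALL_ALLOWANCE
    }
}
```

With this rule, incoming receipts at one height are bounded by one big allowance plus (num_shards - 1) small ones.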

wacban commented 5 months ago

@jancionear I think it's best to reason about the aggregated sum of the receipts rather than just the size of an individual receipt. At the end of the day it's the total size of all receipts that contributes to the state witness.

With that in mind I would suggest deploying a new metric to a canary node showing the typical total size of outgoing receipts per receiver shard. Assuming that on average this metric remains somewhat reasonable we can then consider what you suggested - limiting the number of bytes a shard is allowed to send to another shard in one block with round robin extra allowance.

jancionear commented 5 months ago

Another thing to consider is smaller receipts being blocked by large ones. If we were to just process receipts in FIFO order, waiting until an outgoing receipt can be sent out, we could end up in a situation where the first receipt in the queue is really large, and we'll wait for a few blocks until we're able to send out this receipt, without sending anything else for those few blocks. A single big receipt blocks all small receipts.

To fix this we could have two separate queues of outgoing receipts - a queue for big ones that can only be sent in the special blocks, and another for small ones that can be sent from any block.
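A minimal sketch of that queue split, assuming illustrative types and a hypothetical 100 kB big-receipt threshold (none of these names exist in nearcore):

```rust
use std::collections::VecDeque;

// Hypothetical threshold separating "big" from "small" receipts.
const BIG_RECEIPT_THRESHOLD: usize = 100_000;

struct Receipt {
    size: usize,
}

/// Two outgoing queues: big receipts wait for the special round-robin
/// height, small ones may go out at any height, so a single big receipt
/// no longer blocks everything behind it.
#[derive(Default)]
struct OutgoingQueues {
    big: VecDeque<Receipt>,
    small: VecDeque<Receipt>,
}

impl OutgoingQueues {
    fn push(&mut self, r: Receipt) {
        if r.size > BIG_RECEIPT_THRESHOLD {
            self.big.push_back(r);
        } else {
            self.small.push_back(r);
        }
    }

    /// Pop receipts to send at this height without exceeding `allowance`.
    /// Big receipts are only eligible on the special height.
    fn drain(&mut self, allowance: usize, special_height: bool) -> Vec<Receipt> {
        let mut out = Vec::new();
        let mut used = 0;
        if special_height {
            while let Some(r) = self.big.front() {
                if used + r.size > allowance {
                    break;
                }
                used += r.size;
                out.push(self.big.pop_front().unwrap());
            }
        }
        while let Some(r) = self.small.front() {
            if used + r.size > allowance {
                break;
            }
            used += r.size;
            out.push(self.small.pop_front().unwrap());
        }
        out
    }
}
```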

wacban commented 4 months ago

My current take on it would be as follows:

jancionear commented 4 months ago

Thanks to https://github.com/near/nearcore/pull/11344 eventually a new chunk will be produced in the affected shard and we can move past the troublesome state witness.

Does #11344 help with large chunk state witnesses? My understanding was that it helps when it takes a long time to apply a chunk, but there could still be a situation when a chunk isn't endorsed in time because the validator didn't receive a witness for this chunk in time.

wacban commented 4 months ago

Does #11344 help with large chunk state witnesses? My understanding was that it helps when it takes a long time to apply a chunk, but there could still be a situation when a chunk isn't endorsed in time because the validator didn't receive a witness for this chunk in time.

I would hope so but definitely needs testing and checking. Perhaps you can introduce a way to manually increase the size of the state witness (good for testing in forknet) or slow down the state witness distribution (good for testing in nayduck) and see if the chain can recover? I think both are worth doing. You can follow the example of slow_chunk.py where neard is compiled with the test_features feature and you can add adversarial behaviour, custom host functions or whatever else you need for simulating difficult behaviour.

jancionear commented 4 months ago

During the last meeting Bowen said that https://github.com/near/nearcore/pull/11344 might not help with large witnesses, we might need a separate fix to deal with witnesses that take a long time to distribute.

jancionear commented 4 months ago

Even if we can recover from extra large witnesses, I feel that we still need to properly limit witness size. A malicious actor could try to cause the worst case scenario on every block, which would make the chain struggle.

jancionear commented 4 months ago

The approach from https://github.com/near/nearcore/issues/11295#issuecomment-2110617770, where one shard is allowed to send more than the others, isn't the best, because receipts may have to wait for up to num_shards blocks before being sent.

Maybe it would be possible to make a better solution by making the shards negotiate how much they're allowed to send out on each block height. At each height every shard would publish an intent of sending out X MB of receipts to some shard. Then on the next height every shard sees how much every other shard wants to send out, and can deterministically decide how much they're allowed to send. In scenarios of low congestion this would allow large receipts to be sent in ~2 blocks, instead of num_shards blocks. With high congestion the deterministic algorithm would determine who is allowed to send out a large receipt at each height, in a fair way.
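One way the deterministic grant step could look, as a rough sketch: every shard sees the same published intents and runs the same pure function over them, so all shards agree on who gets the large grant without further communication. All names, constants, and the tie-breaking rule here are illustrative assumptions, not a worked-out protocol.

```rust
// Illustrative per-receiver budget and baseline allowance.
const RECEIVER_BUDGET: u64 = 4_000_000;
const BASE_ALLOWANCE: u64 = 50_000;

/// `intents[i]` = bytes shard i declared (at the previous height) that it
/// wants to send to the receiver shard. Deterministically grant the big
/// budget to the shard with the largest intent (lowest shard id wins
/// ties); everyone else gets the base allowance. Every shard computes the
/// same result from the same inputs.
fn grants(intents: &[u64]) -> Vec<u64> {
    let winner = intents
        .iter()
        .enumerate()
        // Larger intent wins; on equal intents, the lower shard id wins.
        .max_by(|(ia, a), (ib, b)| a.cmp(b).then(ib.cmp(ia)))
        .map(|(i, _)| i);
    intents
        .iter()
        .enumerate()
        .map(|(i, _)| {
            if Some(i) == winner {
                RECEIVER_BUDGET
            } else {
                BASE_ALLOWANCE
            }
        })
        .collect()
}
```

A real version would need fairness across heights (e.g. rotating tie-breaks) so one shard can't monopolize the budget, but the key property is that the decision is a deterministic function of published intents.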

That sounds like a better solution, although it's more complex, so for now we can go with the simple round robin.

jancionear commented 4 months ago

In https://github.com/near/nearcore/pull/11492 I implemented a basic receipt size limit using the approach where one chosen shard is allowed to send out more at each block height, while other shards wait for their turn.

It limits the receipts, but there are some flaws. We had a meeting discussing it, here's a short summary:

The first problem is that small receipts can be blocked by big receipts. If a small receipt is enqueued after a big one, then the small one will have to wait until the big one is sent out, which happens only once every num_shards blocks.

Another problem is that the change might not play well with congestion control. Congestion control reacts aggressively to buffered receipts. Once there is 500 TGas worth of enqueued receipts, the shard is considered fully congested and cannot receive any more receipts. Having 500 TGas of enqueued receipts could become fairly common because of small receipts getting stuck behind the big ones.

One way to solve the problem of small receipts getting stuck would be to have a separate queue for large receipts. But this would be problematic for the transaction priorities NEP. With transaction priorities we would like to send receipts with higher priority before the ones with low priority. What should happen when a big receipt has higher priority, but we're unable to send it at this block height? Should we send a smaller receipt with lower priority? Or wait for the big receipt? Two queues cause a headache there.

There was also the question of whether we need this kind of limit at all. If we reduced the receipt size limit to 1.5 MiB, then with six shards we could receive at most 6*1.5 MiB = 9 MiB of incoming receipts, which sounds somewhat manageable. I believe that having a limit on incoming receipts to a single shard is consistent with the sharded design of NEAR. A single shard can only process so much, and can only receive so much. As we scale horizontally and keep adding more shards, the issue will become more pronounced. With 100 shards we could have 150 MiB of incoming receipts, which isn't going to work. We need to incorporate a limit on incoming data into our design sooner or later.
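The worst case above comes from every shard targeting the same receiver in one block, so it scales linearly with the shard count. A back-of-envelope check (constants taken from the discussion, nothing else assumed):

```rust
/// Worst-case bytes of incoming receipts at one height if every shard
/// sends one maximum-size receipt to the same receiver shard.
fn worst_case_incoming_bytes(num_shards: u64, receipt_size_limit: u64) -> u64 {
    num_shards * receipt_size_limit
}
```

With a 1.5 MiB per-receipt limit this gives 9 MiB at 6 shards but already 150 MiB at 100 shards, which is why a per-receipt limit alone doesn't scale.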

There is also the concern of DoS attacks. What happens when someone submits a lot of large receipts aimed at a single shard? Cross shard throughput is limited, so it will cause a lot of congestion. I think that gas-based congestion control has to deal with the same problem, so maybe there can be some universal solution to discourage sending a lot of data to a single shard - higher gas prices or something like that.

jancionear commented 4 months ago

Because of the trouble with transaction priorities I became less enthusiastic about the two-queue solution. I think it would help in the short term, increasing throughput and lowering latency in the presence of large receipts, but it's incompatible with transaction priorities.

But I have another idea that should work better: partial receipts. A lot of headache is caused by the fact that we can only send whole receipts, which can be really large. What if we added a way to split a large receipt into many small partial receipts? The sender shard could send a limited amount of partial receipts at every block height instead of sending one large receipt. This way we wouldn't have a dilemma whether to send a small receipt with low priority before the large receipt. Every receipt (or at least a part of a receipt) is sendable at every height, so we can have a single queue/buffer of outgoing partial receipts and send some of them out to the receiver shard at every block height. The receiver has to reconstruct the large receipt from the partial ones, but that sounds doable.
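The split/reassemble step might look roughly like this, treating a receipt as an opaque byte blob. The types, the 50 kB part size, and the assumption that parts arrive sorted are all illustrative, not a proposal for the wire format.

```rust
// Hypothetical maximum size of one partial receipt.
const PART_SIZE: usize = 50_000;

struct PartialReceipt {
    receipt_id: u64,
    part_index: usize,
    total_parts: usize,
    data: Vec<u8>,
}

/// Sender side: split a large receipt into bounded-size parts that can be
/// streamed out over several block heights under the per-height limit.
fn split_receipt(receipt_id: u64, bytes: &[u8]) -> Vec<PartialReceipt> {
    let chunks: Vec<&[u8]> = bytes.chunks(PART_SIZE).collect();
    let total_parts = chunks.len();
    chunks
        .into_iter()
        .enumerate()
        .map(|(part_index, data)| PartialReceipt {
            receipt_id,
            part_index,
            total_parts,
            data: data.to_vec(),
        })
        .collect()
}

/// Receiver side: reassemble once all parts have arrived. Parts are
/// assumed to already be sorted by `part_index`; returns None while
/// some parts are still missing.
fn reassemble(parts: &[PartialReceipt]) -> Option<Vec<u8>> {
    let total = parts.first()?.total_parts;
    if parts.len() != total {
        return None; // still waiting for more parts
    }
    Some(parts.iter().flat_map(|p| p.data.iter().copied()).collect())
}
```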