jakmeier commented 9 months ago

(Supersedes #11)

Goals

We want to guarantee the Near Protocol blockchain operates stable even during congestion. This is currently impeded due to a lack of cross-shard congestion control. Specifically the delayed receipt queues of each shard may grow indefinitely during congestion, which is bad since those are part of the state tries.

With this project, we want to ensure all queues of receipts have a fixed limit in size. The queues in question are:

Delayed receipts queue
Postponed receipts queue
Any new queues introduced by the congestion control solution itself

Current status: Local Congestion Control

Right now we have fully implemented Local Congestion Control, meaning that shard validators and RPC nodes will not be overloaded by transactions coming into their local transaction pools and receipts generated from local transactions.

Technically, this is achieved by introducing two limits:

The size of the local transaction pool for each shard is limited to 100MB, after the limit is reached, the node stops accepting new transactions. This limit is configurable
The size of the delayed receipts queue is limited to 20,000 receipts, after the limit is reached, the chunk producer will stop including new receipts in the chunk. This limit is not configurable but is not yet enforced by consensus

For the exact implemented features see https://github.com/near/nearcore/milestone/26?closed=1

Next steps: Cross-Shard Congestion Control

To achieve our goal of bounded queues, we need global congestion control. The solutions currently in consideration fall into two categories.

Either: We drop messages that do not fit in the fixed sizes of the queues. Then we provide a solution to deal with dropped (failed) receipts to contract developers.
Or: We apply backpressure from congested shards to other shards. Shards experiencing backpressure need to throttle the amount of outgoing receipts to the congested shards.

For both categories, many ideas have been discussed already but there has been no clear winner so far.

To move this issue forward, we plan the following steps:

[x] Implement a model to simulate different congestion control strategies
[x] Decide on a congestion control design based on simulation results (Proposal described in https://github.com/near/nearcore/pull/10873)
[x] Implement the solution in nearcore (in progress)
- [x] Add new fields to chunk header and chunk extra (https://github.com/near/nearcore/pull/11128)
- [x] Add outgoing receipt buffers (https://github.com/near/nearcore/pull/11173)
- [x] Implement congestion control algorithm (https://github.com/near/nearcore/pull/11195 + https://github.com/near/nearcore/pull/11200)
- [x] Handle cases with missing chunk (https://github.com/near/nearcore/pull/11274)
- [x] Handle protocol upgrade (https://github.com/near/nearcore/pull/11361, https://github.com/near/nearcore/pull/11307)
- [x] Integration tests (https://github.com/near/nearcore/pull/11247, https://github.com/near/nearcore/pull/11273, and many more added over time)
[x] Benchmark the solution in nearcore

Likely, this will require protocol changes, so we should also add this step:

[x] Creat NEP draft describing the intentions well enough to receive feedback by engineers (https://github.com/near/NEPs/pull/539)
[x] Make NEP for cross-shard congestion control ready for SME reviewers
[x] Get protocol working group meeting scheduled
[x] Get NEP approved at protocol working group meeting

Links to external documentations and discussions

NEP-539 Cross-Shard Congestion Control

Congestion control design proposal documentation, March 2024.

Kick-off for global congestion control February 2024

State of congestion control in September 2023. This document also provides links to additional docs.

Current Zulip thread (feel free to drop any questions or comments there or here in GitHub)

Zulip thread September 2023

High-level overview of final proposal, as accepted in NEP-539: Slides

Estimated effort

We aim to solve most of the engineering work by 31st of May, 2024, with @wacban and @jakmeier working on it 50% each. This is unlikely to be the perfect solution (see Assumptions and Out of scope below) but it should guarantee bounded queues.

Taking NEP approval and the release schedule into account, we expect this to be live in mainnet some time around July or August 2024.

Assumptions

We assume all validators can spend a substantial amount of memory on receipts queues. (100s of MBs)
We can make changes to the protocol that will negatively affect the total throughput of shards, as long as it is a reasonable amount that can be justified with the stability guarantees we gain form it.
The number of total shards stays below 100.

Pre-requisites

Work starts immediately, no pre-requisites needed.

Out of scope

Per-contract fairness
Infinitely scalable congestion control in terms of number of shards
Problems around inefficient receipt queue handling in the current nearcore implementation which may lead to unnecessarily large memory consumption.
Reducing receipt size limits.
Client-side implementations for retrying rejected transactions.
Transaction Priority (this will be follow-up work that depends on completion of global congestion control)
Cleaning up how gas pricing works

walnut-the-cat commented 8 months ago

Either: We drop messages that do not fit in the fixed sizes of the queues. Then we provide a solution to deal with dropped (failed) receipts to contract developers. Or: We apply backpressure from congested shards to other shards. Shards experiencing backpressure need to throttle the amount of outgoing receipts to the congested shards.

Maybe I am not super clear on what you meant by 'Either' and 'Or' here.

jakmeier commented 8 months ago

I just wanted to make it clear that the two categories of solutions discussed so far are split. One set of solutions involves dropping receipts on the go. The other involves backpressure. Either of those can work independently of the other.

But I guess it makes it sounds as if they are incompatible. That's not true, indeed we could combine the two approaches for the final solution.

jakmeier commented 8 months ago

Quick status update:

Done We have a basic model (nearcore/10695) and even the ability for local Grafana dashboard to look at the results (nearcore/10719).

A few sample workloads and strategies are also already included. But those are more of a demo of the model. The workloads are too simple to give complete picture. And the strategies are mostly just demos or exploring specific ideas in isolation. None of the strategies would be a suitable proposal.

Ongoing work This week we are looking into specific strategies and workloads. @wacban and I discussed general ideas and what exactly we want to try this week. We will share relevant results from the experiments when we have them.

And on the way, we will improve the model and the output as necessary.

Progress vs Time Estimate The original plan was to come up with a proposed strategy based on model output by next week. I think we are just in time to hit that mark. Then we can start working on a design document in the form of an NEP and start gathering feedback from everyone else.

jakmeier commented 8 months ago

Status update:

Done

Based on several workloads and strategies we simulated, we collected ideas and evaluated which of those are good and which are bad or useless.

Analysis on advanced workloads, crafter to be harder to predict where work is executed: https://github.com/near/nearcore/pull/10794
Same for workloads with unpredictably large receipts (many bytes per gas): https://github.com/near/nearcore/pull/10822

This has lead to two main strategies we've looked at in more details:

We then compared the two in this document: https://docs.google.com/document/d/1wVQIF0cgilO9m-iI_P5HK6MVc0b6RAxsxtyTZ1nMnBs/edit?usp=sharing

The final result is a merge of the two ideas and results in:

A first NEP draft: https://github.com/near/NEPs/pull/539
A design documentation page for the nearcore guide book: https://github.com/near/nearcore/pull/10873

Ongoing work

Waiting for feedback on our design
Complement missing sections in the NEP
Fine-tune parameters in the proposal: https://github.com/near/nearcore/issues/10874
Look at behaviour with non-continuous workloads: https://github.com/near/nearcore/issues/10704

Progress vs Time Estimate:

The first milestone has been completed ~3 days after schedule. ("Decide on the high level design with the help of queueing-theory simulations.")
We are now 1 week / 3 weeks into the second milestone: "Write down the detailed design (NEP) and a reference implementation in nearcore". Given that we already have a first draft of the NEP, the timeline is looking okay. But we need to start working on the reference implementation immediately.

Projected Solution Quality

Initially we defined a set of must-have properties and a set of aspiration properties. Let's check in on them on which we think we can achieve.

Must-have:

Queues in the system are bounded in how many bytes they require. (This is per chunk)

We will have limits in place. But they won't be explicit limits in bytes that we can guarantee. So this requirement will not be fulfilled as cleanly as we hoped for.

The NEP for this is approved and merged.

The implementation is merged to the nearcore master branch, stabilized, and ready for testnet deployment.

It looks like we will hit those in time.

Aspirational Properties

Queues are small enough to be kept entirely in memory

Again, we don't have hard guarantees in our solution. But we believe this trade-off is necessary and will still ensure that in all but the most targeted malicious cases, it will fit into memory. And certainly, it will be a strict improvement over today's system in malicious cases.

Once accepted, transactions and all following receipts will not fail.

We will satisfy this.

Bounded latency to resolve a receipt.

We will satisfy this and can still decide what we want to guarantee to be, trading it against utilization in marginal cases.

In a non-congested setting, every shard can run at full speed.

Satisfied.

Transactions that only touch non-congested shards are not affected by congested shards.

Partially satisfied. Backpressure means every shard that is on the path of congesting flows will become congested and experience negative consequences, even if their shard on its own wouldn't be congested. But all other shards are completely unaffected.

The same strategy can be used for any number of shards.

This seems fulfilled about as well as we can expect it to.

We require 4 bytes in chunk header, it seems this puts no practical scaling limits on the number of shards beyond what the chunk header already has (it is already >200 bytes).
The buffer space from one shard to all others is shared.
- positive: Memory requirements do not scale by number of shards, so it can be applied to any number of shards.
- negative: Utilization might suffer with many shard, as the allocatable buffer space per outgoing shard becomes smaller. (Needs further investigation)

jakmeier commented 7 months ago

Status update:

SME reviewers have been assigned (@robin-near and @akashin) a few days ago
To not be blocked on approval, we are already implementing the complete feature against nearcore and started merging PR with the changes behind a feature flag

jakmeier commented 4 months ago

Overdue update:

NEP-539 had been accepted in the call on May 24.
The full implementation is in nearcore master.
The release of the feature has been delayed since the next release contains the huge change to stateless validation (SV). This means congestion control needs to work well with SV, which required extra testing and a few changes due to additional features SV requested from congestion control. But it looks like SV will hit testnet next week, which will include cross-shard congestion control.
Beyond the scope defined in this issue, we have also looked at changes on the RPC nodes.
- New error types for a congested shard (https://github.com/near/nearcore/pull/11419, https://github.com/near/nearcore/pull/11716, and public documentation for it https://github.com/near/docs/pull/2018)
- A new RPC method to query the congestion level https://github.com/near/nearcore/pull/11481.

near / near-one-project-tracking

[ProjectTracking]: congestion control #48

Goals

Current status: Local Congestion Control

Next steps: Cross-Shard Congestion Control

Links to external documentations and discussions

Estimated effort

Assumptions

Pre-requisites

Out of scope