solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Congestion control for staked/unstaked nodes #29754

Open KirillLykov opened 1 year ago

KirillLykov commented 1 year ago

Problem

If the client sends txs to the validator and the network latency is large, TPS is very low (see the table below). In contrast, if we place the client and server in one datacenter, TPS will be 100k higher.

Problem setup

Single node private network:

It looks like this is due to the QUIC receive window size set here. As pointed out by @lijunwangs, it is adjusted by the constant QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO (see quic.rs:25), which is 1 by default. Setting this parameter to 10 (the max value for staked) improves the situation but cannot beat the UDP results.

| Protocol | Ratio*                  | TPS  |
|----------|-------------------------|------|
| UDP      | -                       | 6375 |
| QUIC     | 1 (default)             | 138  |
| QUIC     | 10 (default for staked) | 1256 |

For the sake of experiment, I also tried QUIC_MAX_UNSTAKED_CONCURRENT_STREAMS = 1024 (default 128) together with QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO = 10: TPS = 1478.
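
To make the window/latency relationship concrete, here is a minimal back-of-the-envelope sketch (not the actual quic.rs code): the connection can have roughly one receive window of data in flight per round trip, so the ratio caps TPS at about ratio × MTU / tx_size transactions per RTT. The MTU, RTT and transaction size below are illustrative assumptions, chosen to be roughly consistent with the table above.

```rust
// Back-of-the-envelope model, not the code in quic.rs; all numbers are assumptions.
const MTU_BYTES: u64 = 1_232; // assumed effective QUIC packet payload size
const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 1; // default mentioned above
const QUIC_MAX_STAKED_RECEIVE_WINDOW_RATIO: u64 = 10;

/// Receive window implied by a ratio, in bytes.
fn receive_window_bytes(ratio: u64) -> u64 {
    ratio * MTU_BYTES
}

/// Rough TPS ceiling: at most one receive window of data per round trip.
fn tps_ceiling(ratio: u64, rtt_ms: f64, tx_bytes: f64) -> f64 {
    (receive_window_bytes(ratio) as f64 / tx_bytes) * (1_000.0 / rtt_ms)
}

fn main() {
    let rtt_ms = 40.0; // assumed client-to-leader round trip
    let tx_bytes = 200.0; // assumed average serialized transaction size
    for ratio in [
        QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO,
        QUIC_MAX_STAKED_RECEIVE_WINDOW_RATIO,
    ] {
        println!(
            "ratio {:>2} -> window {:>5} bytes, ~{:>5.0} tps ceiling at {} ms RTT",
            ratio,
            receive_window_bytes(ratio),
            tps_ceiling(ratio, rtt_ms, tx_bytes),
            rtt_ms
        );
    }
}
```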

Consequence

If a client application uses TPUClient to send txs, the variance of TPS is high. This happens because TPUClient will try sending to the next leaders, so TPS will be strongly correlated with the network latency. In contrast, if we pick ThinClient and always send to the closest validator, TPS will be higher because that validator will forward txs to the leader using the staked QUIC configuration.

Current implementation summary

Unstaked nodes: the window size is set to the minimum (1 × MTU), which leads to low PPS on networks with packet drops. Staked nodes: the window size depends only on the stake and varies over the ratio range 2..10.
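
As a rough illustration of the summary above, a stake-to-window-ratio mapping along these lines could look like the sketch below. The linear interpolation and clamping are assumptions for illustration; the actual function in the validator may differ.

```rust
const UNSTAKED_RATIO: u64 = 1;
const MIN_STAKED_RATIO: u64 = 2;
const MAX_STAKED_RATIO: u64 = 10;

/// Illustrative stake-based receive window ratio; the exact mapping in the
/// validator may differ, this linear interpolation is an assumption.
fn receive_window_ratio(peer_stake: u64, total_stake: u64) -> u64 {
    if peer_stake == 0 || total_stake == 0 {
        return UNSTAKED_RATIO;
    }
    // Scale linearly with the peer's share of total stake, clamped to [2, 10].
    let span = (MAX_STAKED_RATIO - MIN_STAKED_RATIO) as u128;
    let scaled = MIN_STAKED_RATIO as u128 + span * peer_stake as u128 / total_stake as u128;
    scaled.min(MAX_STAKED_RATIO as u128) as u64
}

fn main() {
    let total_stake = 400_000_000u64; // assumed total active stake
    for stake in [0u64, 100_000, 10_000_000, 200_000_000, 400_000_000] {
        println!(
            "stake {:>11} -> receive window ratio {}",
            stake,
            receive_window_ratio(stake, total_stake)
        );
    }
}
```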

Proposed solution

At the moment: discuss a congestion control model which maximizes network utilization while taking validator load into account.

Preparing a setup for reproducible testing is an important subtask.

Discussed a bit with @lijunwangs and @ilya-bobyr

mschneider commented 1 year ago

This should have quite an impact in production, mainly because most RPC providers run purely unstaked RPC nodes. There are only two I'm aware of that re-submit txs from their staked validators (Jito & Triton).

aeyakovenko commented 1 year ago

@mschneider @KirillLykov we need some way to prevent the bot armies from spamming the leader. So unstaked nodes need to be capped; anyone who needs more bandwidth should be using staked nodes to forward, and we should optimize that interface instead.

If the unstaked nodes can saturate the line and prevent staked messages from getting through, it's a DOS issue.

Overclock-Validator commented 1 year ago

I feel like there should be some unstaked node whitelisting option?

janlegner commented 1 year ago

In Marinade we built the first version of mTransaction. Users can send transactions to our service. These transactions are streamed over gRPC to validators who have previously connected to our service. This effectively backs up transactions with the stake of the validators connected to our system.

We are now beta testing this and have ~11M SOL of stake connected and a few users (e.g. Triton, who helped us design and build this).

Cerbah commented 1 year ago

I also want to add that Marinade would love for mTransaction to solve this issue for the network as a whole, and not be a private product but an open solution more closely integrated into the core and easy to use for anyone that needs it.

We'd love to discuss some models and the evolution of the product to make it reach this stage.

I see Anatoly mentioning an "interface" being needed to access staked nodes and I truly believe mTransaction could end up being it.

KirillLykov commented 1 year ago

@aeyakovenko

For unstaked, there are several implications of the slow connection:

  1. for our benchmarking: the bench-tps client demonstrates lower TPS than usual
  2. RPC nodes might be affected, as @mschneider mentioned
  3. Clients will be encouraged to use some proxies to send transactions instead of using TpuClient directly

My understanding is: (1) can be addressed by using a staked identity with the client; (2) might be addressed by new solutions like the one mentioned by @janlegner; (3) means we will limit the TPS coming from unstaked clients even if the network is not experiencing high load (which may or may not be a desirable side effect).

For staked, after some discussions I personally have concerns (without any proof atm) about whether the implemented approach (stake-based congestion control on top of the quinn CC algorithm, varying the receive window ratio from 2 to 10 and limiting the number of streams per connection) is the optimal solution. Optimal in the sense of TPS; we might instead optimize other metrics, or it might be good enough for the moment.

aeyakovenko commented 1 year ago

@KirillLykov can we let validators tag their "trusted" RPCs and inherit the stake? Unstaked RPCs are like the public ports, they need rate limiting because it's free to abuse them.

KirillLykov commented 1 year ago

@KirillLykov can we let validators tag their "trusted" RPCs and inherit the stake? Unstaked RPCs are like the public ports, they need rate limiting because it's free to abuse them.

To me, this sounds reasonable. What do @mschneider, @janlegner, @Cerbah think? Sorry if I'm overcommunicating, I just want to be sure that everyone is on the same page.

ilya-bobyr commented 1 year ago

While it seems absolutely right to split users into different groups for DOS (and just overload) protection, I wonder if receive window size is the best tool for this job.

My understanding is that it restricts the amount of data that a client can have “in flight” for a given connection. For maximum performance, the window size should be equal to the speed of the connection multiplied by the round trip time[1]. Call it “connection network capacity”. When the window size is smaller, the speed is reduced. When the window size is half of the network capacity, the client is experiencing half the maximum possible speed for their connection.

[1] Quinn documentation even says it about the receive_window parameter:

This should be set to at least the expected connection latency multiplied by the maximum desired throughput. Larger values can be useful to allow maximum throughput within a stream while another is blocked. https://docs.rs/quinn/latest/quinn/struct.TransportConfig.html#method.receive_window

The above means that while window size does affect the amount of data sent (and received) over the network, it may affect users in a way that I would not call fair, due to the fact that the round trip time is part of the equation.

Take two clients with the same network speed, and the same window size. If the first has twice the round trip time, it will see half the speed. So, if you are closer to the validator, you get better service. It also means that increasing your connection speed also directly increases the amount of data a validator will process for you. Does not seem very fair, if you are in the same category as everyone else. But this could be arguable.
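
A small numeric illustration of this point, using the window ≈ bandwidth × RTT relationship from [1] rearranged as throughput ≈ window / RTT; the window size and RTT values are made up.

```rust
// Numeric illustration of the fairness argument: with a fixed receive window,
// achievable throughput is bounded by roughly one window per round trip,
// so doubling the RTT halves the ceiling. Numbers are illustrative only.
fn window_limited_throughput_bps(window_bytes: f64, rtt_s: f64) -> f64 {
    window_bytes * 8.0 / rtt_s
}

fn main() {
    let window_bytes = 12_320.0; // e.g. a 10 x 1232-byte window (assumption)
    for (name, rtt_ms) in [("near client", 20.0), ("far client", 40.0)] {
        let bps = window_limited_throughput_bps(window_bytes, rtt_ms / 1_000.0);
        println!("{name}: RTT {rtt_ms} ms -> at most {:.2} Mbit/s", bps / 1e6);
    }
}
```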

DOS and overload protection is a complex problem. And I have limited experience, but my understanding is that good solutions involve several components. First, one needs to have a QoS criterion. It could be a number of transactions processed and/or some parameter of a transaction processing time, such as the 95% quantile. Or something like that. Could be as simple as the byte count processed for a given user.

Next, users are split into categories – this is already done. But adding validated users seems like a good idea.

Then, a mechanism is used to prioritize user requests based on the QoS. One such mechanism could be dropping packets when a certain category needs to be slowed down. I'm still learning QUIC, but I am quite familiar with TCP congestion control. Dropping packets will trigger congestion control, which will start slowing down affected connections.

But, for TCP, it is also fine, and is actually much less disruptive, to just not read the received data. Congestion control for TCP reacts to the speed of consumption of the delivered data by the application. And if the application is busy, TCP will slow down the connection accordingly. Packet drops are generally only used by intermediate nodes – ones that do not have direct control over the application behavior. Or as a mechanism for a more aggressive intrusion.

I'm still reading the QUIC RFCs, but it does seem to use congestion control mechanisms similar to TCP's. Which makes me think that if a validator just stops reading data on a connection, the connection will be slowed down by congestion control in the best way possible.
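
A rough sketch of that idea: the read loop consults a QoS gate before pulling more data off a connection and simply pauses when the connection's category is over budget, so unread data sits in the transport's flow-control window and the sender is slowed down by the protocol itself. The traits below are hypothetical stand-ins for the real QUIC stream and accounting interfaces, not an existing API.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Stand-in for a QUIC receive stream (in the real validator this would be the
/// QUIC library's stream type); hypothetical interface, for illustration only.
trait RecvStreamLike {
    /// Returns the next chunk of data, or None when the stream is finished.
    fn read_chunk(&mut self) -> Option<Vec<u8>>;
}

/// Stand-in for the QoS accounting discussed above.
trait LoadGate {
    /// True when this connection's category is within budget and we may read more.
    fn may_read(&self, category: &str) -> bool;
    /// Record bytes actually consumed by the category.
    fn record(&self, category: &str, bytes: usize);
}

/// Read loop: when the category is over budget we simply stop reading for a while.
/// Unread data stays inside the transport's flow-control window, so the peer is
/// slowed down by the protocol itself rather than by explicit packet drops.
fn serve_stream(stream: &mut dyn RecvStreamLike, gate: &dyn LoadGate, category: &str) {
    loop {
        if !gate.may_read(category) {
            sleep(Duration::from_millis(5)); // back off; flow control does the rest
            continue;
        }
        match stream.read_chunk() {
            Some(chunk) => {
                gate.record(category, chunk.len());
                // ... hand the chunk over to packet/transaction processing here ...
            }
            None => break, // stream finished
        }
    }
}

fn main() {
    // Trivial in-memory stand-ins so the sketch runs end to end.
    struct OneChunk(Option<Vec<u8>>);
    impl RecvStreamLike for OneChunk {
        fn read_chunk(&mut self) -> Option<Vec<u8>> {
            self.0.take()
        }
    }
    struct AlwaysAllow;
    impl LoadGate for AlwaysAllow {
        fn may_read(&self, _category: &str) -> bool {
            true
        }
        fn record(&self, category: &str, bytes: usize) {
            println!("{category}: consumed {bytes} bytes");
        }
    }
    serve_stream(&mut OneChunk(Some(vec![0u8; 1232])), &AlwaysAllow, "unstaked");
}
```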

A validator would still need some kind of QoS metric to understand when it should start or stop reading the data. Maybe just measuring all bytes received over staked and unstaked connections could be a good first approximation. (Though, I wonder if CPU usage could be another important metric.) A validator will need to track the total amount of data received over a period of time by all connections in each category. And there would be some QoS guarantees, for example, 60% goes to validators, 30% goes to validated users and 10% goes to unvalidated users. When the total amount received across all categories exceeds a certain threshold, the validator will need to start pushing back on any category that is exceeding its quota.
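
A minimal sketch of this kind of accounting, assuming a per-interval byte budget and the 60/30/10 split mentioned above (all category names, shares and budgets are illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Sketch of per-category accounting: bytes received per category over an
/// accounting interval, compared against a configured share of a total budget.
struct CategoryQuotas {
    window: Duration,
    window_start: Instant,
    total_budget_bytes: u64, // bytes allowed per interval across all categories
    shares: HashMap<&'static str, f64>, // e.g. validators 0.6, validated 0.3, unvalidated 0.1
    used: HashMap<&'static str, u64>,
}

impl CategoryQuotas {
    fn new(total_budget_bytes: u64, window: Duration) -> Self {
        let shares =
            HashMap::from([("validators", 0.6), ("validated", 0.3), ("unvalidated", 0.1)]);
        Self {
            window,
            window_start: Instant::now(),
            total_budget_bytes,
            shares,
            used: HashMap::new(),
        }
    }

    /// Record received bytes and report whether the category is now over its quota.
    fn record(&mut self, category: &'static str, bytes: u64) -> bool {
        if self.window_start.elapsed() > self.window {
            self.used.clear(); // start a new accounting interval
            self.window_start = Instant::now();
        }
        let used = self.used.entry(category).or_insert(0);
        *used += bytes;
        let share = self.shares.get(category).copied().unwrap_or(0.0);
        *used as f64 > share * self.total_budget_bytes as f64
    }
}

fn main() {
    let mut quotas = CategoryQuotas::new(10_000_000, Duration::from_secs(1));
    // An unvalidated client bursting 2 MB in one interval exceeds its 10% share (1 MB).
    println!("over quota: {}", quotas.record("unvalidated", 2_000_000));
}
```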

Implementation-wise, there are some pretty nice algorithms, for example, credit systems, that are both efficient and well researched and tested, making it possible to push back proportionally to usage. This way, validators should not be affected if validated users are overloading the system – only those users will be affected, and proportionally to their own usage.
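
For illustration, a credit scheme of that kind can be as simple as a per-sender token bucket whose refill rate encodes the sender's entitlement (e.g. proportional to stake); when the bucket is empty the validator stops reading from that sender's connections. A minimal sketch, with assumed rates:

```rust
use std::time::{Duration, Instant};

/// Sketch of a credit (token bucket) scheme: each sender accrues credits at a rate
/// proportional to its weight (e.g. stake share) and spends them per byte received.
/// When credits run out, the validator pushes back on that sender only, for example
/// by pausing reads on its connections. All parameters are illustrative.
struct CreditBucket {
    credits: f64,        // available credits, in bytes
    max_credits: f64,    // burst allowance
    refill_per_sec: f64, // bytes per second this sender is entitled to
    last_refill: Instant,
}

impl CreditBucket {
    fn new(refill_per_sec: f64, burst: Duration) -> Self {
        let max_credits = refill_per_sec * burst.as_secs_f64();
        Self {
            credits: max_credits,
            max_credits,
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    /// Try to spend `bytes` credits; returns false when the sender should be slowed down.
    fn try_consume(&mut self, bytes: f64) -> bool {
        let now = Instant::now();
        self.credits = (self.credits
            + self.refill_per_sec * now.duration_since(self.last_refill).as_secs_f64())
        .min(self.max_credits);
        self.last_refill = now;
        if self.credits >= bytes {
            self.credits -= bytes;
            true
        } else {
            false
        }
    }
}

fn main() {
    // A sender entitled to ~1 MB/s with a 250 ms burst allowance (assumed numbers).
    let mut bucket = CreditBucket::new(1_000_000.0, Duration::from_millis(250));
    println!("first 200 kB accepted: {}", bucket.try_consume(200_000.0));
    println!("next 200 kB accepted:  {}", bucket.try_consume(200_000.0));
}
```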

This system would still be relatively simplistic as far as DOS systems go. In particular, it requires an explicit maximum limit to be specified. Also, if a validator load is not linear enough relative to the network load, it might not be able to protect users well enough.

KirillLykov commented 1 year ago

@ilya-bobyr thanks for the input. I think we need to invite some people working on quic to comment: @lijunwangs @pgarg66 @ryleung-solana

mschneider commented 1 year ago

@ilya-bobyr great input!

While it seems absolutely right to split users into different groups for DOS (and just overload) protection, I wonder if receive window size is the best tool for this job. ... The above means that while window size does affect the amount of data sent (and received) over the network, it may affect users in a way that I would not call fair, due to the fact that the round trip time is part of the equation.

Agree with the receive window discriminating by round trip; this will effectively create a stronger validator concentration: stake closer to the stake epicenter will be able to achieve higher income by selling its stake-guaranteed bandwidth, and hence achieve higher staking rewards than stake further away.

DOS and overload protection is a complex problem. And I have limited experience, but my understanding is that good solutions involve several components. First, one needs to have a QoS criterion. It could be a number of transactions processed and/or some parameter of a transaction processing time, such as the 95% quantile. Or something like that. Could be as simple as the byte count processed for a given user.

I would favor KISS; byte/packet count is probably good enough, given that we have a real fee-charging execution environment for the actual TXs. Deep introspection into the program execution stack can be very difficult to replicate upstream when implementing a secondary QoS level in a separate environment. Examples would be proxy services like m-transaction or special hardware, like a sig-verify ASIC.

I'm still reading the QUIC RFCs, but it does seem to use congestion control mechanisms similar to TCP's. Which makes me think that if a validator just stops reading data on a connection, the connection will be slowed down by congestion control in the best way possible.

This seems preferable to the current receive window solution. I'm also not sure that the receive window "slow-down" can't be gamed by hacking at the protocol level and simply trying to keep more packets in flight on a higher-latency connection.

Implementation-wise, there are some pretty nice algorithms, for example, credit systems, that are both efficient and well researched and tested, making it possible to push back proportionally to usage. This way, validators should not be affected if validated users are overloading the system – only those users will be affected, and proportionally to their own usage.

I would favor a solution that does not bundle different users into a handful of categories but actually tracks the granularity of stake owners, possibly extended to users validated by stake owners. It makes the system resilient to scenarios where an individual (stake whale) can try to overload the QoS system exactly when a lot of profit can be made. We have seen this behavior multiple times on every resource that can be hogged, and we should simply design with this in mind.

From a bandwidth consumer's perspective, having a clear relationship between stake and bandwidth would be ideal:

  1. If I have clarity about my own load, I can acquire exactly the amount of stake needed to ensure the required level of bandwidth at the best possible latency.
  2. If I know how much bandwidth my stake guarantees, I can properly tune the parameters on a shared forwarding service and handle QoS myself, without ever experiencing back-pressure from the leader.
ilya-bobyr commented 1 year ago

@ilya-bobyr great input!

Thank you!

Implementation-wise, there are some pretty nice algorithms, for example, credit systems, that are both efficient and well researched and tested, making it possible to push back proportionally to usage. This way, validators should not be affected if validated users are overloading the system – only those users will be affected, and proportionally to their own usage.

I would favor a solution that does not bundle different users into a handful of categories but actually tracks the granularity of stake owners, possibly extended to users validated by stake owners. It makes the system resilient to scenarios where an individual (stake whale) can try to overload the QoS system exactly when a lot of profit can be made. We have seen this behavior multiple times on every resource that can be hogged, and we should simply design with this in mind.

I meant categories at the top level of client separation. There has to be some decision on how the bandwidth should be split between unstaked and staked clients. Within a given category, such as between all staked clients, it is possible to have uneven weights based on the stake amount. At the same time, all unstaked clients are probably treated equally.

I also wonder if there should be three top-level categories: validators, non-validator clients with stake allocated to a predetermined program, and non-validator clients with no stake.

From a bandwidth consumer's perspective, having a clear relationship between stake and bandwidth would be ideal:

  1. If I have clarity about my own load, I can acquire exactly the amount of stake needed to ensure the required level of bandwidth at the best possible latency.

  2. If I know how much bandwidth my stake guarantees, I can properly tune the parameters on a shared forwarding service and handle QoS myself, without ever experiencing back-pressure from the leader.

It would indeed be great to have these kinds of properties. But I wonder if it is that easy to calculate, considering that other participants come and go and change their stake size as well? Could be a good first step, nevertheless :) Back-pressure might still be desirable for a robust system. In particular, it might be possible to use more than the guaranteed minimum if the bandwidth is available.

In any case, if bandwidth is a good QoS metric, I think this kind of system could improve the client experience.

mschneider commented 1 year ago

I would assume that in general delegation should only allow one to slice one's own piece, not to get more pieces. So there's only staked (including delegated) bandwidth and unstaked bandwidth. I personally don't even think that we need to include any unstaked bandwidth; the network seems to be decentralized enough that most unstaked participants could switch to staked relays on very short notice. But if that is not possible, agreed that having a top-level split is a good solution.

Bandwidth allocation could be updated at epoch boundary (or on any other schedule longer than a few blocks), without restricting users too much. Most bandwidth consumers should have a way to respond to account changes within a few seconds.

Over-subscription in low-load periods and back-pressure down to the individually guaranteed minimum would be ideal. Bandwidth in terms of number of transactions or bytes both work for me, with a slight preference for transactions per block. That seems like the unit of accounting that all participants would be most likely to agree on.
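
As a sketch of what such an epoch-boundary allocation could look like, assuming a per-block transaction budget and stake-proportional guaranteed shares (all numbers below are made up, not protocol constants):

```rust
/// Each stake owner gets a guaranteed share of a block's transaction budget
/// proportional to its stake; unused capacity could be over-subscribed in
/// low-load periods, with back-pressure down to this guaranteed minimum.
fn guaranteed_txs_per_block(stake: u64, total_stake: u64, block_tx_budget: u64) -> u64 {
    if total_stake == 0 {
        return 0;
    }
    (block_tx_budget as u128 * stake as u128 / total_stake as u128) as u64
}

fn main() {
    let block_tx_budget = 20_000u64; // assumed per-block transaction budget
    let total_stake = 400_000_000u64; // assumed total stake
    let stakes = [("whale", 50_000_000u64), ("mid", 5_000_000), ("small", 100_000)];

    for (name, stake) in stakes {
        println!(
            "{name:6} stake {stake:>11} -> guaranteed {:>5} txs/block",
            guaranteed_txs_per_block(stake, total_stake, block_tx_budget)
        );
    }
}
```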

ilya-bobyr commented 7 months ago

I think this issue still exists.