katzenpost / mixnet_uprising

repository for tracking open tasks

Figure out a "good" congestion control strategy. #33

Open Yawning opened 6 years ago

Yawning commented 6 years ago

Right now, all the inter-node traffic is done over TCP/IP links, which ensures that the network is internet friendly. However, within the network at a higher level, there is no congestion control beyond mixes and providers dropping packets to keep internal queue latency down. On the client side, mailproxy's send scheduler will by default send a packet once every 10 sec (approximately, with intentional jitter of +- 5 sec). While that is a rather conservative choice, the parameter is tunable and there is nothing to stop misbehaving clients from sending faster (see: katzenpost/katzenpost#159).
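As a rough sketch of that default send scheduling (the constants mirror the numbers above, but the names and code are illustrative, not the actual mailproxy implementation):

```python
import random

# Defaults from the thread: mailproxy sends one packet roughly every
# 10 seconds, with +/- 5 seconds of intentional jitter.
BASE_INTERVAL = 10.0  # seconds
JITTER = 5.0          # seconds

def next_send_delay(rng=random):
    """Return the delay, in seconds, before the next packet send."""
    return BASE_INTERVAL + rng.uniform(-JITTER, JITTER)
```

A misbehaving client simply shrinks `BASE_INTERVAL`, which is why server-side enforcement comes up later in the thread.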

This could possibly be improved for better network health, though how to do so is not immediately obvious. The main issue with using loss as a congestion signal in this network is that it can take a considerable amount of time (dependent on lambda) for the originator of a lost packet to detect the loss, and, due to the per-packet path selection, the relevance of any given loss incident to subsequent traffic is questionable.

To put it in simpler terms: backing off based on a loss that happened minutes, if not hours, ago on a part of the network that may not even be utilized will only lead to poor network utilization.

Yawning commented 6 years ago

The closest equivalent to the per-packet congestion avoidance problem is probably something like low data volume UDP with insufficient return traffic to maintain an RTT estimate. While pedants will point out that the mixnet has an RTT estimate per packet, said estimate is almost entirely dependent on lambda rather than on the status of the actual network.
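A toy simulation makes the point concrete: with per-hop mixing delays drawn from Exp(lambda), the RTT is dominated by the lambda-driven mixing delay, so it tells you almost nothing about wire-level congestion. All parameter values below are illustrative, not Katzenpost's actual defaults:

```python
import random

def sample_rtt(hops=3, lam=0.2, network_rtt=0.05, rng=None):
    """Sample one mixnet round-trip time, in seconds.

    Each of the 2*hops mixes on the forward and reply paths holds the
    packet for an Exp(lam) delay (mean 1/lam seconds), so the expected
    RTT is ~ 2*hops/lam, swamping the wire-level network_rtt.
    """
    rng = rng or random.Random()
    mixing = sum(rng.expovariate(lam) for _ in range(2 * hops))
    return mixing + network_rtt
```

With the defaults above, the mean mixing delay is 2*3/0.2 = 30 s against 50 ms of actual network RTT, i.e. the "RTT estimate" is ~99.8% lambda.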

See: https://tools.ietf.org/html/rfc8085#section-3.1.3

willscott commented 6 years ago

You basically suggest what seems like the logical answer to this problem in https://github.com/katzenpost/katzenpost/issues/159

The entry mix should be able to maintain a much more temporally relevant estimate of congestion/utilization across the overall traffic of the mix, and communicate that back to clients.

Rather than just a TCP-style window, this seems to suggest that the server should communicate a suggested delay before the next transmission, and should be able to discard excessive traffic from noncompliant clients.
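A minimal sketch of that feedback loop, assuming the entry mix scales its advised delay with queue utilization and drops packets from clients that ignore the advice. Every name, constant, and scaling rule here is invented for illustration:

```python
from collections import deque

class EntryMixPacer:
    """Hypothetical entry-mix pacer: advises each client of a delay
    before its next transmission, scaled by current queue utilization,
    and discards traffic from clients that ignore the advice."""

    def __init__(self, base_delay=10.0, capacity=1000):
        self.base_delay = base_delay
        self.capacity = capacity
        self.queue = deque()
        self.advice = {}  # client_id -> (last_accept_time, advised_delay)

    def suggested_delay(self):
        # Scale linearly from 1x (empty queue) to 5x (full queue).
        utilization = len(self.queue) / self.capacity
        return self.base_delay * (1.0 + 4.0 * utilization)

    def on_packet(self, client_id, now):
        prev = self.advice.get(client_id)
        if prev is not None and now - prev[0] < prev[1]:
            return None  # noncompliant client: discard the packet
        self.queue.append((client_id, now))
        delay = self.suggested_delay()
        self.advice[client_id] = (now, delay)
        return delay
```

The returned delay would ride back to the client on the existing reply channel; the scheme only paces the client-to-entry-mix link, which is the limitation Yawning raises below.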

david415 commented 6 years ago

here's my thoughts on this:

david415 commented 6 years ago

perhaps after some future research, we could do weird things like use cryptographic receipt exchanges between mixes to determine congestion. the miranda paper uses them to defend against n-1 attacks, but i think the idea could easily apply to congestion control. actually, many years ago @daira wrote a similar paper called "A Reputation System to Increase MIX-Net Reliability".

[MIRANDA] Leibowitz, H., Piotrowska, A., Danezis, G., Herzberg, A., 2017, "No Right to Remain Silent: Isolating Malicious Mixes" https://eprint.iacr.org/2017/1000.pdf.

[MIXRELIABLE] Dingledine, R., Freedman, M., Hopwood, D., Molnar, D., 2001 "A Reputation System to Increase MIX-Net Reliability" In Information Hiding, 4th International Workshop https://www.freehaven.net/anonbib/cache/mix-acc.pdf.

Yawning commented 6 years ago

here's my thoughts on this:

I'm... unconvinced.

daira commented 6 years ago

BTW on rereading my old paper, I think we didn't take this attack (documented at the end of section 4.1) seriously enough:

Note that an adversary may be able to force negative ratings on a MIX, while goal 2 [Reject false claims] still holds: if he floods that MIX’s incoming bandwidth (either directly or via witnesses), the MIX will no longer be able to sustain the load. However, this is exactly the point where the MIX demonstrates that it is unreliable. Causing MIXes to lose reputation in the face of successful flooding is consistent with our scoring system goals: the scoring system measures reliability and capabilities, not intent.

The loose global timing synchronization assumption is also really ugly, and I still don't know how to do better.

david415 commented 6 years ago

other open questions:

Yawning commented 6 years ago

can we use source quench/explicit congestion notification to get congestion information to clients faster than waiting for round trip timeouts?

Only between the client and provider. Interior nodes do not have an idea of where to send such notifications to for reasons that should be obvious, and I would be fairly solidly against including a SURB per hop in forward traffic.

if we don't do exponential backoff then what does it take to cause congestion collapse?

ITYM "mitigate congestion collapse". The solutions that seem "viable" to me are (in decreasing order of preference):

I'm also far from convinced that each client should be constantly sampling lambdaP and always sending cover traffic when idle, because that wastes a gigantic amount of bandwidth when everyone does it. How to balance "not burning bandwidth like crazy" against "it's stupid to only send cover traffic when you're sending messages" is, to me, an open research question.
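To put rough numbers on that bandwidth cost, a back-of-the-envelope sketch (packet size, rates, and client counts are illustrative, not Katzenpost's actual values):

```python
def cover_bandwidth_bps(num_clients, lambda_p, packet_bytes=2048):
    """Aggregate cover-traffic load, in bits per second, if every idle
    client emits decoy packets at rate lambda_p (packets/second)."""
    return num_clients * lambda_p * packet_bytes * 8

# e.g. 100,000 idle clients each sending one 2 KiB decoy every 10 s
# (lambda_p = 0.1) adds roughly 164 Mb/s of sustained load.
```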

how do we adjust the lambda-p value for exponential backoff in response to congestion notification?

You don't. It's a network-wide parameter. The authority can increase or reduce it based on anonymity requirements, or set it to something super conservative if the network is under sustained overload, but the opportunity to adjust the value comes only once every 3 hours, firmly cementing it as a coarse and blunt instrument.

david415 commented 6 years ago
  1. I disagree with the first point: "Interior nodes do not have an idea of where to send such notifications to for reasons that should be obvious."

Interior nodes send congestion information to the next outer layer of nodes until it reaches all Providers and finally all clients.

  2. "ITYM 'mitigate congestion collapse'." Wrong again. I meant that I would like to find out how to cause congestion collapse. I want to know what kind of usage statistics cause congestion collapse, so that an over-provisioned network knows how much longer it has to live at the given growth rate.

  3. I am unconvinced that epoch key rotation (PKI updates) must be set at 3 hours and therefore barred from being a useful mechanism for congestion control.

Yawning commented 6 years ago

Interior nodes send congestion information to the next outer layer of nodes until it reaches all Providers and finally all clients.

That feels relatively worthless. Figure out how much bandwidth amplification this sort of scheme will entail. And if one node has a blip of congestion, so what? What possible useful action can you take? It's stupid to alter the transmit scheduling just because a single link has experienced a congestion event, because that will massively under-utilize network capacity...

"ITYM 'mitigate congestion collapse'." Wrong again. I meant that I would like to find out how to cause congestion collapse. I want to know what kind of usage statistics cause congestion collapse, so that an over-provisioned network knows how much longer it has to live at the given growth rate.

I think the load shedding is too aggressive and the transmit/retransmit timer granularity is too coarse-grained for this to be a practical consideration. This will be even more of a non-issue once the server code starts throttling outgoing sends on a per-user basis as well.

"Kind of bad things happen if the network is undersized relative to the number of users. For that, lambda and lambdaP adjustment is sufficient to adjust load till new servers can be added."

I am unconvinced that epoch key rotation (PKI updates) must be set at 3 hours and therefore barred from being a useful mechanism for congestion control.

You're missing the point.

Any realistic epoch interval (even, say, a few minutes, though that would be considerably shorter than what I'm comfortable with) is entirely too coarse-grained to react to dynamic network conditions, because there will still be far too much delay. By the time any sort of lambda adjustment propagates to clients (which will take a while), all the loss that's going to happen has happened, and the particular link is likely back to normal.

Yawning commented 6 years ago

With something like my autoratelimit server branch (https://github.com/katzenpost/server/tree/autoratelimit), people that try to inject traffic too fast into the network will be forced to use multiple accounts.

The rate limiter is PKI-parameter based (it primarily enforces the SendShift parameter), so there is room for dynamic tuning, but, like all other PKI parameters, adjustments happen once every epoch.
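A minimal sketch of that kind of per-client enforcement: accept a packet only if a minimum interval (a SendShift-style PKI parameter) has elapsed since the last accepted one. The class and method names are invented for illustration; the actual autoratelimit branch may work differently:

```python
class ClientRateLimiter:
    """Hypothetical per-client limiter: enforce a minimum interval
    between accepted packets, dropping anything that arrives early."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between accepted sends
        self.next_allowed = 0.0

    def allow(self, now):
        if now < self.next_allowed:
            return False  # too fast: drop, forcing the client to slow down
        self.next_allowed = now + self.min_interval
        return True
```

Since the limit is enforced per account, a client that wants to inject faster has to spread its traffic over multiple accounts, as noted above.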