private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License

Disable connection-level flow control for HTTP2 streams #1104

Closed akoshelev closed 3 months ago

akoshelev commented 3 months ago

Connection-level flow control was diagnosed as the root cause of #1085. Here are the reasons why we think it is safe to turn it off:

I had a chat with @andyleiserson, and we think it will ultimately be beneficial to rely on only one mechanism to provide flow control, and given the nature of IPA, it is better to have full control there. This means that the application-level mechanism will be our preferred way to establish congestion control, and turning off transport-layer flow control fits into our long-term view for IPA.
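
For illustration, here is a minimal sketch of what "turning off" HTTP/2 connection-level flow control can look like with hyper 0.14's client builder. The builder calls and window sizes below are assumptions for the example, not necessarily what this PR ships:

```rust
use hyper::Client;

// RFC 9113 caps flow-control windows at 2^31 - 1 bytes.
const MAX_WINDOW: u32 = (1u32 << 31) - 1;

// Sketch only: raise the connection-level window to the protocol maximum so it
// never becomes the bottleneck, leaving stream-level and application-level
// mechanisms as the only limits on in-flight data.
fn build_client() -> Client<hyper::client::HttpConnector> {
    Client::builder()
        .http2_only(true)
        // Per-stream window kept at a value the application can reason about
        // (1 MiB here, purely illustrative).
        .http2_initial_stream_window_size(Some(1u32 << 20))
        // Connection-level window set to the maximum, effectively disabling
        // connection-level flow control.
        .http2_initial_connection_window_size(Some(MAX_WINDOW))
        .build_http()
}
```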

This change also reverts the active window back to 1024. I am concerned that the attribution phase may run slower with 32, and if PRF can work with 1024, even at the expense of flushing very large buffers down to the network, we can live with it for now.

As a next step, I want to implement variable-sized send buffers to maintain the guarantee from our infrastructure that fixed-size chunks are flushed down to the network layer.
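
As a rough illustration of that guarantee, a buffer along these lines could accumulate application bytes and only release full, fixed-size chunks to the network layer (the names and chunk size are hypothetical, not the actual IPA types):

```rust
// Illustrative chunk size; the real value would be tuned per protocol step.
const CHUNK_SIZE: usize = 4096;

/// Sketch of a send buffer that queues variable-sized writes but only ever
/// hands fixed-size chunks to the network layer.
struct SendBuffer {
    buf: Vec<u8>,
}

impl SendBuffer {
    fn new() -> Self {
        Self { buf: Vec::with_capacity(2 * CHUNK_SIZE) }
    }

    /// Queue application data; nothing is flushed until a full chunk exists.
    fn push(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
    }

    /// Return the next fixed-size chunk for the network layer, if available.
    fn next_chunk(&mut self) -> Option<Vec<u8>> {
        if self.buf.len() >= CHUNK_SIZE {
            // Split off the remainder and hand back exactly CHUNK_SIZE bytes.
            let rest = self.buf.split_off(CHUNK_SIZE);
            Some(std::mem::replace(&mut self.buf, rest))
        } else {
            None
        }
    }
}
```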

Testing

Local run for 50k records left overnight. The draft run succeeded.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.94%. Comparing base (d2d871c) to head (a362d66).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
- Coverage   90.94%   90.94%   -0.01%
==========================================
  Files         184      185       +1
  Lines       26076    26097      +21
==========================================
+ Hits        23715    23734      +19
- Misses       2361     2363       +2
```

:umbrella: View full report in Codecov by Sentry.

akoshelev commented 3 months ago

> But it doesn't really matter -- either way, we should keep it in mind for future tuning, and I'm pretty sure that disabling buffering for vectorized channels will make a bigger difference than 32 vs. 1024.

Yeah, I am working on the change that fixes the size of the chunks that we flush down to the network layer. I did some profiling, and 32 is a bit too low for shuffle: I am seeing 500-byte chunks sent over and over, so it does seem to take longer to finish. Maybe it is not important for the protocol, but I think (hope) fixed-size chunks will get vectorized channels to the right place in terms of performance.

martinthomson commented 2 months ago

I just want to note that the summary here includes a couple of notes that the folks working on HTTP/2 (and HTTP/3) would be very interested to hear. It's important to understand that connection-level flow control exists to protect endpoints from certain abuse scenarios, as noted.

However, TCP flow control is not an effective substitute because of how HTTP/2 works. There is also no similar defense for HTTP/3.

What we're saying here is that we believe this to be problematic. That suggests implementation issues, like not growing credit properly, more than it says that connection-level flow control is fundamentally undesirable.

akoshelev commented 2 months ago

Fair point @martinthomson, I don't see any fundamental issues with HTTP/2 connection-level and/or stream-level flow control; both exist for a good reason. I do agree that the issue of not scaling the window is likely due to the generic HTTP client implementation that we use in IPA, which does not take advantage of HTTP/2 specifics and just stalls when the default connection window is exhausted.

https://github.com/hyperium/hyper/issues/1960 seems to hint that this may be the case.
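
If the stall really is the client never growing its receive windows, one possible mitigation (a sketch assuming hyper 0.14, not verified against IPA's actual client setup) is to opt into hyper's adaptive window sizing, or to start from larger fixed windows than the 64 KiB defaults:

```rust
use hyper::Client;

// Sketch only: two ways to avoid stalling on the default HTTP/2 windows.
fn build_client() -> Client<hyper::client::HttpConnector> {
    Client::builder()
        .http2_only(true)
        // Option 1: let hyper grow windows based on observed bandwidth-delay.
        .http2_adaptive_window(true)
        // Option 2: start from larger fixed windows instead (sizes illustrative).
        // .http2_initial_stream_window_size(Some(4u32 * 1024 * 1024))
        // .http2_initial_connection_window_size(Some(16u32 * 1024 * 1024))
        .build_http()
}
```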

#1107 mentions migrating to reqwest, which may be a more sophisticated HTTP/2 client implementation; we can check that.

Even with optimal connection-level flow control implemented in the network layer, I think the ultimate solution will still require work on the application side. If we are getting close to saturation, we need to prioritize streams that belong to "early"/lower steps in MPC, and connection-level flow control may not help us here.

I am thinking about TCP windowing as the last layer of defense, where you need to do more damage control than actual work. Our application gets more or less consistent throughput regardless of the input size and can be fine-tuned to avoid hitting it.

martinthomson commented 2 months ago

Hah, it's ROC. I'm a little surprised that remains unfixed, but not hugely surprised. We only recently added dynamic window scaling to Firefox. It turns out that you can go a long time without really needing that.