hyperium / h2

HTTP 2.0 client & server implementation for Rust.

fix: streams awaiting capacity lockout #730

Closed: dswij closed this 8 months ago

dswij commented 9 months ago

Fixes https://github.com/hyperium/hyper/issues/3338

This PR changes the assign-capacity queue to prioritize streams that are send-ready.

This is necessary to prevent a lockout where streams waiting for connection capacity are unable to proceed because no capacity ever becomes available to them.

seanmonstar commented 9 months ago

Thank you so so much! I'm wondering, would it be complicated to make a unit test that exercises this? Or maybe one of the reporters would like to try the patch and confirm the problem is gone.

dswij commented 9 months ago

I tested it on the minimal repro sample provided in the issue, and it seems to be working fine. It also fixes this issue:

for less than 300 concurrent requests, say 250, a few initial requests are done quickly (5-8, similar to above). Then it pauses for a few seconds, then it catches up and finishes all the remaining ones

https://github.com/hyperium/hyper/issues/3338#issuecomment-1817521592


would it be complicated to make a unit test that exercises this?

I'm currently trying to see if we can make this happen, but haven't had the time yet. Hence the draft status.

Also, is there any benchmark for hyper at the moment? @seanmonstar

seanmonstar commented 9 months ago

That's fantastic! For benchmarks, the end_to_end ones are likely best. You can use a `[patch]` override in hyper's Cargo.toml to point its h2 dependency at this branch, and then run `cargo bench --features full --bench end_to_end http2` to see all the HTTP/2 ones.

jfourie1 commented 9 months ago

Can this change result in a situation where new streams that are not send-ready never get a chance to become established, even if connection capacity is available? If there are send-ready streams that continuously consume all the send flow capacity, they will always be moved to the front of the pending_capacity queue. I guess this is a tradeoff between per-stream throughput and the number of open streams.

Update: Looking at this in more detail, I don't think the scenario in the question above can happen. It is still not clear to me what the root cause of the problem is. Do we ever attempt to assign capacity to streams that are not open yet? I'm curious whether adding a check for is_send_ready() before assigning capacity in assign_connection_capacity() would prevent the original problem from happening.

dswij commented 9 months ago

It is still not clear to me what the root cause of the problem is.

What's happening is that the prioritization logic in h2 allows the user to declare more streams than the peer's max concurrent stream limit, and h2 only opens them when it becomes possible. Currently, however, the extra streams are put into the streams-awaiting-capacity queue.

So here's an example:

  1. The user asks h2 to send 100 streams, but the server only accepts 1 concurrent stream.
  2. h2 assigns all connection capacity to StreamId(1); all the other 99 streams are put into the awaiting-capacity queue.
  3. After getting some data from StreamId(1), the server sends multiple WINDOW_UPDATE frames, but in small increments. We assign that capacity to StreamId(3), then push it back to the end of the queue.
  4. This repeats until all the other streams have some capacity, but they are all waiting for the lower stream IDs to open.
  5. Connection capacity ends up assigned to streams that cannot use it, and progress stalls (a toy model of this is sketched below).
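
To make the failure mode concrete, here is a toy model of the two assignment strategies. This is not h2's actual Prioritize code; the Stream struct, the queue handling, and the increment sizes are made up purely for illustration. With naive round-robin assignment, small connection-level WINDOW_UPDATE increments get smeared across streams that cannot open yet; preferring send-ready streams lets StreamId(1) actually use the capacity.

```rust
// Toy model of the queue behaviour described above. This is NOT h2's actual
// Prioritize code; the types and numbers are invented for illustration only.
use std::collections::VecDeque;

#[derive(Debug)]
struct Stream {
    id: u32,
    send_ready: bool, // only streams the peer has allowed to open can send
    capacity: u32,
}

/// Naive strategy: pop the front of the queue, hand it the increment, push it
/// to the back. Capacity ends up parked on streams that cannot send yet.
fn assign_round_robin(queue: &mut VecDeque<Stream>, increment: u32) {
    if let Some(mut stream) = queue.pop_front() {
        stream.capacity += increment;
        queue.push_back(stream);
    }
}

/// Patched strategy (in spirit): prefer a send-ready stream, so the capacity
/// can actually be turned into DATA frames and the connection makes progress.
fn assign_prefer_send_ready(queue: &mut VecDeque<Stream>, increment: u32) {
    if let Some(pos) = queue.iter().position(|s| s.send_ready) {
        queue[pos].capacity += increment;
    } else if let Some(front) = queue.front_mut() {
        front.capacity += increment;
    }
}

fn main() {
    // The server allows 1 concurrent stream; the user queued 5 requests,
    // so only StreamId(1) is send-ready.
    let make_queue = || {
        (0u32..5)
            .map(|i| Stream { id: 1 + 2 * i, send_ready: i == 0, capacity: 0 })
            .collect::<VecDeque<_>>()
    };

    // Ten small WINDOW_UPDATE increments, round-robin: most of the capacity is
    // smeared over StreamId(3), (5), (7), (9), which cannot open until
    // StreamId(1) finishes.
    let mut naive = make_queue();
    (0..10).for_each(|_| assign_round_robin(&mut naive, 16));
    println!("round-robin:       {naive:?}");

    // Same increments, preferring send-ready streams: StreamId(1) gets all of
    // them, can finish, and frees a concurrency slot for the next stream.
    let mut patched = make_queue();
    (0..10).for_each(|_| assign_prefer_send_ready(&mut patched, 16));
    println!("prefer send-ready: {patched:?}");
}
```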

I've added a test to demonstrate an extreme case of this: e13adae

While 7dec3c2 does prevent a lockout, f1f99e0 is a more complete fix, @seanmonstar.

I haven't had the chance to benchmark it formally (the hyper benchmark is broken on my machine somehow), but against the Go server it seems fine.

jfourie1 commented 8 months ago

Thanks for the explanation. I think the changes in https://github.com/hyperium/h2/pull/730/commits/f1f99e0cad5a0937c383e250f19b914a131560bc are good.

seanmonstar commented 8 months ago

I can run the benches. I realized the http2 chunked benchmarks could be re-enabled; once that's done, I'll compare them.

seanmonstar commented 8 months ago

Alright, here are the benchmarks, run 3 times each:

master

test http2_consecutive_x1_empty                             ... bench:      19,466 ns/iter (+/- 751)
test http2_consecutive_x1_req_100kb                         ... bench:      61,085 ns/iter (+/- 7,731) = 1676 MB/s
test http2_consecutive_x1_req_10b                           ... bench:      23,145 ns/iter (+/- 2,280)
test http2_parallel_x10_empty                               ... bench:      80,085 ns/iter (+/- 1,178)
test http2_parallel_x10_req_10kb_100_chunks                 ... bench:  21,561,698 ns/iter (+/- 33,079,216) = 474 MB/s
test http2_parallel_x10_req_10kb_100_chunks_adaptive_window ... bench:  29,334,189 ns/iter (+/- 17,471,509) = 349 MB/s
test http2_parallel_x10_req_10kb_100_chunks_max_window      ... bench:   4,536,371 ns/iter (+/- 442,932) = 2257 MB/s
test http2_parallel_x10_req_10mb                            ... bench:  28,362,346 ns/iter (+/- 1,293,771) = 3697 MB/s
test http2_parallel_x10_res_10mb                            ... bench:  63,976,528 ns/iter (+/- 10,896,099) = 1639 MB/s
test http2_parallel_x10_res_1mb                             ... bench:   2,977,858 ns/iter (+/- 175,547) = 3521 MB/s
test http2_consecutive_x1_empty                             ... bench:      19,250 ns/iter (+/- 835)
test http2_consecutive_x1_req_100kb                         ... bench:      65,900 ns/iter (+/- 7,830) = 1553 MB/s
test http2_consecutive_x1_req_10b                           ... bench:      23,125 ns/iter (+/- 1,029)
test http2_parallel_x10_empty                               ... bench:      80,587 ns/iter (+/- 2,271)
test http2_parallel_x10_req_10kb_100_chunks                 ... bench:  21,242,009 ns/iter (+/- 33,134,735) = 482 MB/s
test http2_parallel_x10_req_10kb_100_chunks_adaptive_window ... bench:   4,728,061 ns/iter (+/- 739,811) = 2165 MB/s
test http2_parallel_x10_req_10kb_100_chunks_max_window      ... bench:   4,818,920 ns/iter (+/- 312,953) = 2124 MB/s
test http2_parallel_x10_req_10mb                            ... bench:  28,621,065 ns/iter (+/- 1,045,695) = 3663 MB/s
test http2_parallel_x10_res_10mb                            ... bench:  65,335,503 ns/iter (+/- 13,042,825) = 1604 MB/s
test http2_parallel_x10_res_1mb                             ... bench:   2,986,464 ns/iter (+/- 182,535) = 3511 MB/s
test http2_consecutive_x1_empty                             ... bench:      19,958 ns/iter (+/- 615)
test http2_consecutive_x1_req_100kb                         ... bench:      58,819 ns/iter (+/- 3,983) = 1740 MB/s
test http2_consecutive_x1_req_10b                           ... bench:      23,026 ns/iter (+/- 3,322)
test http2_parallel_x10_empty                               ... bench:      80,913 ns/iter (+/- 4,447)
test http2_parallel_x10_req_10kb_100_chunks                 ... bench:  21,632,048 ns/iter (+/- 32,801,356) = 473 MB/s
test http2_parallel_x10_req_10kb_100_chunks_adaptive_window ... bench:  21,337,511 ns/iter (+/- 16,579,399) = 479 MB/s
test http2_parallel_x10_req_10kb_100_chunks_max_window      ... bench:   4,460,987 ns/iter (+/- 107,547) = 2295 MB/s
test http2_parallel_x10_req_10mb                            ... bench:  28,389,317 ns/iter (+/- 1,629,586) = 3693 MB/s
test http2_parallel_x10_res_10mb                            ... bench:  64,092,171 ns/iter (+/- 10,006,727) = 1636 MB/s
test http2_parallel_x10_res_1mb                             ... bench:   2,887,548 ns/iter (+/- 284,703) = 3631 MB/s

patch

test http2_consecutive_x1_empty                             ... bench:      19,344 ns/iter (+/- 1,090)
test http2_consecutive_x1_req_100kb                         ... bench:      62,291 ns/iter (+/- 4,330) = 1643 MB/s
test http2_consecutive_x1_req_10b                           ... bench:      22,989 ns/iter (+/- 1,305)
test http2_parallel_x10_empty                               ... bench:      79,106 ns/iter (+/- 1,078)
test http2_parallel_x10_req_10kb_100_chunks                 ... bench:  22,254,082 ns/iter (+/- 33,339,072) = 460 MB/s
test http2_parallel_x10_req_10kb_100_chunks_adaptive_window ... bench:  13,325,490 ns/iter (+/- 16,811,166) = 768 MB/s
test http2_parallel_x10_req_10kb_100_chunks_max_window      ... bench:   4,439,135 ns/iter (+/- 137,301) = 2306 MB/s
test http2_parallel_x10_req_10mb                            ... bench:  28,679,929 ns/iter (+/- 1,500,517) = 3656 MB/s
test http2_parallel_x10_res_10mb                            ... bench:  65,187,943 ns/iter (+/- 10,425,661) = 1608 MB/s
test http2_parallel_x10_res_1mb                             ... bench:   3,064,030 ns/iter (+/- 562,964) = 3422 MB/s
test http2_consecutive_x1_empty                             ... bench:      19,080 ns/iter (+/- 2,813)
test http2_consecutive_x1_req_100kb                         ... bench:      64,433 ns/iter (+/- 8,307) = 1589 MB/s
test http2_consecutive_x1_req_10b                           ... bench:      22,021 ns/iter (+/- 1,723)
test http2_parallel_x10_empty                               ... bench:      79,386 ns/iter (+/- 9,277)
test http2_parallel_x10_req_10kb_100_chunks                 ... bench:  21,791,421 ns/iter (+/- 33,479,730) = 469 MB/s
test http2_parallel_x10_req_10kb_100_chunks_adaptive_window ... bench:   4,766,794 ns/iter (+/- 339,291) = 2148 MB/s
test http2_parallel_x10_req_10kb_100_chunks_max_window      ... bench:   4,768,157 ns/iter (+/- 315,048) = 2147 MB/s
test http2_parallel_x10_req_10mb                            ... bench:  29,218,333 ns/iter (+/- 2,002,628) = 3588 MB/s
test http2_parallel_x10_res_10mb                            ... bench:  64,073,564 ns/iter (+/- 10,601,740) = 1636 MB/s
test http2_parallel_x10_res_1mb                             ... bench:   2,940,898 ns/iter (+/- 148,619) = 3565 MB/s
test http2_consecutive_x1_empty                             ... bench:      22,374 ns/iter (+/- 1,478)
test http2_consecutive_x1_req_100kb                         ... bench:      81,660 ns/iter (+/- 17,238) = 1253 MB/s
test http2_consecutive_x1_req_10b                           ... bench:      26,591 ns/iter (+/- 761)
test http2_parallel_x10_empty                               ... bench:      81,684 ns/iter (+/- 4,365)
test http2_parallel_x10_req_10kb_100_chunks                 ... bench:  22,456,080 ns/iter (+/- 32,524,006) = 456 MB/s
test http2_parallel_x10_req_10kb_100_chunks_adaptive_window ... bench:  21,630,419 ns/iter (+/- 17,674,624) = 473 MB/s
test http2_parallel_x10_req_10kb_100_chunks_max_window      ... bench:   4,471,232 ns/iter (+/- 248,682) = 2290 MB/s
test http2_parallel_x10_req_10mb                            ... bench:  28,413,500 ns/iter (+/- 2,303,337) = 3690 MB/s
test http2_parallel_x10_res_10mb                            ... bench:  67,332,789 ns/iter (+/- 14,450,978) = 1557 MB/s
test http2_parallel_x10_res_1mb                             ... bench:   3,005,929 ns/iter (+/- 521,638) = 3488 MB/s

seanmonstar commented 8 months ago

Looking through the results, the differences look like just noise. Maybe there's a workload that matters that the benchmarks don't catch, but 🤷 . If you concur, we can merge this!

jeromegn commented 8 months ago

I didn't work on this, but it does look like noise to me. Some of them are highly variable in both master and the patch (e.g. http2_parallel_x10_req_10kb_100_chunks_adaptive_window).

seanmonstar commented 8 months ago

adaptive_window

I noticed that too. I think the issue is that the benchmarks are going over localhost, which means adaptive window, which tries to detect bandwidth, will have a horrible time. A cool project to improve the adaptive window option would be to set up some tests using turmoil, where we can simulate latency.
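
A rough sketch of what that could look like, as an assumption rather than an existing h2 test: the Builder latency knobs and host/client names below are how I recall turmoil's API, and the h2 server/client bodies are left as placeholder comments.

```rust
// Hypothetical harness: run an h2 client/server pair inside a turmoil
// simulation with injected per-message latency, instead of over localhost.
use std::time::Duration;

fn main() {
    let mut sim = turmoil::Builder::new()
        // Assumed knobs: simulate a link with 20-80 ms of latency.
        .min_message_latency(Duration::from_millis(20))
        .max_message_latency(Duration::from_millis(80))
        .build();

    sim.host("server", || async move {
        // An h2 server accepting over turmoil::net::TcpListener would go here.
        Ok(())
    });

    sim.client("client", async move {
        // An h2 client driving requests over turmoil::net::TcpStream would go
        // here, asserting on adaptive-window throughput under latency.
        Ok(())
    });

    sim.run().unwrap();
}
```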

seanmonstar commented 8 months ago

Thank you so much @dswij, excellent debugging work here ❤️

seanmonstar commented 8 months ago

Shipped as v0.4.1 :)