Closed kundu-subhajit closed 10 months ago
Hey there!
It looks like the linked repo is missing the server code.
I'm also curious, does it work if you don't use a waitgroup/semaphore? Just relying on hyper to automatically not go over the concurrent streams?
Firstly, I appreciate your prompt response. Secondly, my bad, thanks for pointing that out. The repositories are now updated, checked out, compiled, and tested for successful reproduction of the issue.
I removed the parts using the waitgroup in my personal space and ran the code. It still gets stuck; we still hit the issue. The semaphore is needed to cap the number of parallel attempts at a time, so I kept it.
If you want, I can update the repo with the waitgroup removed. The code with the waitgroup is tested, though, for successful runs.
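For reference, the concurrency-capping pattern being discussed can be sketched with nothing but the standard library. This is a toy counting semaphore built on `Mutex` + `Condvar` (the repro presumably uses an async semaphore such as tokio's, but the idea is the same):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore: at most `permits` holders at once.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // block until a permit is released
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    // Hypothetical limit: at most 4 "requests" in flight at once.
    let sem = Arc::new(Semaphore::new(4));
    let peak = Arc::new(Mutex::new((0usize, 0usize))); // (current, max observed)

    let handles: Vec<_> = (0..16)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut p = peak.lock().unwrap();
                    p.0 += 1;
                    p.1 = p.1.max(p.0);
                }
                thread::sleep(Duration::from_millis(10)); // stand-in for a request
                peak.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    let max = peak.lock().unwrap().1;
    assert!(max <= 4);
    println!("peak concurrency: {}", max);
}
```

The point of the question above is whether this layer is even needed, or whether hyper's own `max_concurrent_streams` handling is enough on its own.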
Ok, I was just trying to figure out how small we can make the test case.
Do you have the trace logs? Could you paste them in a comment here inside <details></details>?
For 2-3 seconds of runtime it's a pretty big log; I was unable to share it the way you asked. Most log files were a little over 25 MB, and drag-and-drop has a 25 MB limit.
After a couple of runs I arrived at a log file within that limit: hyper-stuck.log.2023-10-12.log
I would still suggest running it from the sample code yourself, so that you can have a longer runtime to capture any log that is printed while things are stuck.
@jeromegn, @seanmonstar, @jfourie1: Ping.
Please don't ping, especially the others. We're all busy and balancing different things.
If you'd like it to get solved faster, you can try digging into the logs and the source code to determine exactly what has it stuck.
Noted.
The problem was with hyper-openssl, which is probably a bit outdated. With hyper-rustls it works as expected (and is actually faster by approx. 50%; the number of clients does not play a role - I guess one client is already parallel enough).
You can check it here: https://github.com/izderadicka/hyper-stuck
I was too quick with conclusions, sorry. The previous example worked because it uses HTTP/1.1. Just switching to HTTP/2.0 causes the same problem as reported, so it's probably not the TLS implementation.
An updated version which demonstrates the problem will be at https://github.com/izderadicka/hyper-stuck
I've done one more test - against a Rust hyper server (added to the repo) instead - and now the client works fine, without problems. So it must be something between the Go server and the Rust client on the HTTP/2.0 protocol. I know very little Go, but the server looks simple - all defaults, nothing suspicious there.
Also one interesting finding: the client against the Rust server on HTTP/2.0 is about 4 times slower than on HTTP/1.1 - against the same server, only the protocol switched - which is a bit counter-intuitive. The Go client, on the other hand, is a little bit faster against the Rust server than it was against the Go server.
I wanted to try hyper 1.0 in the client, but I don't yet know how to set up TLS - 1.0 is still very fresh, hyper-rustls is on 0.14, and I did not have time to dig into how to manually create an appropriate connector.
And one more final finding from the trace log: the log from the stuck run against the Go server has a lot of "stream capacity is 0" messages at the end, when communication got stuck. I do not see anything like this in the log of the successful run against the Rust server. Here is an excerpt from the log:
2023-11-18T16:49:56.726566Z TRACE Connection{peer=Client}:poll: h2::proto::streams::prioritize: schedule_pending_open
2023-11-18T16:49:56.726571Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: is_pending_reset=false
2023-11-18T16:49:56.726575Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: data frame sz=235335 eos=true window=0 available=0 requested=235335 buffered=235335
2023-11-18T16:49:56.726580Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: stream capacity is 0
2023-11-18T16:49:56.726585Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: is_pending_reset=false
2023-11-18T16:49:56.726590Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: data frame sz=238437 eos=true window=0 available=0 requested=238437 buffered=238437
2023-11-18T16:49:56.726594Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: stream capacity is 0
2023-11-18T16:49:56.726599Z TRACE Connection{peer=Client}:poll:FramedWrite::flush: h2::codec::framed_write: flushing buffer
2023-11-18T16:49:56.726603Z TRACE Connection{peer=Client}:poll: tokio::task::waker: op="waker.clone" task.id=5
2023-11-18T16:49:56.726607Z TRACE Connection{peer=Client}:poll: tokio::task::waker: op="waker.drop" task.id=5
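To make the trace lines above concrete: in HTTP/2, a DATA frame can only carry as many bytes as both the stream-level and the connection-level flow-control windows allow. This is a toy model of that rule (not h2's actual implementation), using the numbers from the excerpt:

```rust
/// Toy model of HTTP/2 send capacity: the bytes we may send on a stream
/// are capped by the stream window AND the shared connection window.
fn sendable(stream_window: u32, conn_window: u32, buffered: u32) -> u32 {
    buffered.min(stream_window).min(conn_window)
}

fn main() {
    // Mirrors the trace above: 235_335 bytes buffered, but window=0,
    // so "stream capacity is 0" and nothing can be flushed.
    assert_eq!(sendable(0, 65_535, 235_335), 0); // stream window exhausted
    assert_eq!(sendable(65_535, 0, 235_335), 0); // connection window exhausted
    assert_eq!(sendable(65_535, 40_000, 235_335), 40_000); // partial frame fits
    println!("ok");
}
```

Either window hitting zero is enough to stall the stream; only a WINDOW_UPDATE from the peer can unblock it.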
I've done a bit of additional exploration, searching for "stream capacity is 0" in the logs.
Excellent work investigating! So, is it related to flow control? Is there connection window available, but not stream window? It's fine to investigate with hyper v0.14; it's most likely something in the h2 dependency, which is the same in both cases.
Yes, it's the same in 0.14 and 1.0 - so I assume it could be h2.
I'm not an expert in HTTP/2.0; I have only general knowledge.
I think it might be related to flow control - what seems to be happening is that frames stop being sent after the stream capacity is indicated as 0. The connections are still open - if I stop the server, the client quickly ends with errors on the connections: connection closed.
I'm not sure what exactly you mean by "Is there connection window available, but not stream?" - is there any way to check this?
I'm combing through the logs. It doesn't seem like there's an easy way to see the connection window. But as I scroll through the logs, I do eventually see mention of "buffered" for a stream being really high, and things like conn=512 or even smaller. And then there's a trickle of tiny WINDOW_UPDATE frames from the peer.
It does seem like the hyper client is getting starved via flow control. It's quite interesting that a Rust server makes the client happy. But I wonder what the Go server does differently...
Can you change your program to print the status code of each response? I notice that the Go server will try to read the full request body, but if there's some sort of error, it will return a 400 response. I wonder if any of those happen, and if because of it, some window capacity isn't released for that stream in particular. Or if the buffered data is no longer allowed to keep being sent.
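The "trickle of tiny WINDOW_UPDATE frames" observation can be illustrated with a small model (a sketch, not h2's code): if the peer only grants credit in tiny increments, hundreds of kilobytes of buffered data drain extremely slowly, which looks like a hang.

```rust
/// Toy model: apply a sequence of WINDOW_UPDATE increments to an exhausted
/// window and count how many of the pending bytes actually get sent.
fn drain(mut window: u32, mut pending: u32, updates: &[u32]) -> u32 {
    let mut sent = 0;
    for &inc in updates {
        window += inc; // peer grants a little more credit
        let n = window.min(pending); // send as much as credit allows
        sent += n;
        window -= n;
        pending -= n;
    }
    sent
}

fn main() {
    // ~235 KB buffered (as in the trace), window already at 0,
    // peer sends three 512-byte WINDOW_UPDATEs (like the conn=512 sightings):
    let sent = drain(0, 235_335, &[512, 512, 512]);
    assert_eq!(sent, 1_536); // only 1.5 KB of 235 KB moved
    println!("sent {} of 235335 bytes", sent);
}
```

The numbers (235_335 pending, 512-byte updates) are taken from the log excerpts above; the function itself is only illustrative.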
Hi, the responses have status 200. What is weird is that with the provided params it gets stuck quite quickly - only a few responses come back and then it freezes. This is the output with an added printout of the status:
cargo run --bin hyper-stuck
Finished dev [unoptimized + debuginfo] target(s) in 0.07s
Running `target/debug/hyper-stuck`
Iteration i=0
Iteration i=100
Iteration i=200
Iteration i=300
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
So I do start all 300 requests (each in a separate task), but I got only 4 back; then it's stuck. I have attached the complete tracing log. Some interesting notes:
At the time it gets stuck (~18:34:50) there are a lot of "stream capacity is 0" messages, as already noted before, and streams (grepped by id - see below) usually end with this message.
At the time it gets stuck there are also a lot of GOAWAYs like this one:
2023-11-22T18:34:50.543203Z DEBUG runtime.spawn{kind=task task.name= task.id=30 loc.file="src/main.rs" loc.line=154 loc.col=9}:runtime.spawn{kind=task task.name= task.id=193 loc.file="/home/ivan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.1/src/rt/tokio.rs" loc.line=35 loc.col=9}:Connection{peer=Client}:poll:FramedWrite::buffer{frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
It's interesting to look at the first few streams by id (I think odd ids are from the client):
$ grep -i "StreamId(7)" hyper-stuck.log.2023-11-22 | wc -l
44
$ grep -i "StreamId(9)" hyper-stuck.log.2023-11-22 | wc -l
683
So for streams up to 7 (lower ids not shown; they have exactly the same number of messages) it went well, I think - it probably corresponds to the responses I got back. However, from stream 9 up it gets messed up. It looks similar for the following streams, up to the very last streams, which behave differently:
$ grep -i "StreamId(253)" hyper-stuck.log.2023-11-22 | wc -l
181
$ grep -i "StreamId(255)" hyper-stuck.log.2023-11-22 | wc -l
174
$ grep -i "StreamId(257)" hyper-stuck.log.2023-11-22 | wc -l
2
$ grep -i "StreamId(259)" hyper-stuck.log.2023-11-22 | wc -l
0
I guess some insight might be gained by comparing individual streams - especially, say, 7 with 9. The only thing I noticed is that stream 7 did not have any WINDOW_UPDATE frames received, but the problematic stream 9 already has quite a few of them - 27.
I did some investigating; it seems like all the streams are starved of connection capacity and can't proceed to the next state. This happens because the high fut limit means the streams are fighting for the capacity (and some have reserved capacities).
Log: https://gist.github.com/dswij/9cd727705018f41a7e627b995584deed
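The starvation claim is easy to make plausible with back-of-the-envelope arithmetic (a sketch; the exact sizes come from the trace excerpts earlier in this thread, and the default connection window is assumed to be the HTTP/2 initial value of 65,535 bytes):

```rust
/// Fair share of the shared connection window per in-flight stream.
fn fair_share(conn_window: u64, streams: u64) -> u64 {
    conn_window / streams
}

fn main() {
    let conn_window: u64 = 65_535; // HTTP/2 default initial connection window
    let streams: u64 = 400;        // in-flight requests in the repro
    let per_request: u64 = 235_335; // approx. body size seen in the trace

    let share = fair_share(conn_window, streams);
    println!("each stream's fair share: {} bytes", share);
    // Each request wants ~235 KB but the whole connection only has ~64 KB of
    // credit to split 400 ways, so every stream is starved almost immediately.
    assert!(share * 1_000 < per_request);
}
```

With 400 streams contending (and some holding reserved capacity), nearly every stream ends up with effectively zero sendable bytes, matching the "stream capacity is 0" flood in the logs.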
Popping pending_capacity makes the requests progress, but it hangs again after a while (probably because of my incorrect implementation?). So, one might need to comment out pending_capacity.pop to correctly replicate the hang (and see OOPS).
If we change the fut limit (i.e. HTEST_FUT_LIMIT in the repro code) to 100, matching the concurrent stream limit on the Go server, the client works perfectly (with no OOPS).
@dswij thanks so much for digging! I really want to get this fixed, but haven't had time between hyper 1.0, upgrading reqwest, and end of year needs. Would you want to look into this further, and recommend a fix for hyper?
Sure, I'll have a go at it
As an update, @dswij has a proposed change that appears to fix a reported case in this issue. If you'd like to take a look, that'd be a huge help! https://github.com/hyperium/h2/pull/730
The fix has been published in h2 v0.4.1!
Version 0.14.27
Platform Linux ubuntu 4.15.0-213-generic # 224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Short description of the issue: Client network communications using the Rust hyper library get stuck if too many requests are spawned.
Description I see this bug has been earlier reported in #2419 and also has another manifestation in #3041. Both are marked fixed. #3041 also says the fix is tested with 100 requests on 50 streams.
Generally speaking, HTTP/2 servers allow 250 parallel streams. However, a client can create an unlimited number of streams. The RFC has the following recommendation: "It is recommended that this value be no smaller than 100, so as to not unnecessarily limit parallelism."
We have a Rust client implementation using hyper 0.14.27, along with a Go server. Client requests are restricted to 400 at any point in time. The Go client runs smoothly; the hyper client gets stuck every time. The bug is easily reproducible. Both the Rust client and the Go server implementation can be found in the link below.
Where is the sample code that hits the issue? Sample code link: https://github.com/kundu-subhajit/hyper-stuck It has a Go server implementation, a Rust client implementation using hyper 0.14.27, and a Go client implementation. The build instructions are very straightforward.
Any known workaround with the sample code? We do acknowledge that this same issue isn't reproducible with a semaphore count below 250 in the sample code shared.
Why can't we use a smaller number of parallel streams as a workaround? In our local setup we have 100 as the max number of HTTP/2 streams at the server. But even if we try with a semaphore count of 90, network communications through hyper get stuck. Trying to isolate that, we arrived here and at the sample code.
What have we done differently in the sample code compared to the previous reports? Nothing much. We have done only two minor things differently here:
CC: @jeromegn, @seanmonstar, @jfourie1 Please help us fix this bug at your earliest convenience. Any help with this is highly appreciated. Thank you in anticipation.