Closed kundu-subhajit closed 10 months ago
Hey there!
It looks like the linked repo is missing the server code.
I'm also curious, does it work if you don't use a waitgroup/semaphore? Just relying on hyper to automatically not go over the concurrent streams?
Firstly, I appreciate your prompt response. Secondly, my bad, thanks for pointing that out. The repositories are now updated, checked out, compiled, and tested for successful reproduction of the issue.
I removed the parts using the waitgroup in my personal space and ran the code. It still gets stuck; we still hit the issue. The semaphore is needed to cap the number of parallel attempts at a time, so I kept it.
If you want, I can update the repo with the waitgroup removed. The code with the waitgroup is tested, though, for successful runs.
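For reference, the concurrency-capping pattern being discussed can be sketched with nothing but the standard library. This is a toy counting semaphore built on `Mutex` + `Condvar` (the repro presumably uses an async semaphore such as tokio's, but the idea is the same):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore: at most `permits` holders at once.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // block until a permit is released
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    // Hypothetical limit: at most 4 "requests" in flight at once.
    let sem = Arc::new(Semaphore::new(4));
    let peak = Arc::new(Mutex::new((0usize, 0usize))); // (current, max observed)

    let handles: Vec<_> = (0..16)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut p = peak.lock().unwrap();
                    p.0 += 1;
                    p.1 = p.1.max(p.0);
                }
                thread::sleep(Duration::from_millis(10)); // stand-in for a request
                peak.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    let max = peak.lock().unwrap().1;
    assert!(max <= 4);
    println!("peak concurrency: {}", max);
}
```

The point of the question above is whether this layer is even needed, or whether hyper's own `max_concurrent_streams` handling is enough on its own.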
Ok, I was just trying to figure out how small we can make the test case.
Do you have the trace logs? Could you paste them in a comment here inside <details></details>?
For 2-3 seconds of runtime it's a pretty big log; I was unable to share it the way you asked. Most log files were a little over 25 MB, and drag-and-drop has a 25 MB limit.
After a couple of runs I arrived at a log file within that limit: hyper-stuck.log.2023-10-12.log
I would still suggest running it from the sample code yourself, so that you can have a longer runtime to capture any log that is printed while things are stuck.
@jeromegn, @seanmonstar, @jfourie1: Ping.
Please don't ping, especially the others. We're all busy and balancing different things.
If you'd like it to get solved faster, you can try digging into the logs and the source code to determine exactly what has it stuck.
Noted.
The problem was with hyper-openssl, which is probably a bit outdated. With hyper-rustls it works as expected (and is actually faster by approx. 50%; the number of clients does not play a role - I guess one client is already parallel enough).
You can check it here: https://github.com/izderadicka/hyper-stuck
I was too quick with conclusions, sorry. The previous example worked because it uses HTTP/1.1. Just switching to HTTP/2.0 causes the same problem as reported, so it's probably not the TLS implementation.
An updated version which demonstrates the problem will be at https://github.com/izderadicka/hyper-stuck
I've done one more test - against a Rust hyper server (added to the repo) instead - and now the client works fine, without problems. So it must be something between the Go server and the Rust client on the HTTP/2.0 protocol. I know very little Go, but the server looks simple - all defaults, nothing suspicious there.
Also one interesting finding: the client against the Rust server on HTTP/2.0 is about 4 times slower than on HTTP/1.1 - against the same server, only the protocol switched - which is a bit counter-intuitive. The Go client, on the other hand, is a little bit faster against the Rust server than it was against the Go server.
I wanted to try hyper 1.0 in the client, but I don't yet know how to set up TLS - 1.0 is still very fresh, hyper-rustls is on 0.14, and I did not have time to dig into how to manually create an appropriate connector.
And one more final finding from the trace log: the log from the stuck run against the Go server has a lot of "stream capacity is 0" messages at the end, when communication got stuck. I do not see anything like this in the log of the successful run against the Rust server. Here is an excerpt from the log:
2023-11-18T16:49:56.726566Z TRACE Connection{peer=Client}:poll: h2::proto::streams::prioritize: schedule_pending_open
2023-11-18T16:49:56.726571Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: is_pending_reset=false
2023-11-18T16:49:56.726575Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: data frame sz=235335 eos=true window=0 available=0 requested=235335 buffered=235335
2023-11-18T16:49:56.726580Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: stream capacity is 0
2023-11-18T16:49:56.726585Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: is_pending_reset=false
2023-11-18T16:49:56.726590Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: data frame sz=238437 eos=true window=0 available=0 requested=238437 buffered=238437
2023-11-18T16:49:56.726594Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: stream capacity is 0
2023-11-18T16:49:56.726599Z TRACE Connection{peer=Client}:poll:FramedWrite::flush: h2::codec::framed_write: flushing buffer
2023-11-18T16:49:56.726603Z TRACE Connection{peer=Client}:poll: tokio::task::waker: op="waker.clone" task.id=5
2023-11-18T16:49:56.726607Z TRACE Connection{peer=Client}:poll: tokio::task::waker: op="waker.drop" task.id=5
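To make the trace lines above concrete: in HTTP/2, a DATA frame can only carry as many bytes as both the stream-level and the connection-level flow-control windows allow. This is a toy model of that rule (not h2's actual implementation), using the numbers from the excerpt:

```rust
/// Toy model of HTTP/2 send capacity: the bytes we may send on a stream
/// are capped by the stream window AND the shared connection window.
fn sendable(stream_window: u32, conn_window: u32, buffered: u32) -> u32 {
    buffered.min(stream_window).min(conn_window)
}

fn main() {
    // Mirrors the trace above: 235_335 bytes buffered, but window=0,
    // so "stream capacity is 0" and nothing can be flushed.
    assert_eq!(sendable(0, 65_535, 235_335), 0); // stream window exhausted
    assert_eq!(sendable(65_535, 0, 235_335), 0); // connection window exhausted
    assert_eq!(sendable(65_535, 40_000, 235_335), 40_000); // partial frame fits
    println!("ok");
}
```

Either window hitting zero is enough to stall the stream; only a WINDOW_UPDATE from the peer can unblock it.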
I've done a bit of additional exploration, searching for "stream capacity is 0" in the logs.
Excellent work investigating! So, is it related to flow control? Is there connection window available, but not stream window? It's fine to investigate with hyper v0.14; it's most likely something in the h2 dependency, which is the same in both cases.
Yes, it's the same in 0.14 and 1.0 - so I assume it could be h2.
I'm not an expert in HTTP/2.0; I have only general knowledge.
I think it might be related to flow control - what seems to be happening is that frames stop being sent after the stream capacity is indicated as 0. The connections are still open - if I stop the server, the client quickly ends with errors on the connections: connection closed.
I'm not sure what exactly you mean by "Is there connection window available, but not stream?" - is there any way to check this?
I'm combing through the logs. It doesn't seem like there's an easy way to see the connection window. But as I scroll through the logs, I do eventually see mention of "buffered" for a stream being really high, and things like conn=512 or even smaller. And then there's a trickle of tiny WINDOW_UPDATE frames from the peer.
It does seem like the hyper client is getting starved via flow control. It's quite interesting that a Rust server makes the client happy. But I wonder what the Go server does differently...
Can you change your program to print the status code of each response? I notice that the Go server will try to read the full request body, but if there's some sort of error, it will return a 400 response. I wonder if any of those happen, and if because of it, some window capacity isn't released for that stream in particular. Or if the buffered data is no longer allowed to keep being sent.
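The "trickle of tiny WINDOW_UPDATE frames" observation can be illustrated with a small model (a sketch, not h2's code): if the peer only grants credit in tiny increments, hundreds of kilobytes of buffered data drain extremely slowly, which looks like a hang.

```rust
/// Toy model: apply a sequence of WINDOW_UPDATE increments to an exhausted
/// window and count how many of the pending bytes actually get sent.
fn drain(mut window: u32, mut pending: u32, updates: &[u32]) -> u32 {
    let mut sent = 0;
    for &inc in updates {
        window += inc; // peer grants a little more credit
        let n = window.min(pending); // send as much as credit allows
        sent += n;
        window -= n;
        pending -= n;
    }
    sent
}

fn main() {
    // ~235 KB buffered (as in the trace), window already at 0,
    // peer sends three 512-byte WINDOW_UPDATEs (like the conn=512 sightings):
    let sent = drain(0, 235_335, &[512, 512, 512]);
    assert_eq!(sent, 1_536); // only 1.5 KB of 235 KB moved
    println!("sent {} of 235335 bytes", sent);
}
```

The numbers (235_335 pending, 512-byte updates) are taken from the log excerpts above; the function itself is only illustrative.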
Hi, the responses have status 200. What is weird is that with the provided params it gets stuck quite quickly - only a few responses come back and then it freezes. This is the output with an added printout of the status:
cargo run --bin hyper-stuck
Finished dev [unoptimized + debuginfo] target(s) in 0.07s
Running `target/debug/hyper-stuck`
Iteration i=0
Iteration i=100
Iteration i=200
Iteration i=300
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
So I do start all 300 requests (each in a separate task), but I got only 4 back; then it's stuck. I have attached the complete tracing log. Some interesting notes:
At the time it gets stuck (~18:34:50) there are a lot of "stream capacity is 0" messages, as already noted before, and streams (grepped by id - see below) usually end with this message.
At the time it gets stuck there are also a lot of GOAWAYs like this one:
2023-11-22T18:34:50.543203Z DEBUG runtime.spawn{kind=task task.name= task.id=30 loc.file="src/main.rs" loc.line=154 loc.col=9}:runtime.spawn{kind=task task.name= task.id=193 loc.file="/home/ivan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.1/src/rt/tokio.rs" loc.line=35 loc.col=9}:Connection{peer=Client}:poll:FramedWrite::buffer{frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
It's interesting to look at the first few streams by id (I think odd ids are from the client):
$ grep -i "StreamId(7)" hyper-stuck.log.2023-11-22 | wc -l
44
$ grep -i "StreamId(9)" hyper-stuck.log.2023-11-22 | wc -l
683
So for streams up to 7 (lower ids not shown; they have exactly the same number of messages) it went well, I think - it probably corresponds to the responses I got back. However, from stream 9 up it gets messed up. It looks similar for the following streams, up to the very last streams, which behave differently:
$ grep -i "StreamId(253)" hyper-stuck.log.2023-11-22 | wc -l
181
$ grep -i "StreamId(255)" hyper-stuck.log.2023-11-22 | wc -l
174
$ grep -i "StreamId(257)" hyper-stuck.log.2023-11-22 | wc -l
2
$ grep -i "StreamId(259)" hyper-stuck.log.2023-11-22 | wc -l
0
I guess some insight might be gained by comparing individual streams - especially, say, 7 with 9. The only thing I noticed is that stream 7 did not have any WINDOW_UPDATE frames received, but the problematic stream 9 already has quite a few of them - 27.
I did some investigating; it seems like all the streams are starved of connection capacity and can't proceed to the next state. This happens because the high fut limit means the streams are fighting for the capacity (and some have reserved capacities).
Log: https://gist.github.com/dswij/9cd727705018f41a7e627b995584deed
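The starvation claim is easy to make plausible with back-of-the-envelope arithmetic (a sketch; the exact sizes come from the trace excerpts earlier in this thread, and the default connection window is assumed to be the HTTP/2 initial value of 65,535 bytes):

```rust
/// Fair share of the shared connection window per in-flight stream.
fn fair_share(conn_window: u64, streams: u64) -> u64 {
    conn_window / streams
}

fn main() {
    let conn_window: u64 = 65_535; // HTTP/2 default initial connection window
    let streams: u64 = 400;        // in-flight requests in the repro
    let per_request: u64 = 235_335; // approx. body size seen in the trace

    let share = fair_share(conn_window, streams);
    println!("each stream's fair share: {} bytes", share);
    // Each request wants ~235 KB but the whole connection only has ~64 KB of
    // credit to split 400 ways, so every stream is starved almost immediately.
    assert!(share * 1_000 < per_request);
}
```

With 400 streams contending (and some holding reserved capacity), nearly every stream ends up with effectively zero sendable bytes, matching the "stream capacity is 0" flood in the logs.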
Popping pending_capacity makes the requests progress, but it hangs again after a while (probably because of my incorrect implementation?). So, one might need to comment out pending_capacity.pop to correctly replicate the hang (and see OOPS).
If we change the fut limit (i.e. HTEST_FUT_LIMIT in the repro code) to 100, matching the concurrent stream limit on the Go server, the client works perfectly (with no OOPS).
@dswij thanks so much for digging! I really want to get this fixed, but haven't had time between hyper 1.0, upgrading reqwest, and end of year needs. Would you want to look into this further, and recommend a fix for hyper?
Sure, I'll have a go at it
As an update, @dswij has a proposed change that appears to fix a reported case in this issue. If you'd like to take a look, that'd be a huge help! https://github.com/hyperium/h2/pull/730
The fix has been published in h2 v0.4.1!
Version 0.14.27
Platform Linux ubuntu 4.15.0-213-generic # 224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Short description of the issue: Client network communications using the Rust hyper library get stuck if too many requests are spawned.
Description I see this bug has been earlier reported in #2419 and also has another manifestation in #3041. Both are marked fixed. #3041 also says the fix is tested with 100 requests on 50 streams.
Generally speaking, HTTP/2 servers allow 250 parallel streams. However, a client can create an unlimited number of streams. The RFC has the following recommendation: "It is recommended that this value be no smaller than 100, so as to not unnecessarily limit parallelism."
We have a Rust client implementation using hyper 0.14.27, along with a Go server. Client requests are restricted to 400 at any point in time. The Go client runs smoothly; the hyper client gets stuck every time. The bug is easily reproducible. Both the Rust client and the Go server implementation can be found in the link below.
Where is the sample code that hits the issue? Sample code link: https://github.com/kundu-subhajit/hyper-stuck It has a Go server implementation, a Rust client implementation using hyper 0.14.27, and a Go client implementation. The build instructions are very straightforward.
Any known workaround with the sample code? We do acknowledge that this same issue isn't reproducible with a semaphore count below 250 in the sample code shared.
Why can't we use a smaller number of parallel streams as a workaround? In our local setup we have 100 as the max number of HTTP/2 streams at the server. But even if we try with a semaphore count of 90, network communications through hyper get stuck. Trying to isolate that, we arrived here and at the sample code.
What have we done differently in the sample code compared to the previous reports? Nothing much. We have done only two minor things differently here:
CC: @jeromegn, @seanmonstar, @jfourie1 Please help us fix this bug at your earliest convenience. Any help with this is highly appreciated. Thank you in anticipation.