quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Seems like quinn 0.11 is not working well under heavy load #1867

Closed szguoxz closed 1 month ago

szguoxz commented 1 month ago

I can't be sure, but my debugging tells me it can only be a quinn problem. :-) I'm not sure if I hit the bug you fixed in 0.11.1. Since you didn't release it to crates.io, I'm not sure how to use 0.11.1.

Anyway, the latest release seems less stable than 0.10, but I could be wrong! Under heavy load the data seems to get stuck and can't be sent on the stream.

Ralith commented 1 month ago

quinn-proto 0.11.1 was released on crates.io 9 days ago. Are you using it?

This isn't an actionable report. What is the specific behavior? Do you have a reproducible test case?

szguoxz commented 1 month ago

Oh, I went to crates.io and saw that quinn is 0.11, so I assumed quinn-proto was the same version. Yes, I am using the latest quinn-proto version, 0.11.1. My connection gets stuck from time to time, and I can't figure out how to reproduce it yet. It happens within days, sometimes within minutes if I'm lucky.

I am still trying to find a way to prove it's the stream, but I haven't yet; maybe it's my own problem.

Ralith commented 1 month ago

What exactly does "got stuck" mean? Is the sender unable to write data to a stream? Is the receiver unable to read data that was successfully written? Are other functions of the connection degraded in any way?

There have been some reports of stream flow control issues in https://github.com/quinn-rs/quinn/issues/1818; I wonder if that might be related. If this is a flow control issue, then you should see all previously written data successfully received, but an inability to write new data. You can track this by logging the total number of bytes written to/read from the stream in question.
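
For example, a minimal sketch of that accounting; the wrapper functions and `tracing` fields here are illustrative, not part of quinn's API:

```rust
use quinn::{RecvStream, SendStream};

// Hypothetical wrappers: count every byte that crosses the stream so the
// send and receive totals can be compared once the stall happens.
async fn write_counted(
    send: &mut SendStream,
    buf: &[u8],
    total_written: &mut u64,
) -> Result<(), quinn::WriteError> {
    send.write_all(buf).await?;
    *total_written += buf.len() as u64;
    tracing::debug!(total = *total_written, "stream bytes written so far");
    Ok(())
}

async fn read_counted(
    recv: &mut RecvStream,
    buf: &mut [u8],
    total_read: &mut u64,
) -> Result<Option<usize>, quinn::ReadError> {
    let n = recv.read(buf).await?;
    if let Some(n) = n {
        *total_read += n as u64;
        tracing::debug!(total = *total_read, "stream bytes read so far");
    }
    Ok(n)
}
```

If the write total keeps growing while the read total freezes, data is being lost or stalled in transit; if both freeze at the same value, the symptom is consistent with exhausted flow-control credit.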

It happens within days, sometimes within minutes if I'm lucky.

Can you run your workload many times concurrently to deliberately trigger the behavior more frequently?

Ralith commented 1 month ago

Some interesting internal Quinn state you could try to capture when your application stops making progress:

Send side:

quinn_proto::connection::streams::StreamsState::{max_data, data_sent, unacked_data}
quinn_proto::connection::streams::Send::{max_data, pending.offset()}

Receive side:

quinn_proto::connection::streams::StreamsState::{local_max_data, sent_max_data}
quinn_proto::connection::streams::Recv::{end, sent_stream_max_data, assembler.bytes_read()}

szguoxz commented 1 month ago

Yes, it seems like a flow-control problem. It works fine, and then suddenly it can't write new data; write_all gets stuck. Well, that's just my guess. I am still trying to log stuff to back it up.

szguoxz commented 1 month ago

It seems this information is not publicly available?

Some interesting internal Quinn state you could try to capture when your application stops making progress:

Send side:

quinn_proto::connection::streams::StreamsState::{max_data, data_sent, unacked_data}
quinn_proto::connection::streams::Send::{max_data, pending.offset()}

Receive side:

quinn_proto::connection::streams::StreamsState::{local_max_data, sent_max_data}
quinn_proto::connection::streams::Recv::{end, sent_stream_max_data, assembler.bytes_read()}

szguoxz commented 1 month ago

I did a test. I am building a VPN, sending packets through QUIC. Using a single bidirectional stream with length-delimited framing works much more stably than using unidirectional streams, one stream per frame. I believe the default TransportConfig is to blame; for example, I need to adjust max_concurrent_uni_streams. 100 is way too low, but even if I change it to 1000, it's still not stable. The bidirectional stream is much more stable.
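
A minimal sketch of that framing pattern, assuming a 4-byte big-endian length prefix over a long-lived bidirectional stream (the exact framing used in the report is not specified):

```rust
use quinn::{RecvStream, SendStream};

// Length-delimited framing over one long-lived bidirectional stream:
// each frame is a 4-byte big-endian length prefix followed by the body.
async fn send_frame(
    send: &mut SendStream,
    frame: &[u8],
) -> Result<(), quinn::WriteError> {
    send.write_all(&(frame.len() as u32).to_be_bytes()).await?;
    send.write_all(frame).await
}

async fn recv_frame(
    recv: &mut RecvStream,
) -> Result<Vec<u8>, quinn::ReadExactError> {
    let mut len = [0u8; 4];
    recv.read_exact(&mut len).await?;
    let mut frame = vec![0u8; u32::from_be_bytes(len) as usize];
    recv.read_exact(&mut frame).await?;
    Ok(frame)
}
```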

It still hangs from time to time, and I can't find out why yet. But I can be pretty sure it's because the data can't be sent somehow. Not only does it fail to send, it also blocks the flow; i.e., write_all().await gets stuck.

Very tough to reproduce. I will continue to watch.

Ralith commented 1 month ago

It seems this information is not publicly available?

Yes, they are internal Quinn state. You can use a modified version of Quinn to insert whatever logging or getters you like.
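
For example, a fork might add a tracing helper like this sketch (hypothetical method; the field names come from the list above, and exact types and module layout may differ between versions):

```rust
// Patch sketch for a fork of quinn-proto, placed inside the streams module
// where `StreamsState` and its private fields are visible. Not upstream API.
impl StreamsState {
    /// Trace send-side connection-level flow-control counters; call wherever
    /// convenient, e.g. whenever a MAX_DATA frame is handled.
    pub(crate) fn trace_send_flow_control(&self) {
        tracing::trace!(
            max_data = ?self.max_data,
            data_sent = ?self.data_sent,
            unacked_data = ?self.unacked_data,
            "send-side connection flow control"
        );
    }
}
```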

Ralith commented 1 month ago

Using a single bidirectional stream with length-delimited framing works much more stably than using unidirectional streams, one stream per frame.

If using short-lived streams fails much more often, can you build a test case using that pattern? If you're observing the same behavior when using short-lived streams, it is much less likely to be a flow control issue.
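
A rough sketch of what such a test case might look like, using the one-stream-per-frame pattern (names and frame size are placeholders):

```rust
// Open a fresh unidirectional stream per frame in a tight loop, which
// exercises stream creation and flow-control limits heavily.
async fn hammer_uni_streams(
    conn: &quinn::Connection,
) -> Result<(), Box<dyn std::error::Error>> {
    let frame = vec![0u8; 1400]; // placeholder frame
    loop {
        let mut stream = conn.open_uni().await?;
        stream.write_all(&frame).await?;
        stream.finish()?; // finish() is synchronous in quinn 0.11
    }
}
```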

I believe the default TransportConfig is to blame; for example, I need to adjust max_concurrent_uni_streams. 100 is way too low, but even if I change it to 1000, it's still not stable.

That parameter governs concurrency. It will not cause your application to hang unless your application is incorrect. In most cases, you should be able to set it to 1 and have no adverse effects beyond degraded throughput.
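
For reference, a minimal sketch of adjusting that limit with the quinn 0.11 API (the helper function is illustrative):

```rust
use std::sync::Arc;
use quinn::{ClientConfig, TransportConfig, VarInt};

// Raise the limit on how many unidirectional streams the peer may have open
// at once. Per the comment above, this bounds concurrency only; it should
// not cause hangs at any value in a correct application.
fn with_uni_stream_limit(mut config: ClientConfig, limit: u32) -> ClientConfig {
    let mut transport = TransportConfig::default();
    transport.max_concurrent_uni_streams(VarInt::from_u32(limit));
    config.transport_config(Arc::new(transport));
    config
}
```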

szguoxz commented 1 month ago

Is there a way to require the stream to send data and receive an ACK within a certain time frame? If it does not get the ACK back in time, the stream would invalidate the connection.

I think what I am looking for is an "ACK timeout" setting on TransportConfig. Is that possible?

Ralith commented 1 month ago

The health of a connection is independent of the state of an individual stream. If a connection is healthy, then so are its streams. If a peer stops responding, the connection will time out according to the idle timeout.
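
For reference, a minimal sketch of the closest built-in mechanism with the quinn 0.11 API: an unresponsive peer fails the whole connection after max_idle_timeout, while keep-alives prevent a healthy but quiet connection from being torn down. Values here are placeholders:

```rust
use std::{sync::Arc, time::Duration};
use quinn::{IdleTimeout, TransportConfig};

// Time out the connection if the peer is silent for 30 s, and send
// keep-alives every 10 s so an idle-but-healthy link stays up.
fn transport_with_idle_timeout() -> Arc<TransportConfig> {
    let mut transport = TransportConfig::default();
    let idle: IdleTimeout = Duration::from_secs(30)
        .try_into()
        .expect("within VarInt range");
    transport.max_idle_timeout(Some(idle));
    transport.keep_alive_interval(Some(Duration::from_secs(10)));
    Arc::new(transport)
}
```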

Ralith commented 1 month ago

Did you root-cause your issue? Is there something we could document better to avoid similar issues in the future?