quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

unexpected low throughput #1231

Closed: neevek closed this issue 2 years ago

neevek commented 2 years ago

Hi, Thanks for the hard work.

I am trying to implement a proxy client/server pair using quinn. I got it working quickly thanks to quinn's succinct APIs, but the throughput is quite low.

The client runs on macOS and the server on Linux. The RTT reported by ping is 170ms, while quinn::Connection::rtt() reports 190ms. I have another C++ proxy implementation using TCP, which runs a lot faster over the same network link. I can see that the server is quite fast at writing data to the client, with 6k to 8k bytes per write most of the time, but the client reads only 1k to 3k bytes per attempt and never catches up with the server's write speed. I tried raising SND_BUF/RCV_BUF up to 3 megabytes, but nothing changed, and I don't know what to investigate next. I suspect there may be some mistakes in my code. The client code for relaying traffic is as follows:

pub async fn serve(&mut self, local_conn_receiver: &mut Receiver<TcpStream>) -> Result<()> {
    let remote_conn = &self.remote_conn.as_ref().unwrap();
    // accept local connections and build a tunnel to remote for accepted connections
    while let Some(local_conn) = local_conn_receiver.recv().await {
        match remote_conn.open_bi().await {
            Ok((remote_send, remote_recv)) => {
                tokio::spawn(Self::handle_stream(local_conn, remote_send, remote_recv));
            }
            Err(e) => {
                error!("failed to open_bi on remote connection: {}", e);
                break;
            }
        }
    }

    info!("quit!");
    Ok(())
}

async fn handle_stream(
    mut local_conn: TcpStream,
    mut remote_send: SendStream,
    mut remote_recv: RecvStream,
) -> Result<()> {
    info!("open new stream, id: {}", remote_send.id().index());

    let mut local_read_result = ReadResult::Succeeded;
    loop {
        let (mut local_read, mut local_write) = local_conn.split();
        let local2remote = Self::local_to_remote(&mut local_read, &mut remote_send);
        let remote2local = Self::remote_to_local(&mut remote_recv, &mut local_write);

        tokio::select! {
            Ok(result) = local2remote, if !local_read_result.is_eof() => {
                local_read_result = result;
            }
            Ok(result) = remote2local => {
                if let ReadResult::EOF = result {
                    info!("quit stream after hitting EOF, stream_id: {}", remote_send.id().index());
                    break;
                }
            }
            else => {
                info!("quit unexpectedly, stream_id: {}", remote_send.id().index());
                break;
            }
        };
    }
    Ok(())
}

async fn local_to_remote<'a>(
    local_read: &'a mut ReadHalf<'a>,
    remote_send: &'a mut SendStream,
) -> Result<ReadResult> {
    let mut buffer = vec![0_u8; 8192];
    let len_read = local_read.read(&mut buffer[..]).await?;

    if len_read > 0 {
        remote_send.write_all(&buffer[..len_read]).await?;
        Ok(ReadResult::Succeeded)
    } else {
        remote_send.finish().await?;
        Ok(ReadResult::EOF)
    }
}

async fn remote_to_local<'a>(
    remote_recv: &'a mut RecvStream,
    local_write: &'a mut WriteHalf<'a>,
) -> Result<ReadResult> {
    let mut buffer = vec![0_u8; 8192];
    let result = remote_recv.read(&mut buffer[..]).await?;
    if let Some(len_read) = result {
        local_write.write_all(&buffer[..len_read]).await?;
        local_write.flush().await?;
        Ok(ReadResult::Succeeded)
    } else {
        Ok(ReadResult::EOF)
    }
}
Ralith commented 2 years ago

Thanks for the report! A few questions to help narrow this down:

neevek commented 2 years ago
  1. I ran 4 tasks to download the same file concurrently; the overall throughput was 5.8 Mbit/s. (I also ran the same test through a TCP proxy, with a throughput of 10.2 Mbit/s.) The ping result from the client to the server is as follows:
    --- [ip] ping statistics ---
    954 packets transmitted, 923 packets received, 3.2% packet loss
    round-trip min/avg/max/stddev = 158.415/165.308/285.034/8.474 ms
    ping [ip]  0.03s user 0.15s system 0% cpu 15:56.71 total
  2. The CPU cost for the test is about 42%. (I also ran the same test through a TCP proxy; the CPU cost for the client was only 3.9%. The computer I run the tests on has a 2.6 GHz dual-core CPU.)
  3. The bulk benchmark on my computer is as follows:
    
    Transferred 1073741824 bytes on 1 streams in 115.50s (8.87 MiB/s)

Stream download metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │   8.86 MiB/s  │   115.49s
 P0   │   8.86 MiB/s  │   115.46s
 P10  │   8.87 MiB/s  │   115.52s
 P50  │   8.87 MiB/s  │   115.52s
 P90  │   8.87 MiB/s  │   115.52s
 P100 │   8.87 MiB/s  │   115.52s

Matthias247 commented 2 years ago

What throughput do you see when using TCP instead? With all the recent fixes performance is pretty good (but might still be below an optimized TCP stack for some scenarios).

In your code, you might want to try removing the flush calls. I think flush will wait until the peer has acked all the data, which takes at least a full RTT.

On the other hand, that would mean your app spends most of its time waiting, so throughput and CPU load should be even lower.

djc commented 2 years ago

Are you using main or 0.7? If the latter, maybe try a current snapshot?

neevek commented 2 years ago

Hi, @Matthias247 and @djc

  1. As stated in my last comment, I ran the same test through a TCP proxy; the throughput was 10.2 Mbit/s, about 2 times that of the quinn implementation. I also tried removing the flush call, with no effect. @Matthias247
  2. I was using 0.7.2, and as you suggested, I tried the main branch, but there was not much improvement; I can still see obvious slowdown viewing webpages with the proxy. I will test it out and provide the result tomorrow morning.

I am wondering if the issue is due to packet loss, but to my understanding, QUIC should be more tolerant of packet loss than TCP, am I right?

Matthias247 commented 2 years ago

Are you running a release build? 9MiB/s in the benchmark is far too low. It should be 100MiB/s and more even on old hardware.

Can you also print the congestion window on the sending side? Does it match the observed throughput?

neevek commented 2 years ago

Are you running a release build? 9MiB/s in the benchmark is far too low. It should be 100MiB/s and more even on old hardware.

Oh, I just found that the outputs of the release build are under the root directory. I reran the bulk benchmark and the result matches your expectation, but it shocks me that a debug build and a release build can make such a huge difference.

Transferred 1073741824 bytes on 1 streams in 9.30s (110.07 MiB/s)

Stream download metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │  110.09 MiB/s │     9.30s
 P0   │  110.06 MiB/s │     9.30s
 P10  │  110.12 MiB/s │     9.30s
 P50  │  110.12 MiB/s │     9.30s
 P90  │  110.12 MiB/s │     9.30s
 P100 │  110.12 MiB/s │     9.30s

Can you also print the congestion window on the sending side? Does it match the observed throughput?

How to do this?

I also reran the "4 concurrent download tasks" test after building my code against the main branch; this time I made a release build and removed verbose logging calls, but the throughput is still low.

I think the issue may be related to my network, because the results are not stable: sometimes they're better, sometimes worse. I just don't know what to look into.

Matthias247 commented 2 years ago

How to do this?

It's available in the stats() that you can get from a connection. You could simply print it once in a while (or do additional things, like calculating percentiles).
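
For example, something like this could log it periodically (a rough sketch, not code from this thread; cloning the connection handle and the stats().path.cwnd field layout are assumptions that may differ between quinn versions):

let conn = remote_conn.clone();
tokio::spawn(async move {
    loop {
        // `path.cwnd` is assumed to be where ConnectionStats reports the
        // congestion window; check the docs for your quinn version.
        info!("cwnd: {} bytes, rtt: {:?}", conn.stats().path.cwnd, conn.rtt());
        tokio::time::sleep(std::time::Duration::from_secs(1)).await;
    }
});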

Ralith commented 2 years ago

I can still see obvious slowdown viewing webpages with the proxy. I will test it out and provide the result tomorrow morning.

What were the results? Are you absolutely certain that you were running a release build (compiled with cargo build --release and copied from target/release/, or ran directly with cargo run --release)?

neevek commented 2 years ago

I can still see obvious slowdown viewing webpages with the proxy. I will test it out and provide the result tomorrow morning.

What were the results? Are you absolutely certain that you were running a release build (compiled with cargo build --release and copied from target/release/, or ran directly with cargo run --release)?

Sorry for the delay, I have been very busy recently. I will return to the task in maybe 3 months; I will test it with the latest quinn at that time and let you know.

Ralith commented 2 years ago

Thanks for the update! We'll look forward to details whenever you're able to provide them.

neevek commented 2 years ago

Today I got some time to update my code to use 0.8.0. I noticed that after opening ~120 streams on a single connection, the async call to open_bi() responds very slowly; it behaves as if it hangs internally. The doc on open_bi() says ... Streams are cheap and instantaneous to open unless blocked by flow control., so I was thinking the issue might be related to congestion control, and I tried BBR as below (not sure if the setup is correct), but the result is the same. I also ran with RUST_LOG=trace, but I didn't see any helpful debug messages.

For my use case, a few hundred or up to a thousand streams on a single connection are required. I want to make sure I am using the APIs correctly; the key code follows:

The TransportConfig:

let mut cfg = quinn::ClientConfig::new(Arc::new(crypto));
let mut transport_cfg = TransportConfig::default();
transport_cfg.receive_window(quinn::VarInt::from_u32(1024 * 1024));
transport_cfg.send_window(1024 * 1024);
transport_cfg.congestion_controller_factory(Arc::new(congestion::BbrConfig::default()));

// Keep the timeout as a plain millisecond count so it can drive both settings.
let idle_timeout_ms: u32 = 1000 * 5;
transport_cfg.max_idle_timeout(Some(IdleTimeout::from(VarInt::from_u32(idle_timeout_ms))));
transport_cfg.keep_alive_interval(Some(Duration::from_millis(idle_timeout_ms as u64 / 2)));
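
The transport config is then attached to the client config (a sketch, assuming the 0.8 API where ClientConfig exposes a public transport field; later versions use a setter method instead):

// Assumption: quinn 0.8's ClientConfig has a public `transport` field.
cfg.transport = Arc::new(transport_cfg);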

The snippet calling open_bi() in a loop:

while let Some(local_conn) = local_conn_receiver.recv().await {
    match remote_conn.open_bi().await {
        Ok((remote_send, remote_recv)) => {
            tokio::spawn(Self::handle_stream(local_conn, remote_send, remote_recv));
        }
        Err(e) => {
            error!("failed to open_bi on remote connection: {}", e);
            break;
        }
    }
}

And the tokio runtime setup (I am not sure if this is related to tokio):

tokio::runtime::Builder::new_multi_thread()
    .enable_all()
    .worker_threads(16)
    .build()
    .unwrap()
    .block_on(async {
        run().await.unwrap();
    });
Matthias247 commented 2 years ago

You don’t seem to have increased the max_streams limit. Once you hit that (on either side), opening a stream needs to wait until a previous one closes.

Ralith commented 2 years ago

Flow control and congestion control are independent factors. As @Matthias247 observes, you're almost certainly running into stream flow control limits, which you can relax by configuring a larger maximum number of concurrent streams.
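
For example (a sketch against the 0.8 TransportConfig API; 1024 is just an illustrative value, and the limit matters on the side that accepts the streams, i.e. the server config for client-opened streams):

// Allow up to 1024 concurrent bidirectional streams per connection.
// Method name follows quinn 0.8; other versions may differ.
transport_cfg.max_concurrent_bidi_streams(quinn::VarInt::from_u32(1024));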

neevek commented 2 years ago

Thank you. Indeed I hit the stream limit; after I configured the max_streams limit as suggested, the issue is gone. Now back to the original throughput test. I simplified the test by downloading a single file and running everything on the same computer, so there is only one connection with one stream. The setup is as follows:

{client} -> {local_tcp_proxy_server} -> {quinn_client} -> {quinn_server} -> {remote_tcp_proxy_server} -> {http_server}

The first 3 programs run in the same process; {remote_tcp_proxy_server} is a C++ program running in another process, and {http_server} is an nginx server.

The test is to download a 1GB file. Downloading the file through {local_tcp_proxy_server} takes about 3s, while downloading it through {remote_tcp_proxy_server} directly takes about 800ms, so the Quinn path is ~3.75x slower.

Another issue I am facing is that the opened streams are not closed in time, which causes TCP connections between {client} and {local_tcp_proxy_server} to be kept alive for too long and the program quickly hits the max open files limit. When a stream is opened, I use a loop with tokio::select! on that stream to transfer data (see the code in my original post); all other TCP connections related to the session (a stream) are closed only when the quinn stream hits EOF on either side. I am not sure if this is related to quinn, because for the TCP part I am using tokio's TcpStream, but I don't know how to debug the problem. The code works, but not quite correctly.

neevek commented 2 years ago

Sorry, I accidentally pressed the wrong button.

Ralith commented 2 years ago

The test is to download a 1GB file. Downloading the file through {local_tcp_proxy_server} takes about 3s, downloading the file through {remote_tcp_proxy_server} directly takes about 800ms, it's ~3.75x slower.

Thanks for including specific numbers! I'm guessing this is with Linux on both ends? Translating this into bandwidth, Quinn's running at 2.67 Gbps, and TCP is hitting 10 Gbps. That's much faster than most network links; I'm guessing you're running this in a datacenter environment? Quinn is likely CPU bound, though you could confirm this by inspecting CPU load, bearing in mind that there will not be significant parallelism with only a single connection and endpoint.

Low-latency, high-reliability links such as you find in datacenters are best-case for TCP, so this is a challenging test. The largest single CPU cost in Quinn is encryption, which can't really be avoided. A direct TCP connection between your client and your remote TCP proxy is passing plaintext bytes directly, which is much cheaper. You could make the comparison more fair by adding a layer of TLS encapsulation between your client and the remote TCP proxy when testing without Quinn.
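
For instance, a thin TLS layer on the TCP path could look roughly like this (a sketch, not part of this thread; it assumes tokio-rustls with the rustls 0.20-style builder, and example.com and root_store are placeholders):

use std::sync::Arc;
use tokio::net::TcpStream;
use tokio_rustls::{rustls, TlsConnector};

async fn tls_wrap(stream: TcpStream) -> std::io::Result<tokio_rustls::client::TlsStream<TcpStream>> {
    let mut root_store = rustls::RootCertStore::empty();
    // ... add trust anchors for the proxy's certificate here ...
    let config = rustls::ClientConfig::builder()
        .with_safe_defaults()
        .with_root_certificates(root_store)
        .with_no_client_auth();
    let connector = TlsConnector::from(Arc::new(config));
    let server_name = rustls::ServerName::try_from("example.com").unwrap();
    connector.connect(server_name, stream).await
}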

Beyond that, optimizing Quinn's CPU use is an ongoing project (see e.g. https://github.com/quinn-rs/quinn/issues/914, https://github.com/quinn-rs/quinn/pull/1219). Hopefully we will continue to make progress with future releases; contributions in this area would also be very welcome. QUIC has been shown to be capable of competing with TLS over TCP on a level playing field, but achieving this in practice given the maturity of TCP implementations is a high bar.

Another issue I am facing is that the opened streams are not closed in time, which causes TCP connections between {client} and {local_tcp_proxy_server} to be kept alive for too long and the program quickly hits the max open files limit

To be clear, are the QUIC streams remaining open for too long, or the TCP sockets? Most TCP implementations keep sockets around for a while after you close them to improve reliability; see e.g. http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html for discussion. This can often be disabled, but doing so changes the behavior of your TCP connections in ways that may not be desirable--the behavior's there for a reason, after all.

neevek commented 2 years ago

Hi, @Ralith

I found the culprit that caused the stalls in my previous tests: it is tokio::select!. I used select! to schedule code for reading and writing data between TCP connections and Quinn streams, but the code for reading from the TCP streams stalls at times. According to the doc on tokio::select!:

By running all async expressions on the current task, the expressions are able to run concurrently but not in parallel. This means all expressions are run on the same thread and if one branch blocks the thread, all other expressions will be unable to continue. If parallelism is required, spawn each async expression using tokio::spawn and pass the join handle to select!.

I switched to tokio::spawn, the issue is gone, and my code with Quinn starts to fly!
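
For reference, a spawn-per-direction relay looks roughly like the sketch below (a sketch rather than the exact code; it assumes anyhow's Result and that quinn's SendStream/RecvStream implement tokio's AsyncWrite/AsyncRead, as they do with default features):

use anyhow::Result;
use quinn::{RecvStream, SendStream};
use tokio::net::TcpStream;

async fn handle_stream(
    local_conn: TcpStream,
    mut remote_send: SendStream,
    mut remote_recv: RecvStream,
) -> Result<()> {
    // Owned halves so each direction can move into its own task.
    let (mut local_read, mut local_write) = local_conn.into_split();

    // Upstream: TCP -> QUIC. Finish the send stream once the TCP side hits EOF.
    let up = tokio::spawn(async move {
        let _ = tokio::io::copy(&mut local_read, &mut remote_send).await;
        let _ = remote_send.finish().await;
    });

    // Downstream: QUIC -> TCP.
    let down = tokio::spawn(async move {
        let _ = tokio::io::copy(&mut remote_recv, &mut local_write).await;
    });

    // Wait for both directions; a stall in one no longer blocks the other.
    let _ = tokio::join!(up, down);
    Ok(())
}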

Low-latency, high-reliability links such as you find in datacenters are best-case for TCP, so this is a challenging test. The largest single CPU cost in Quinn is encryption, which can't really be avoided. A direct TCP connection between your client and your remote TCP proxy is passing plaintext bytes directly, which is much cheaper. You could make the comparison more fair by adding a layer of TLS encapsulation between your client and the remote TCP proxy when testing without Quinn.

It was my fault, I didn't think about that; the comparison was indeed not fair. I re-ran the test with a TCP server with encryption in front of the original proxy server, and now the Quinn version is ~1.8 times faster than the TCP version while costing slightly more CPU. I still haven't had time to run a strict comparison, but from what I've seen so far, the Quinn version must be better.

Another thing that surprised me was that the Quinn version with BBR congestion control performed A LOT better than the TCP version. On a network with RTT 160ms and 7.2% packet loss, the TCP version runs at 0.45MiB/s while the Quinn version runs at 1.6MiB/s, which is more than 3 times faster.

Finally, I must say thank you to all of you, the authors of Quinn, for the hard work.

Ralith commented 2 years ago

I used select! to schedule code for reading and writing data between TCP connections and Quinn streams, but the code for reading from the TCP streams stalls at times

Hmm, that's interesting; if you're using tokio's TCP primitives, they should be properly concurrent. Might be worth raising an issue upstream with tokio, as an async TCP read should never block the thread. It's not impossible that the spawn-enabled parallelism makes all the difference despite that, but this sounds like a surprisingly large effect for a workload that I assume is mostly just copying bytes.

I re-ran the test with a TCP server with encryption in front of the original proxy server, and now the Quinn version is ~1.8 times faster than the TCP version while costing slightly more CPU

Awesome, thanks for trying that and reporting back!

On a network with RTT 160ms and 7.2% packet loss, the TCP version runs at 0.45MiB/s while the [BBR] Quinn version runs at 1.6MiB/s, which is more than 3 times faster.

We don't actually have a lot of data for the real-world performance of our BBR implementation (hence the experimental warning in the docs), so this is great to hear! Credit to @FrankSpitulski for all his hard work there.

I'm glad Quinn has proven an effective solution for you; thanks for sticking with it through this lengthy performance investigation!