smoltcp-rs / smoltcp

a smol tcp/ip stack
BSD Zero Clause License
3.63k stars 402 forks source link

Don't delay ACKs for significant window updates #935

Open lrh2000 opened 1 month ago

lrh2000 commented 1 month ago

Hi smoltcp maintainers, thanks for your great work!

I've been experiencing extremely low performance (< 1Gbps) for two TCP sockets connected via the loopback device (PoC attached at the end of this PR). Further investigation reveals that ACKs carrying significant window updates are being delayed due to the delayed ACK mechanism. Although disabling the ACK delay serves as a workaround, I think it's better to go ahead and fix the bug here, so I'm submitting this PR.

According to the Linux kernel implementation, ACKs are always sent immediately when the receive window is significantly increased. Here "significantly" means doubling the receive window. However, this logic does not seem to be implemented by smoltcp, where no ACKs will be sent unless the delay ACK timer expires: https://github.com/smoltcp-rs/smoltcp/blob/ef67e7b46cabf49783053cbf68d8671ed97ff8d4/src/socket/tcp.rs#L2205-L2207

Another Problem

When I take a look at the window_to_update function, I also notice another problem: In the following code snippet, self.remote_last_win is the length of the last receive window, which cannot be directly compared with the length of the current receive window, because the start positions of the two windows are different. https://github.com/smoltcp-rs/smoltcp/blob/ef67e7b46cabf49783053cbf68d8671ed97ff8d4/src/socket/tcp.rs#L2133-L2142

For example, suppose the sender has sent packets that fill all the slots in the receive window, which means the sender cannot send any more packets unless the receiver responds with a window update. To be aware of this fact, the effective length of the last receive window should be treated as zero instead of the previously advertised length.

I've added a new function called last_scaled_window to fix the second problem, which should work in the same way as the tcp_receive_window function found in the Linux kernel implementation.

PoC

The PoC is modified from examples/loopback.rs:

#![cfg_attr(not(feature = "std"), no_std)]
#![allow(unused_mut)]
#![allow(clippy::collapsible_if)]

#[cfg(feature = "std")]
#[allow(dead_code)]
mod utils;

use log::debug;

use smoltcp::iface::{Config, Interface, SocketSet};
use smoltcp::phy::{Device, Loopback, Medium};
use smoltcp::socket::tcp;
use smoltcp::time::Instant;
use smoltcp::wire::{EthernetAddress, IpAddress, IpCidr};

fn main() {
    let device = Loopback::new(Medium::Ethernet);

    #[cfg(feature = "std")]
    let mut device = {
        utils::setup_logging("info");

        let (mut opts, mut free) = utils::create_options();
        utils::add_middleware_options(&mut opts, &mut free);

        let mut matches = utils::parse_options(&opts, free);
        utils::parse_middleware_options(&mut matches, device, /*loopback=*/ true)
    };

    // Create interface
    let mut config = match device.capabilities().medium {
        Medium::Ethernet => {
            Config::new(EthernetAddress([0x02, 0x00, 0x00, 0x00, 0x00, 0x01]).into())
        }
        Medium::Ip => Config::new(smoltcp::wire::HardwareAddress::Ip),
        Medium::Ieee802154 => todo!(),
    };

    let mut iface = Interface::new(config, &mut device, Instant::now());
    iface.update_ip_addrs(|ip_addrs| {
        ip_addrs
            .push(IpCidr::new(IpAddress::v4(127, 0, 0, 1), 8))
            .unwrap();
    });

    // Create sockets
    let server_socket = {
        // It is not strictly necessary to use a `static mut` and unsafe code here, but
        // on embedded systems that smoltcp targets it is far better to allocate the data
        // statically to verify that it fits into RAM rather than get undefined behavior
        // when stack overflows.
        static mut TCP_SERVER_RX_DATA: [u8; 65536] = [0; 65536];
        static mut TCP_SERVER_TX_DATA: [u8; 65536] = [0; 65536];
        let tcp_rx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_SERVER_RX_DATA[..] });
        let tcp_tx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_SERVER_TX_DATA[..] });
        tcp::Socket::new(tcp_rx_buffer, tcp_tx_buffer)
    };

    let client_socket = {
        static mut TCP_CLIENT_RX_DATA: [u8; 65536] = [0; 65536];
        static mut TCP_CLIENT_TX_DATA: [u8; 65536] = [0; 65536];
        let tcp_rx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_CLIENT_RX_DATA[..] });
        let tcp_tx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_CLIENT_TX_DATA[..] });
        tcp::Socket::new(tcp_rx_buffer, tcp_tx_buffer)
    };

    let mut sockets: [_; 2] = Default::default();
    let mut sockets = SocketSet::new(&mut sockets[..]);
    let server_handle = sockets.add(server_socket);
    let client_handle = sockets.add(client_socket);

    let dummy_data = vec![0; 4096];

    let start_time = Instant::now();

    let mut did_listen = false;
    let mut did_connect = false;
    let mut processed = 0;
    while processed < 1024 * 1024 * 1024 {
        iface.poll(Instant::now(), &mut device, &mut sockets);

        let mut socket = sockets.get_mut::<tcp::Socket>(server_handle);
        if !socket.is_active() && !socket.is_listening() {
            if !did_listen {
                debug!("listening");
                socket.listen(1234).unwrap();
                did_listen = true;
            }
        }

        while socket.can_recv() {
            let received = socket.recv(|buffer| (buffer.len(), buffer.len())).unwrap();
            debug!("got {:?}", received,);
            processed += received;
        }

        let mut socket = sockets.get_mut::<tcp::Socket>(client_handle);
        let cx = iface.context();
        if !socket.is_open() {
            if !did_connect {
                debug!("connecting");
                socket
                    .connect(cx, (IpAddress::v4(127, 0, 0, 1), 1234), 65000)
                    .unwrap();
                did_connect = true;
            }
        }

        while socket.can_send() {
            debug!("sending");
            socket.send_slice(&dummy_data[..]).unwrap();
        }

        // TODO: Call `poll_delay` and sleep
    }

    let duration = Instant::now() - start_time;
    println!(
        "done in {} s, bandwidth is {} Gbps",
        duration.total_millis() as f64 / 1000.0,
        (processed as u64 * 8 / duration.total_millis()) as f64 / 1000000.0
    );
}

Without this fix:

done in 10.778 s, bandwidth is 0.797002 Gbps
cargo run --release --example loopback  10.76s user 0.05s system 99% cpu 10.913 total

With this fix:

done in 0.564 s, bandwidth is 15.231305 Gbps
cargo run --release --example loopback  12.04s user 0.71s system 333% cpu 3.821 total