smoltcp-rs / smoltcp

a smol tcp/ip stack
BSD Zero Clause License

Performance drop with increasing buffer size #949

Open HKalbasi opened 2 months ago

HKalbasi commented 2 months ago

I have a smoltcp device that connects to targets with varying latency. To increase throughput I use large buffer sizes, but the problem is that there is an "ideal" buffer size: below it, throughput scales roughly linearly with buffer size; at it, throughput peaks; and above it, throughput drops sharply. And since the targets have different latencies, there is no one-size-fits-all buffer size. My intuition was that increasing the buffer size only gives smoltcp more room and should change nothing if the connection never reaches such large window sizes, so increasing the buffer size should always give equal or better throughput (at the cost of more memory usage), but that is not the case in my experiments.
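(Presumably the ideal point is roughly the bandwidth-delay product of the path: for example, at 100 Mbit/s with a 100 ms round-trip time that is about 100e6 / 8 × 0.1 ≈ 1.25 MB of data in flight.)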

I don't have a small repro at the moment, but I can try to make one if this behavior is not expected.

Dirbaio commented 2 months ago

This can happen if your network device can only queue a limited number of frames in its rx or tx queues, and drops excess frames instead of buffering them or telling smoltcp to slow down on tx (by returning Exhausted). smoltcp (or the remote side) sends a bunch of frames because it sees the window is big, which then overflows the queue and causes packets to get lost.

There's a workaround: max_burst_size, which caps the amount of in-flight TCP data at any given time.
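For reference, a minimal sketch of where that knob lives, assuming the DeviceCapabilities struct from smoltcp 0.11 (the numbers are illustrative, not recommendations):

use smoltcp::phy::{DeviceCapabilities, Medium};

// Sketch of a custom Device's capabilities(). max_burst_size caps how many
// packets smoltcp emits back-to-back, which indirectly bounds in-flight data.
fn capabilities() -> DeviceCapabilities {
    let mut caps = DeviceCapabilities::default();
    caps.medium = Medium::Ip;          // illustrative; matches an IP-level device
    caps.max_transmission_unit = 1500; // illustrative MTU
    caps.max_burst_size = Some(16);    // illustrative cap on frames per burst
    caps
}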

HKalbasi commented 2 months ago

I think packet loss was originally the problem, but now that I have increased the buffer size there is no packet loss, yet I still see this slowness. I use libpcap to inject packets, and originally I got a "Resource temporarily unavailable" error from the libpcap device. I then increased the SO_SNDBUF of the socket, and now libpcap seems happy with my packets; maybe some upstream device drops them now. I can try to match packets on the target to see whether there is some packet loss.

Now I have three questions:

First, how can I tell smoltcp to slow down? Device::transmit only returns an Option. Does returning None from it suffice?

Second, how does the OS TCP stack handle this problem? I tried it in a mininet with 3 nodes, in this topology:

node1 ----------------|1Gbps link|------------------------- node 2 ------------------|100Mbps link|---------------- node 3

On node2 there is a small libpcap program that bridges the two interfaces with a small buffer of ~200 packets, and the links have 100ms latency. The OS TCP socket is able to reach nearly 100Mbps within seconds, even though there is no mechanism that notifies the OS on node1 that the link capacity is full.

Third, I previously had max_burst_size = Some(1) in my device capabilities, which I probably just copied from some example. But it seems to have no effect: if it were limiting in-flight packets to one, changing the buffer size should not change anything, which is not the case. My device is Medium::Ip, if that is relevant. What should I set it to if my link has, e.g., 100Mbit/s capacity?

Dirbaio commented 2 months ago

First, how can I tell smoltcp to slow down? Device::transmit only returns an Option. Does returning None from it suffice?

Yes. If the phy can't transmit right now, return None. Later, when it's ready to transmit again, poll the interface again.
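A minimal sketch of that pattern, assuming smoltcp 0.11's Interface, SocketSet and Device APIs (setup code omitted): when transmit() returns None during a poll, the queued data simply stays in the socket buffers and is picked up by a later poll.

use smoltcp::iface::{Interface, SocketSet};
use smoltcp::phy::Device;
use smoltcp::time::Instant;

// Sketch of a poll loop; `iface`, `device` and `sockets` are assumed to be
// set up elsewhere. If the phy reported backpressure (transmit() returned
// None), polling again later resumes transmission from the socket buffers.
fn poll_loop<D: Device>(
    iface: &mut Interface,
    device: &mut D,
    sockets: &mut SocketSet<'_>,
) -> ! {
    loop {
        let timestamp = Instant::now();
        iface.poll(timestamp, device, sockets);

        // Sleep until smoltcp's next timer, or ideally until the device
        // becomes readable/writable again (readiness handling not shown).
        if let Some(delay) = iface.poll_delay(timestamp, sockets) {
            std::thread::sleep(delay.into());
        } else {
            // Nothing scheduled; avoid busy-spinning in this simplified sketch.
            std::thread::sleep(std::time::Duration::from_millis(1));
        }
    }
}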

The OS TCP socket is able to reach nearly 100Mbps within seconds, even though there is no mechanism that notifies the OS on node1 that the link capacity is full.

This is done with "congestion control". If node1 sees packets are getting lost, it assumes it's because it exceeded the capacity of some link in the path, and slows down.

Actually, this is something you could try that might help with the slowness. The latest release, 0.11, didn't have any congestion control at all, but it has been added recently: https://github.com/smoltcp-rs/smoltcp/pull/907. Maybe try using smoltcp from git with congestion control enabled and see if it helps.
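For anyone trying this, a rough sketch of what enabling it looks like, assuming the CongestionControl type is exported from the tcp socket module on git (it may sit behind a Cargo feature); the buffer sizes are illustrative:

use smoltcp::socket::tcp;

// Sketch: a TCP socket with the git-only congestion control enabled.
// set_congestion_control is the call used later in this thread.
fn make_socket() -> tcp::Socket<'static> {
    let rx = tcp::SocketBuffer::new(vec![0u8; 65536]);
    let tx = tcp::SocketBuffer::new(vec![0u8; 65536]);
    let mut socket = tcp::Socket::new(rx, tx);
    socket.set_congestion_control(tcp::CongestionControl::Cubic);
    socket
}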

The "max burst size" thing is actually kind of a hack to workaround lack of congestion control, but onyl takes into account the local buffer queue's size.

HKalbasi commented 1 month ago

I tried enabling congestion control, with no success: Screencast from 07-10-2024 01:28:23 PM.webm

The server at 10.0.0.1 is a yes | nc -l 0.0.0.0 8000 &, which uses the OS TCP stack, and the one at 10.0.0.5 is smoltcp listening on a raw socket.

My code is available here. I created a phy::RawSocket device, used 65535000 as the buffer size, removed the panic at smoltcp/src/phy/raw_socket.rs:133:25 (it hits "No buffer space available (os error 105)") and replaced it with (), and used tcp1_socket.set_congestion_control(CongestionControl::Cubic) to set the congestion control. I tested various bandwidths and delays, and in all of them smoltcp is suboptimal.
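Roughly, the device setup described above looks like this (a sketch assuming smoltcp's std-only RawSocket phy; the interface name is a placeholder):

use smoltcp::phy::{Medium, RawSocket};

// Sketch: an IP-level raw-socket device like the one described above.
// "veth0" is a hypothetical interface name.
fn make_device() -> RawSocket {
    RawSocket::new("veth0", Medium::Ip).expect("failed to open raw socket")
}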

HKalbasi commented 1 month ago

I found out that 1Gbit/s and 100Mbit/s were too high: my system can only write to AF_PACKET sockets at about a 72Mbit/s rate. So I changed the rates to 10Mbit/s and 1Mbit/s. The OS TCP socket is still able to sustain a constant 920Kbit/s rate, but smoltcp's rate fluctuates and sometimes drops to zero, as in the video I sent above.

I captured the traffic on both sides and noticed that Wireshark marks many packets in the smoltcp capture red, with the labels Spurious Retransmission and Out of Order, but is happy with the OS capture.

HKalbasi commented 1 month ago

I investigated a bit and (one part of) the problem seems to be here:

https://github.com/smoltcp-rs/smoltcp/blob/7b125ef6010311ea6cbd432496cace3b299e6b29/src/socket/tcp.rs#L338-L355

If the timer is already in the Retransmit state and has not yet expired, it won't be updated, so a retransmission becomes inevitable no matter how many ACKs have been received in the meantime, and Wireshark marks it as a Spurious Retransmission since the corresponding data was already acknowledged.

If I change the code above to:

match *self {
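    // Changed: also re-arm when already in the Retransmit state, so each call
    // pushes the expiry forward instead of leaving a stale deadline that later
    // fires as a spurious retransmission.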
    Timer::Idle { .. } | Timer::FastRetransmit { .. } | Timer::Retransmit { .. } => {
        *self = Timer::Retransmit {
            expires_at: timestamp + delay,
            delay,
        }
    }
    Timer::Close { .. } => (),
}

It solves the problem and the code becomes able to use all of the bandwidth of a 1Mbit/s link. I'm not sure this is the right thing to do, but I think this part of the code needs some attention.