smoltcp-rs / smoltcp

a smol tcp/ip stack

Better buffer structures for TCP sockets #359

Open jordens opened 4 years ago

jordens commented 4 years ago

Currently the SocketBuffers for the RX and TX sides of the TCP sockets are RingBuffers. This is unfortunate since they hand out sub-optimal partial chunks once the data/space spans the wrap-around. The latter happens often in practice (at a rate of roughly packet/message size divided by buffer size). There is currently neither a smoltcp API to get access to the entire free space (for send()) or to all buffered data (for recv()), nor would any higher protocol layer typically support such an API (you'd have to deal with two chunks). Since the upper protocol always needs to recv/send the partial chunk up to the wrap-around before it gets access to the remaining chunk, it will need to buffer and copy. It shouldn't need to do that, since either more space/data is already available, or it can get there by merely waiting a bit and trying again later if the available space/data is insufficient.
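
For illustration, with the closure-based socket API the closure only ever sees the contiguous run up to the wrap-around, so a consumer that needs a whole message in one slice has to copy and stitch the pieces itself (rough sketch; Message::parse and handle are just placeholders):

// Rough sketch of today's behaviour, not a proposal. The closure passed to
// recv() is handed only the contiguous chunk up to the wrap-around point.
socket.recv(|chunk| {
    match Message::parse(chunk) {
        // a full message fits in the contiguous chunk: consume it
        Some((len, msg)) => { handle(msg); (len, ()) }
        // not enough contiguous bytes: the message may simply not have
        // arrived yet, or it may be split across the wrap-around -- the
        // caller can't tell without copying into its own scratch buffer
        None => (0, ()),
    }
}).unwrap();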

There seem to be better data structures that would allow protocols on top of TCP to be implemented without having to buffer again at the protocol layer. Maybe bip buffers are such a structure. The interface wouldn't even be very different from RingBuffer. I don't know what other stacks are using.

whitequark commented 4 years ago

Wouldn't bip buffers require paging and use unsafe code? I'm not sure I like having either of those in smoltcp core, since being written in 100% safe Rust and not requiring anything weird from the platform are two of its selling points.

Having those as an option might be OK, but I'm also not sure if there is a way to do that cleanly.

whitequark commented 4 years ago

A possible alternative is an API that uses a list of chunks instead of a single chunk, similar to iovec.
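
Roughly, such an API could hand the caller both sides of the wrap-around at once (send_vectored and serialize_across are made-up names, nothing smoltcp exposes today):

// Hypothetical vectored variant of send(): the closure receives the chunk up
// to the wrap-around and the chunk after it, and reports the total number of
// bytes it filled -- much like writev() over a two-entry iovec.
socket.send_vectored(|head: &mut [u8], tail: &mut [u8]| {
    let n = serialize_across(head, tail); // the caller handles two chunks
    (n, ())
});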

jordens commented 4 years ago

Is there really no unsafe code in the existing dependencies? byteorder seems to have quite some. I suspect the bbqueue or spsc-bip-buffer authors would say they do it cleanly and that the unsafe code is fully analyzed and sound. I have no idea about iovec. Would it also allow you to just wait and try again later if the available space isn't large enough yet? An API that gives you just one, ultimately large enough, chunk to handle (to serialize your packet into) seems very appealing.

whitequark commented 4 years ago

Is there really no unsafe code in the existing dependencies? byteorder seems to have quite some. I suspect the bbqueue or spsc-bip-buffer authors would say they do it cleanly and that the unsafe code is fully analyzed and sound.

Fair point. We can do that.

An API that gives you just one, ultimately large enough, chunk to handle (to serialize your packet into) seems very appealing.

I gave bbqueue a quick look, seems OK to me overall. Can we make it an opt-in feature without breaking the overall API? If so (seems like we can, but maybe I'm missing something), I'm happy to merge this.

jordens commented 4 years ago

Yes I think this wouldn't change any API and it can be optional. But let's get some more input on this. I don't feel like the expert here.

ryan-summers commented 4 years ago

Just to add to this discussion - we're working on an MQTT implementation on top of smoltcp which uses a TCP socket as the underlying transport mechanism.

From our point of view, a bip buffer would play very nicely. Potentially, our API would transform from

  1. Construct MQTT packet in local buffer (stack/heap)
  2. Write data to TCP socket
  3. (Internally) copy packet buffer into TCP socket buffer

into:

  1. Request buffer from the TCP socket
  2. Construct MQTT packet in-place in the requested buffer
  3. Commit changes to the TCP socket

This has a number of advantages:

  1. We don't have to worry about potentially writing half a packet to the TCP socket due to insufficient space in the underlying socket. (This could be avoided with a check on the available size, but we're using a generic trait implementation of embedded-nal, which doesn't let us check the underlying buffer size.)
  2. We eliminate a copy of the data, which speeds up the protocol and increases throughput.
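
Concretely, the second flow above might look something like this (send_grant/commit are hypothetical names modelled on bbqueue-style grants, not an existing smoltcp API):

// 1. Request a contiguous writable region of the socket's TX buffer.
let mut grant = tcp_socket.send_grant(MAX_MQTT_PACKET_LEN).unwrap();
// 2. Serialize the MQTT packet directly into that region -- no staging buffer.
let len = mqtt_packet.serialize(&mut grant[..]).unwrap();
// 3. Commit only the bytes actually written; they become visible to TCP.
grant.commit(len);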

I'd be happy to look into implementing this if there's interest.

dnadlinger commented 4 years ago

Not to detract from the (potential) merits of contiguous buffers and avoiding copies, but I don't quite see the issue with

We don't have to worry about potentially writing half a packet to the TCP socket.

After all, TCP is stream-based, not message-based, so receivers can deal just fine with the second half arriving a while later when there was buffer space available for it to be sent.

ryan-summers commented 4 years ago

@dnadlinger From a higher-level perspective, this means the layer above the TCP socket then has to keep state about the half-a-packet it sent (if the high-level protocol doesn't support invalid packet correction/detection, or if the half-packet causes some form of sync-loss). This then forces the upper layer to keep a buffer for the packet around, even if it doesn't necessarily want to.

Basically, with the proposed implementation, the higher-level layer knows in advance whether or not it can send the datagram it wants to. It doesn't have to fall back to a "try, then handle the partial write" approach if it wasn't able to complete the whole operation.

Ventilador commented 1 month ago

I'm sorry for reviving this, but I found this issue while looking into a problem I had with abusing the fact that the read buffer is exposed and creating packets on the fly without copies/allocs. It took me a while to understand why my process was sometimes getting stuck. Anyways.

I've implemented a defrag method on the RingBuffer, and at least for the TCP buffers it works. You can see a very rudimentary implementation here.
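
The general idea is just to rotate the backing storage so the unread data becomes one contiguous run starting at index 0, roughly (not the actual linked code):

// Rough idea only: make the buffered data contiguous by rotating the storage
// so it starts at index 0, then reset the read offset accordingly.
fn defrag(storage: &mut [u8], read_at: &mut usize) {
    storage.rotate_left(*read_at);
    *read_at = 0;
}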

I'd be happy to create a PR with it and/or discuss API alternatives. I've tested (manually, on my "app") all 4 cases and they seem to work perfectly, no bytes lost at all, even when the buffers are full (the two expensive cases).

Also, thanks for the lib, it's awesome.

whitequark commented 1 month ago

I found this issue while looking into a problem I had with abusing the fact that the read buffer is exposed and creating packets on the fly without copies/allocs

This is not abuse of the API but rather an intentional feature!

Regarding "defrag", I would suggest restructuring your higher-level application to not require contiguous slices, like using a pattern where you feed the protocol data to a writer piece by piece after making sure that you have enough total free space. This mirrors how reads must be done anyway, so I think most applications should be able to handle that.
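
Roughly like this, where `encoded` is the fully serialized protocol message and send_capacity()/send_queue() stand in for whatever the application uses to know the total free TX space:

// Sketch of the "piece by piece" pattern (free-space accessors assumed).
let free = socket.send_capacity() - socket.send_queue();
if free >= encoded.len() {
    let mut remaining: &[u8] = &encoded;
    while !remaining.is_empty() {
        socket.send(|chunk| {
            // `chunk` is whatever contiguous space the ring currently has;
            // fill as much of the packet into it as fits, loop for the rest
            let n = chunk.len().min(remaining.len());
            chunk[..n].copy_from_slice(&remaining[..n]);
            remaining = &remaining[n..];
            (n, ())
        }).unwrap();
    }
}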

Ventilador commented 1 month ago

I honestly don't know how I could do that in my case without a lot of complexity. If you're not interested in my problem, please ignore.

I have a laptop from work with a VPN, but I want to keep using my PC, so I wrote my own VPN. I have smoltcp set up as a TUN device, with a bunch of TCP sockets listening on port 443 (it's the only one I care about), and I've configured iptables to route specific IPs (the ones from work) to my TUN device. I also have another socket listening on a "random" port like 9000 or something, and a regular std::net::TcpStream from the laptop connects to it (so I don't need admin rights; it runs Windows), and the two of them talk over it. When the TUN device receives a new connection for one of the known IPs, it sends a packet to the laptop to open the same connection there and then routes the traffic while it's open. I also use this (with another TCP socket) to do some fs stuff (like sending/receiving folders) and to run shell commands.

I've made a macro for serializing/deserializing into/from &[u8]; for example, a &str writes 4 bytes for its length and then the content, enums use 1 u8 plus whatever they carry, and so on. Because of this I can do something like the following (roughly).

On every loop I pop main_soc from the SocketSet, so I can use it while using the other ones:

fn move_data(main_soc: &mut Socket, other: &mut Socket) {
    main_soc.send(|write_buffer| {
        let wrote = other.recv(|read_buffer| {
            // wrap whatever the other socket received into a Payload packet
            // (ip/port of that connection are known from elsewhere)
            let packet = Packet::Payload(ip, port, read_buffer);
            let to_write: usize = packet.to_write();
            if write_buffer.len() >= to_write {
                packet.write_to(write_buffer);
                // consume everything we read, report how much we serialized
                (read_buffer.len(), to_write)
            } else {
                // not enough contiguous space yet: consume nothing, retry later
                (0, 0)
            }
        }).unwrap();
        (wrote, ())
    }).unwrap()
}

and

fn move_data(main_soc: &mut Socket, others: &mut Vec<Socket>) {
    main_soc.recv(|mut read_buffer| {
        let mut acc = 0;
        while let Some((used, packet)) = Packet::parse(read_buffer) {
            match packet {
                Packet::Payload(ip, port, payload) => {
                    // look up the socket belonging to this connection
                    let soc = others.find(port, ip);
                    let sent = soc.send(|write| {
                        if write.len() >= payload.len() {
                            write[..payload.len()].copy_from_slice(payload);
                            (payload.len(), true)
                        } else {
                            // not enough contiguous space, try again later
                            (0, false)
                        }
                    }).unwrap();
                    if sent {
                        acc += used;
                        // advance past the packet we just forwarded; mem::take
                        // sidesteps the reborrow issue when reslicing &mut
                        read_buffer = &mut std::mem::take(&mut read_buffer)[used..];
                    } else {
                        break;
                    }
                }
                Packet::Open(...) => ...,
                Packet::Fs(...) => ...,
            }
        }
        (acc, ())
    }).unwrap()
}

I have a max limit on the packet sizes (way smaller than the buffer sizes), but I can still end up with half a packet at the end of the ring. Before implementing this, I had an extra buffer that would copy the last bytes, but then I needed to copy the next read_buf into that cache buffer to have a packet in a contiguous slice, which makes it very convoluted.

I hope that made sense.

whitequark commented 1 month ago

Oh, so it's not an embedded use case? If you're on a PC I would suggest allocating, since that's not a huge deal. I initially assumed you were on something like an MCU with 4K of RAM, where a bigger buffer can be a deal-breaker.

Ventilador commented 1 month ago

Oh no, sorry, I'm just doing it this way because I have problems; I wanted to keep allocation to a minimum and buffer sizes small, just "to make it fun".