Leveraging vectored I/O during serialization

notgull commented 2 years ago

Right now, the Serialize trait looks something like this:

pub trait Serialize {
    type Bytes;
    fn serialize(&self) -> Self::Bytes;
    fn serialize_into(&self, bytes: &mut Vec<u8>);
}

Meanwhile, the Request trait has a method like this (ignoring the existence of fds):

fn serialize(self, extension_opcode: u8) -> Vec<Vec<u8>>

In practice, this usually means that for the request, the "body" (i.e. the non-list part of the request) is serialized into one vector, and the remaining lists are serialized into a Vec<u8> each. While this is good, it means that we're spending time serializing and buffering our data, which is a problem that vectored writes were designed to solve.

In breadx, I was experimenting with a trait design like this:

pub trait Serialize {
    fn num_slices(&self) -> usize;
    // returns remainder in slices and count in bytes
    fn serialize<'a>(&'a self, slices: &mut [IoSlice<'a>]) -> (&mut [IoSlice<'a>], usize);
}

// usage
let num_slices = foo.num_slices();
let mut slices = Vec::with_capacity(num_slices);
slices.resize_with(num_slices, || IoSlice::new(&[]));
foo.serialize(&mut slices);

Where num_slices tells the user how many IoSlices the object will use during serialization, and serialize writes those slices into the given buffer. In my experiments I also added a with_slices method that takes a callback with the mutable IoSlices. By default it used the Vec implementation above, but it could be overridden to use an array and completely avoid heap memory.

The main issue with the above, which I discovered during experimentation, is that requests end up having several fields, and dedicating an I/O slice to each one ends up with needing dozens of I/O slices for memory that's mostly contiguous. However, my solution to that was to use bytemuck to convert raw structures to bytes, like so:

struct FooRequest<'input> {
    foo: u32,
    bar: u32,
    my_list: Cow<'input, [OtherThing]>,
}

// have some conversion method to convert FooRequest to FooRequestSer, like
// maybe an IntoSerializable trait

struct FooRequestSer<'input> {
    body: FooBody,
    my_list: Cow<'input, [OtherThing]>,
}

#[derive(bytemuck::Pod, bytemuck::Zeroable)]
#[repr(C)]
struct FooBody {
    foo: u32,
    bar: u32,
}

// FooRequestSer can be converted into two I/O slices: one for FooBody
// (we can get this using bytemuck::bytes_of()), and one for my_list
// (if OtherThing is plain old data, we can use bytemuck::cast_slice() to
// cast it to a slice of bytes)

What are your thoughts on this? It would certainly be no small amount of work, and I'm not sure it would be worth it, but this is pretty close to what libxcb does through unsafe pointer casting.

psychon commented 2 years ago

Since you are talking about optimisations, I fear I would want some benchmarks on this. Something that uses criterion to serialise a request and compares that against a hand-written solution? What would be a "bad" request where this matters a lot?

and the remaining lists are serialized into a Vec each

What would be a request where this matters a lot? My "go to example" for "requests causing problems with lots of data" would be PutImage, and there we directly provide the raw pixel bytes for I/O without needing to copy them:

https://github.com/psychon/x11rb/blob/e6a7071f8c3f472781815cd1457966fd20533cfd/x11rb-protocol/src/protocol/xproto.rs#L17028

https://github.com/psychon/x11rb/blob/e6a7071f8c3f472781815cd1457966fd20533cfd/x11rb-protocol/src/protocol/xproto.rs#L17076

But this 'input lifetime is a problem that I did not manage to pass through an trait, which is why the serialize method is outside of a trait. Then.... I basically gave up and did the horrible thing of "just copy everything":

https://github.com/psychon/x11rb/blob/e6a7071f8c3f472781815cd1457966fd20533cfd/x11rb-protocol/src/protocol/xproto.rs#L17127-L17136

So, I bet it would be quite simple to write code that does better than this serialize trait impl. Dunno how much the allocations in the non-trait serialized really costs, but it should also be better to do better than that.

GetImage likely is a bad example for this kind of experiments. I just looked through xproto.xml and did not find a single request with more than one list. xkb has SetCompatMap and SetDeviceInfo with two lists each. xinput has XIPassiveGrabDevice and SendExtensionEvent with two lists. Hm, yeah....

Another thing that came to my mind while looking through the XML: This proposal means that we now need #[repr(C)] and have to insert padding where needed etc. Oh and the Request structs need new fields for the automatically-set things like major request number and the length field. That's more complication than I expected...

Anyway, what I really wanted to point at is https://bugs.freedesktop.org/show_bug.cgi?id=23403 and https://gitlab.freedesktop.org/xorg/lib/libxcb/-/issues/36. I think your proposal does not have this problem, but I just wanted to point out that even libxcb did not manage to get (the C equivalent of) the #[repr] business right. As Josh wrote ten years ago:

Yeah. Normally, the X protocol padding rules always agree with compiler padding rules, but not when a struct ends with a smaller-than-4-byte field.

This is one thing that I always liked about x11rb: I can be quite confident that we do not have any "the compiler decided to screw us up"-problems since we do all the parsing/serialising "by hand" (which of course means more code that basically does not do anything....).

So, yeah... dunno what to do, but I'd definitely hope for some strong benchmark results indicating that such a change is worthwhile. After all, we are doing I/O and that should be a lot slower than any of the temporary memory allocations on the way there (except for that useless copy in the current Request trait, that one I consider an obvious bug that I just do not know how to fix).

notgull commented 2 years ago

So, yeah... dunno what to do, but I'd definitely hope for some strong benchmark results indicating that such a change is worthwhile.

Yeah, I definitely think we should get benchmarks up and running first. I just wanted to discuss what a potential trait design could look like.

psychon commented 2 years ago

I just wanted to discuss what a potential trait design could look like.

Sure!

My previous attempts / thoughts can be found in #656 and #657. Quote from #657:

Compared to the original idea from https://github.com/psychon/x11rb/issues/656, I changed the result type to use Vec instead of PiecewiseBuf<'static>. Request sending needs to consume the struct since FDs need to be consumed, but that means that 'static is the only possible life time. At that point, I am not sure whether the "picewise buffers" really help much. We would still have to clone the "big argument" to PutImage to get rid of the 'input lifetime, so putting everything into a single buffer is not all that much worse, I guess.

As always, FDs make everything complicated. ;-)

(In case the above is not clear: For a request with FDs, the request struct contains the FD. Since FDs are neither Copy nor Clone, serialising has to either dup() the FD (ugly and I would count this as "does I/O", so does not belong into x11rb-protocol) or has to consume the request struct. And since it is a trait, this means that all trait implementations have to do this. And thus we cannot borrow from the request struct any more, since it consumed. Everyone is sad.)

Now that I have written this, I wonder whether one could move the problem to the impl:

impl Request for RequestWithFDs { stuff }
impl Request for &RequestWithoutFDs { stuff }

By implementing the trait for a reference to a struct, one can work around the "consumes the struct". However.... I don't think the lifetimes would work out, would they?

notgull commented 2 years ago

With my previous suggestion of my IntoSerializable trait above, I think we could do something to the tune of:

pub trait IntoSerializable {
    type Target: Serializable;

    fn into_serializable(self, fds: &mut Vec<RawFdContainer>) -> Self::Target;
}

Where the caller would move the file descriptors into fds before being converted into a more easily serializable form.

psychon commented 2 years ago

So, for your example from above, that would mean:

struct FooRequest<'input> {
    foo: u32,
    bar: u32,
    fd: RawFdContainer,
    my_list: Cow<'input, [OtherThing]>,
}

struct FooRequestSer<'input> {
    body: FooBody,
    my_list: Cow<'input, [OtherThing]>,
}

impl<'input> IntoSerializable for FooRequest<'input> {
    type Target = FooRequestSer<'input>;
    fn into_serializable(self, fds: fds: &mut Vec<RawFdContainer>) -> Self::Target {
        fds.push(self.fd);
        Self::Target {
            body: FooBody {
                foo: self.foo,
                bar: self.bar,
            },
            my_list: self.my_list
        }
    }
}

#[derive(bytemuck::Pod, bytemuck::Zeroable)]
#[repr(C)]
struct FooBody {
    foo: u32,
    bar: u32,
}

Right? So... first we have FooRequest. IntoSerializable would then remove any FDs from that (and for things without an FD, it could just return self, I guess). Then FooRequestSer implements some Serialize trait that would then split things up again into "pieces" where each piece can somehow be cast to bytes.

This could possibly end up in a lot of code, but might work, I guess. If you want, you can experiment some more with this. A self-contained, compilable example would be nice, I guess.

I wonder how well the compiler would manage to optimise all those "uselessly move data around" away...

Anyway, gotta go.

notgull commented 2 years ago

I wonder how well the compiler would manage to optimise all those "uselessly move data around" away...

I actually did some experiments with this on Godbolt. Anything less than a pointer in size is usually a no-op, and anything more turns into a mov or three.

notgull commented 2 years ago

Another potential benefit I noticed while analyzing libcairo's source code: vectored I/O can be used to effectively crop and image without using an intermediate buffer: https://github.com/freedesktop/cairo/blob/master/src/cairo-xcb-connection-core.c#L146-L224

psychon / x11rb

Leveraging vectored I/O during serialization #687