Open notgull opened 2 years ago
Since you are talking about optimisations, I fear I would want some benchmarks on this. Something that uses criterion to serialise a request and compares that against a hand-written solution? What would be a "bad" request where this matters a lot?
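To make the comparison concrete, the two strategies being discussed could be benchmarked against each other (e.g. with criterion); as a minimal std-only sketch, with all names hypothetical and not x11rb's actual API, the "copy everything" path versus a vectored write might look like:

```rust
use std::io::{IoSlice, Write};

// Hypothetical request: a 4-byte header followed by a large payload.
fn serialize_copying(header: &[u8; 4], payload: &[u8]) -> Vec<u8> {
    // "Just copy everything": one allocation plus a memcpy of the payload.
    let mut buf = Vec::with_capacity(4 + payload.len());
    buf.extend_from_slice(header);
    buf.extend_from_slice(payload);
    buf
}

fn serialize_vectored(
    header: &[u8; 4],
    payload: &[u8],
    out: &mut impl Write,
) -> std::io::Result<usize> {
    // Vectored: hand both pieces to the writer without merging them first.
    out.write_vectored(&[IoSlice::new(header), IoSlice::new(payload)])
}

fn main() -> std::io::Result<()> {
    let header = [1u8, 0, 0, 42];
    let payload = vec![0xabu8; 1 << 20]; // a PutImage-sized body
    let copied = serialize_copying(&header, &payload);
    let mut vectored = Vec::new();
    serialize_vectored(&header, &payload, &mut vectored)?;
    assert_eq!(copied, vectored); // same wire bytes, different copy behaviour
    Ok(())
}
```

A criterion benchmark would time each function over payload sizes from a few bytes up to megabytes; the interesting question is at what payload size the extra memcpy becomes measurable next to the actual I/O.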
> and the remaining lists are serialized into a Vec<u8> each
What would be a request where this matters a lot? My "go to example" for "requests causing problems with lots of data" would be PutImage, and there we directly provide the raw pixel bytes for I/O without needing to copy them:
But this 'input lifetime is a problem that I did not manage to pass through a trait, which is why the serialize method is outside of a trait. Then.... I basically gave up and did the horrible thing of "just copy everything":
So, I bet it would be quite simple to write code that does better than this serialize trait impl. Dunno how much the allocations in the non-trait serialize really cost, but it should be possible to do better than that.
GetImage likely is a bad example for this kind of experiment. I just looked through xproto.xml and did not find a single request with more than one list. xkb has SetCompatMap and SetDeviceInfo with two lists each. xinput has XIPassiveGrabDevice and SendExtensionEvent with two lists. Hm, yeah....
Another thing that came to my mind while looking through the XML: This proposal means that we now need #[repr(C)] and have to insert padding where needed etc. Oh, and the Request structs need new fields for the automatically-set things like the major request number and the length field. That's more complication than I expected...
Anyway, what I really wanted to point at is https://bugs.freedesktop.org/show_bug.cgi?id=23403 and https://gitlab.freedesktop.org/xorg/lib/libxcb/-/issues/36. I think your proposal does not have this problem, but I just wanted to point out that even libxcb did not manage to get (the C equivalent of) the #[repr] business right. As Josh wrote ten years ago:
> Yeah. Normally, the X protocol padding rules always agree with compiler padding rules, but not when a struct ends with a smaller-than-4-byte field.
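As a concrete illustration of that mismatch (a hypothetical struct, not taken from x11rb or libxcb): when every field is smaller than 4 bytes, the compiler's trailing padding follows the struct's alignment, not the X11 rule of padding to a 4-byte boundary, so sizeof and the wire size disagree.

```rust
// Hypothetical request tail made only of 1-byte fields.
#[repr(C)]
struct Tail {
    kind: u8,
    detail: u8,
    flag: u8,
}

fn main() {
    // Alignment is 1, so the compiler adds no trailing padding...
    assert_eq!(std::mem::align_of::<Tail>(), 1);
    assert_eq!(std::mem::size_of::<Tail>(), 3);
    // ...but the X11 wire format would pad this to 4 bytes, so a
    // sizeof-based serializer would write one byte too few.
}
```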
This is one thing that I always liked about x11rb: I can be quite confident that we do not have any "the compiler decided to screw us up"-problems since we do all the parsing/serialising "by hand" (which of course means more code that basically does not do anything....).
So, yeah... dunno what to do, but I'd definitely hope for some strong benchmark results indicating that such a change is worthwhile. After all, we are doing I/O and that should be a lot slower than any of the temporary memory allocations on the way there (except for that useless copy in the current Request trait, that one I consider an obvious bug that I just do not know how to fix).
> So, yeah... dunno what to do, but I'd definitely hope for some strong benchmark results indicating that such a change is worthwhile.
Yeah, I definitely think we should get benchmarks up and running first. I just wanted to discuss what a potential trait design could look like.
> I just wanted to discuss what a potential trait design could look like.
Sure!
My previous attempts / thoughts can be found in #656 and #657. Quote from #657:
> Compared to the original idea from https://github.com/psychon/x11rb/issues/656, I changed the result type to use Vec instead of PiecewiseBuf<'static>. Request sending needs to consume the struct since FDs need to be consumed, but that means that 'static is the only possible lifetime. At that point, I am not sure whether the "piecewise buffers" really help much. We would still have to clone the "big argument" to PutImage to get rid of the 'input lifetime, so putting everything into a single buffer is not all that much worse, I guess.
As always, FDs make everything complicated. ;-)
(In case the above is not clear: For a request with FDs, the request struct contains the FD. Since FDs are neither Copy nor Clone, serialising has to either dup() the FD (ugly, and I would count this as "does I/O", so it does not belong into x11rb-protocol) or has to consume the request struct. And since it is a trait, this means that all trait implementations have to do this. And thus we cannot borrow from the request struct any more, since it is consumed. Everyone is sad.)
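A minimal sketch of that constraint, with all names made up (RawFdContainer here is just a stand-in for an owned file descriptor):

```rust
// Stand-in for an owned FD: deliberately neither Copy nor Clone.
struct RawFdContainer(i32);

struct FdRequest {
    fd: RawFdContainer,
    data: Vec<u8>,
}

trait SerializeRequest {
    // Taking `self` by value is what lets us move the FD out...
    fn serialize(self, fds: &mut Vec<RawFdContainer>) -> Vec<u8>;
}

impl SerializeRequest for FdRequest {
    fn serialize(self, fds: &mut Vec<RawFdContainer>) -> Vec<u8> {
        fds.push(self.fd); // a move, not a dup(): no I/O needed
        // ...but since `self` is consumed, we can no longer return
        // borrows of `self.data`; the bytes must be returned by value.
        self.data
    }
}

fn main() {
    let req = FdRequest { fd: RawFdContainer(5), data: vec![1, 2, 3] };
    let mut fds = Vec::new();
    let bytes = req.serialize(&mut fds);
    assert_eq!(bytes, vec![1, 2, 3]);
    assert_eq!(fds.len(), 1);
}
```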
Now that I have written this, I wonder whether one could move the problem to the impl:
impl Request for RequestWithFDs { stuff }
impl Request for &RequestWithoutFDs { stuff }
By implementing the trait for a reference to a struct, one can work around the "consumes the struct". However.... I don't think the lifetimes would work out, would they?
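For the FD-less case, the reference impl does at least compile; a sketch with illustrative names (the lifetime question remains whether the output could borrow from the request instead of cloning):

```rust
struct PlainRequest {
    payload: Vec<u8>,
}

trait Request {
    fn serialize(self) -> Vec<u8>;
}

// `Self` is `&PlainRequest`, so "consuming" self only consumes the
// reference; the struct itself survives the call.
impl Request for &PlainRequest {
    fn serialize(self) -> Vec<u8> {
        // Cloning dodges the lifetime question; returning `&[u8]` tied
        // to this borrow is exactly the part that is hard to express
        // in a single trait that also covers FD-carrying requests.
        self.payload.clone()
    }
}

fn main() {
    let req = PlainRequest { payload: vec![9, 8, 7] };
    let bytes = (&req).serialize();
    assert_eq!(bytes, vec![9, 8, 7]);
    assert_eq!(req.payload.len(), 3); // req is still usable afterwards
}
```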
With my previous suggestion of an IntoSerializable trait above, I think we could do something to the tune of:
pub trait IntoSerializable {
type Target: Serializable;
fn into_serializable(self, fds: &mut Vec<RawFdContainer>) -> Self::Target;
}
Here, into_serializable moves the file descriptors into fds before the request is converted into a more easily serializable form.
So, for your example from above, that would mean:
struct FooRequest<'input> {
foo: u32,
bar: u32,
fd: RawFdContainer,
my_list: Cow<'input, [OtherThing]>,
}
struct FooRequestSer<'input> {
body: FooBody,
my_list: Cow<'input, [OtherThing]>,
}
impl<'input> IntoSerializable for FooRequest<'input> {
type Target = FooRequestSer<'input>;
    fn into_serializable(self, fds: &mut Vec<RawFdContainer>) -> Self::Target {
fds.push(self.fd);
        FooRequestSer {
body: FooBody {
foo: self.foo,
bar: self.bar,
},
my_list: self.my_list
}
}
}
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
#[repr(C)]
struct FooBody {
foo: u32,
bar: u32,
}
Right? So... first we have FooRequest. IntoSerializable would then remove any FDs from that (and for things without an FD, it could just return self, I guess). Then FooRequestSer implements some Serialize trait that would then split things up again into "pieces" where each piece can somehow be cast to bytes.
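That splitting step might look roughly like the following sketch (all names illustrative; body_bytes hand-rolls what bytemuck::bytes_of does behind a checked trait, and the list element type is simplified to u8 so it can be cast directly):

```rust
use std::borrow::Cow;

// Fixed-size, padding-free body: safe to view as raw bytes.
#[repr(C)]
#[derive(Clone, Copy)]
struct FooBody {
    foo: u32,
    bar: u32,
}

struct FooRequestSer<'input> {
    body: FooBody,
    my_list: Cow<'input, [u8]>,
}

fn body_bytes(body: &FooBody) -> &[u8] {
    // Sound here because FooBody is #[repr(C)] with no padding;
    // bytemuck::bytes_of encapsulates this same cast safely.
    unsafe {
        std::slice::from_raw_parts(
            body as *const FooBody as *const u8,
            std::mem::size_of::<FooBody>(),
        )
    }
}

// The "pieces": one slice per contiguous region, ready for vectored I/O.
fn pieces<'a>(req: &'a FooRequestSer<'_>) -> [&'a [u8]; 2] {
    [body_bytes(&req.body), &req.my_list]
}

fn main() {
    let req = FooRequestSer {
        body: FooBody { foo: 1, bar: 2 },
        my_list: Cow::Borrowed(&[0xff, 0xee]),
    };
    let [body, list] = pieces(&req);
    assert_eq!(body.len(), 8);
    assert_eq!(list, &[0xff, 0xee]);
}
```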
This could possibly end up in a lot of code, but might work, I guess. If you want, you can experiment some more with this. A self-contained, compilable example would be nice, I guess.
I wonder how well the compiler would manage to optimise all those "uselessly move data around" away...
Anyway, gotta go.
> I wonder how well the compiler would manage to optimise all those "uselessly move data around" away...
I actually did some experiments with this on Godbolt. Anything less than a pointer in size is usually a no-op, and anything more turns into a mov or three.
Another potential benefit I noticed while analyzing libcairo's source code: vectored I/O can be used to effectively crop an image without using an intermediate buffer: https://github.com/freedesktop/cairo/blob/master/src/cairo-xcb-connection-core.c#L146-L224
Right now, the Serialize trait looks something like this:

Meanwhile, the Request trait has a method like this (ignoring the existence of fds):

In practice, this usually means that for the request, the "body" (i.e. the non-list part of the request) is serialized into one vector, and the remaining lists are serialized into a Vec<u8> each. While this is good, it means that we're spending time serializing and buffering our data, which is a problem that vectored writes were designed to solve.

In breadx, I was experimenting with a trait design like this:

Where num_slices tells the user how many IoSlices the object will use during serialization, and serialize writes those slices into the given buffer. In my experiments I also added a with_slices method that takes a callback with the mutable IoSlices. By default it used the Vec implementation above, but it could be overridden to use an array and completely avoid heap memory.

The main issue with the above, which I discovered during experimentation, is that requests end up having several fields, and dedicating an I/O slice to each one ends up with needing dozens of I/O slices for memory that's mostly contiguous. However, my solution to that was to use bytemuck to convert raw structures to bytes, like so:
What are your thoughts on this? It would certainly be no small amount of work, and I'm not sure it would be worth it, but this is pretty close to what libxcb does through unsafe pointer casting.