Open psychon opened 4 years ago
Could it be that this big code size comes from "moving large things into an enum with many variants"?

Edit: It seems this kind of pattern does indeed generate far too much code: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=ee3bd9d4445b7657ee94dfec319a88ee
Edit: Yup. https://stackoverflow.com/questions/39219961/how-to-get-assembly-output-from-building-with-cargo taught me to use `cargo rustc --release -- --emit asm`. That is a lot more readable than the objdump output. Plus, I can actually see the jump table that is used. So this code does indeed have this structure:
```
preparation
look up entry in jump table
jump
case 1: this calls Error::parse and is special
case 2:
    shuffle registers around
    call KeyPressEvent::try_from
    lots of code to copy memory around
    check if it failed
    more memory is copied around
    done
    [this is 28 lines of assembly for doing almost nothing]
case 3:
    basically the same as case 2, but duplicated; the same for all following cases
[more code follows, e.g. error handling, handling generic events, handling the "Unknown" case]
```
Edit: Hm... this code does not look like it went through compiler optimisations. Nothing is inlined in `KeyPressEvent::try_parse`. Does `--release` not work for `cargo rustc`?

Edit: `cargo rustc --release -- -Copt-level=s --emit asm` does better.

Edit: `as -o /tmp/t.o the-file-generated-by-the-previous-command.s` can generate an object file. That way, one can look at the binary size (and run `size` on the file). Running `strip /tmp/t.o` still makes some progress: the file size goes from 2.4 MiB to 1.3 MiB.
Thanks for looking into this! It turns out that I wasn't using `--release` (I had assumed that `cargo bloat` defaults to it, because that would make sense...). My test cases have been the examples in druid: clone it and then run `cargo bloat --release --crates --example flex --features=x11`.

After adding `--release`, `x11rb` is taking up 5.5% of `.text` (before that, it was 12.7%). There's something puzzling, though: total binary size went up 100 kB after adding `x11rb`, but `cargo bloat` only shows `x11rb` taking up 50 kB. I will look into it more.

You can see the size of individual functions with `cargo bloat --release --filter x11rb --example flex --features=x11`. After adding `--release`, the biggest is `Event::parse` with 6.8K. The `try_parse`s have gone down by a lot (`Setup::try_parse` is the biggest, at 1.5K).
> There's something puzzling, though, because total binary sizes went up 100kb after adding x11rb
Random data point to make me feel better: `libxcb.so` is 163 KiB here and `libxcb-randr.so` has 67 KiB. Thus, switching to x11rb is already a net reduction for you. ;-)

(No, I am not serious; libxcb is already installed "everywhere" and thus does not really count towards the size of your binary.)

Actually, it's worse than that: I need `XCBConnection` for cairo interop, so I'm linking `libxcb.so` anyway :shrug:
> After adding release, the biggest is Event::parse with 6.8K. The try_parses have gone down by a lot (Setup::try_parse is the biggest, at 1.5K)
I am not sure how useful `cargo bloat` pointing at random functions really is for us. It seems to me this always points at whatever function LLVM chose not to inline into its caller, but whose own callees were inlined.

But okay, `Setup::try_parse` it is. That function has 373 lines of assembly for me (using `objdump -S` again). Most of this code is moving bytes around from somewhere in memory to somewhere on the stack. Perhaps we should just `Box` things up to avoid so much copying-around-on-the-stack? I do not really have any good ideas about this.
My attempts at smaller code size so far are documented in #491 and #492. Keeping each idea in its own PR perhaps leads to less of a mess than my comments above. The result is not much of a reduction.
Another random idea: all the event parsing code is "likely" only called by `Event::parse` (only "likely" because it is public, but I fail to see why anyone would want to call specific bits directly). Some well-placed `#[inline]` annotations could end up inlining all the event parsing into `Event::parse`. The resulting "monster function" could then hopefully be optimised well enough that the pointless copies disappear and the parsed events are written directly into their final position.

Edit: Google suggests that LTO is a better idea than `#[inline]`.
Edit: LTO seems to help. With `x11rb = "0.6"` and

```toml
[profile.release]
opt-level = 'z' # Optimize for size.
lto = true
```

a simple `println!("Hello world");` results in a 175 KiB binary with `cargo build --release` and `strip` run on the binary. When parsing an event as follows, the binary has 203 KiB, so only 28 KiB extra (which is still a lot for "doing nothing"). Without LTO, the numbers are 195 KiB / 247 KiB, i.e. 52 KiB extra for x11rb parsing an event.
```rust
use x11rb::connection::RequestConnection as _;
use x11rb::rust_connection::RustConnection;

fn out_of_thin_air<T>() -> T {
    unsafe {
        std::ptr::read_volatile(0x1337 as _)
    }
}

fn main() {
    if false {
        println!("Hello World");
    } else {
        let conn: RustConnection = out_of_thin_air();
        let event_bytes: &[u8] = out_of_thin_air();
        println!("{}", conn.parse_event(event_bytes).is_ok());
    }
}
```
Some numbers to see how the new release is doing. I built the `xclock_utc` example (let's call that build "normal"). I also did a build with the following in `Cargo.toml` (let's call that "optimised"):

```toml
[profile.release]
opt-level = 'z'
lto = true
codegen-units = 1
panic = 'abort'
```
Results: I need a different benchmark (well, 4 KiB less in one case).

```
$ ls -Ss | cat
total 5576
1060 xclock_utc_0.7_normal
1060 xclock_utc_0.6_normal
 840 xclock_utc_0.7_optimised
 836 xclock_utc_0.6_optimised
 512 xclock_utc.stripped_0.6_normal
 512 xclock_utc.stripped_0.7_normal
 380 xclock_utc.stripped_0.7_optimised
 376 xclock_utc.stripped_0.6_optimised
```
Random data point from https://github.com/neXromancers/shotgun/pull/40 (thanks @9ary): this is code that doesn't do any event parsing, so most of what we already looked at in this issue doesn't apply. Only two functions from x11rb show up as being large:

```
File  .text     Size Crate           Name
0.1%   1.2%   6.0KiB x11rb           x11rb::rust_connection::RustConnection::connect
0.1%   1.1%   5.8KiB x11rb_protocol  x11rb_protocol::protocol::request_name
```
`request_name()` is used when parsing X11 errors (`x11rb_protocol::x11_utils::X11Error::try_parse()`) so that the `Debug` impl prints the name of the request and not just some random numbers. I'm actually surprised that this turns into so much binary code with just the randr feature enabled.
And `connect()` is likely the result of a lot of inlining. This would then be all the code to parse the `$DISPLAY` environment variable, to parse `~/.Xauthority`, to send the connection setup, and to receive & parse the `Setup` from the X11 server. I'm not saying that this size is fine, but at least I can understand why it has the size that it has.
Edit: Poor man's `cargo bloat`:

```sh
objdump -S target/release/examples/simple_window | c++filt | awk '/^0/ { print NR - last_start, last; last=$0 ; last_start = NR }' | sort -n
```

For current master this ends with

```
2217 0000000000152d50 <miniz_oxide::inflate::core::decompress::h563e858fbd038acc>:
2317 00000000000faba0 <x11rb_protocol::protocol::get_request_name::hd6cae530618148e9>:
6627 00000000000fdc10 <x11rb_protocol::protocol::request_name::hdaabd4a928272e17>:
```
With #838 this ends with

```
2217 000000000014aff0 <miniz_oxide::inflate::core::decompress::h563e858fbd038acc>:
2592 00000000000f9af0 <x11rb_protocol::protocol::get_request_name_internal::h3d92f71a778b3242>:
```
So... that PR gets rid of the largest function, and the second-largest function doesn't get much larger.

(Yes, the number of output lines is a bad proxy for binary size: it counts instructions, not the functions' actual byte sizes.)
https://github.com/xi-editor/druid/pull/1025#issuecomment-644448962
I'm not sure I understand this correctly (is it "there are multiple things (e.g. `KeyPressEvent::try_parse`) that are in total 3 kB in size" or is it "`KeyPressEvent::try_parse` is 3 kB in size"?), but I cannot really reproduce. I copied together some self-contained code for `KeyPressEvent::try_parse`:

```rust
use std::convert::TryInto;

pub enum ParseError {
    ParseError,
}

pub type Window = u32;
pub type Pixmap = u32;
pub type Cursor = u32;
pub type Font = u32;
pub type Gcontext = u32;
pub type Colormap = u32;
pub type Atom = u32;
pub type Drawable = u32;
pub type Fontable = u32;
pub type Bool32 = u32;
pub type Visualid = u32;
pub type Timestamp = u32;
pub type Keysym = u32;
pub type Keycode = u8;
pub type Keycode32 = u32;
pub type Button = u8;

/// A type implementing this trait can be parsed from some raw bytes.
pub trait TryParse: Sized {
    /// Try to parse the given values into an instance of this type.
    ///
    /// If parsing is successful, an instance of the type and a slice for the remaining data should
    /// be returned. Otherwise, an error is returned.
    fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError>;
}

macro_rules! implement_try_parse {
    ($t:ty) => {
        impl TryParse for $t {
            fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
                let len = std::mem::size_of::<$t>();
                let bytes = value
                    .get(..len)
                    .ok_or(ParseError::ParseError)?
                    .try_into() // TryInto<[u8; len]>
                    .unwrap();
                Ok((<$t>::from_ne_bytes(bytes), &value[len..]))
            }
        }
    };
}

impl TryParse for bool {
    fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        let (data, remaining) = u8::try_parse(value)?;
        Ok((data != 0, remaining))
    }
}

implement_try_parse!(u8);
implement_try_parse!(i8);
implement_try_parse!(u16);
implement_try_parse!(i16);
implement_try_parse!(u32);
implement_try_parse!(i32);
implement_try_parse!(u64);
implement_try_parse!(i64);

pub struct KeyPressEvent {
    pub response_type: u8,
    pub detail: Keycode,
    pub sequence: u16,
    pub time: Timestamp,
    pub root: Window,
    pub event: Window,
    pub child: Window,
    pub root_x: i16,
    pub root_y: i16,
    pub event_x: i16,
    pub event_y: i16,
    pub state: u16,
    pub same_screen: bool,
}

impl TryParse for KeyPressEvent {
    fn try_parse(initial_value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        let remaining = initial_value;
        let (response_type, remaining) = u8::try_parse(remaining)?;
        let (detail, remaining) = Keycode::try_parse(remaining)?;
        let (sequence, remaining) = u16::try_parse(remaining)?;
        let (time, remaining) = Timestamp::try_parse(remaining)?;
        let (root, remaining) = Window::try_parse(remaining)?;
        let (event, remaining) = Window::try_parse(remaining)?;
        let (child, remaining) = Window::try_parse(remaining)?;
        let (root_x, remaining) = i16::try_parse(remaining)?;
        let (root_y, remaining) = i16::try_parse(remaining)?;
        let (event_x, remaining) = i16::try_parse(remaining)?;
        let (event_y, remaining) = i16::try_parse(remaining)?;
        let (state, remaining) = u16::try_parse(remaining)?;
        let (same_screen, remaining) = bool::try_parse(remaining)?;
        let remaining = remaining.get(1..).ok_or(ParseError::ParseError)?;
        let result = KeyPressEvent {
            response_type, detail, sequence, time, root, event, child,
            root_x, root_y, event_x, event_y, state, same_screen,
        };
        let _ = remaining;
        let remaining = initial_value.get(32..).ok_or(ParseError::ParseError)?;
        Ok((result, remaining))
    }
}
```
The resulting compiler output with `-Copt-level=3 --edition=2018` is 41 KiB of text (according to https://rust.godbolt.org). I cannot easily see the binary size from that, but the assembly for `KeyPressEvent::try_parse` itself (listing omitted here) is just 90 lines and does not call any other code. That can't be 3 KiB of binary code.
Without optimisation, the output is a lot more ugly, but I do not think that looking at that output makes sense.

One thing I notice: LLVM managed to merge all the error handling, but it does not notice that it can simplify `if length < 4 then goto error; if length < 8 then goto error;` etc. Adding `if initial_value.len() < 32 { return Err(ParseError::ParseError); }` as a new first line of `KeyPressEvent::try_parse` helps here. The assembly then has only 56 lines. There are some simplifications that I do not immediately understand, but all of the "`cmp` with a small number, then jump" sequences were merged into a single `cmp rdx, 31`. I guess generating something like this "everywhere" in the code generator shouldn't be too hard and should help a lot.

For the timeline: optimisation just for the sake of optimisation is hard. It makes more sense to take "size of some program" as the measurement. Thus, I suggest not merging anything on this before the release and instead proceeding carefully.
A goal for optimisation might be to take one of the examples in this repo and check its binary size. For example, `cargo build --release --example xclock_utc` results in a 7.3 MiB binary which `strip` turns into 503 KiB. After the following patch, this turns into 7.3 MiB and 499 KiB. That's already 4 KiB less, just by adding more code to the generated code. :-)

CC @jneem, I'd be happy about your input here. (And I have never worked with `cargo bloat` before.) One quick question: did you use `cargo build --release`? Or did I perhaps misunderstand you?