psychon / x11rb

X11 bindings for the Rust programming language, similar to how XCB provides the X11 C bindings
Apache License 2.0

Can we get smaller binary code? #488

Open psychon opened 4 years ago

psychon commented 4 years ago

https://github.com/xi-editor/druid/pull/1025#issuecomment-644448962

> According to cargo bloat, Event::parse is more than 9kb of code (this is only with randr). Then there's also a bunch of things like KeyPressEvent::try_parse at around the 3kb mark. I bet that rustc is doing a lot of inlining, and that the error-handling code could be de-duplicated.

I'm not sure I understand this correctly (is it "there are multiple things, e.g. `KeyPressEvent::try_parse`, that are 3kb in size in total" or is it "`KeyPressEvent::try_parse` is 3kb in size"?), but I cannot really reproduce this.

I copied together some self-contained code for `KeyPressEvent::try_parse`:

```rust
use std::convert::TryInto;

pub enum ParseError {
    ParseError,
}

pub type Window = u32;
pub type Pixmap = u32;
pub type Cursor = u32;
pub type Font = u32;
pub type Gcontext = u32;
pub type Colormap = u32;
pub type Atom = u32;
pub type Drawable = u32;
pub type Fontable = u32;
pub type Bool32 = u32;
pub type Visualid = u32;
pub type Timestamp = u32;
pub type Keysym = u32;
pub type Keycode = u8;
pub type Keycode32 = u32;
pub type Button = u8;

/// A type implementing this trait can be parsed from some raw bytes.
pub trait TryParse: Sized {
    /// Try to parse the given values into an instance of this type.
    ///
    /// If parsing is successful, an instance of the type and a slice for the remaining data should
    /// be returned. Otherwise, an error is returned.
    fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError>;
}

macro_rules! implement_try_parse {
    ($t:ty) => {
        impl TryParse for $t {
            fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
                let len = std::mem::size_of::<$t>();
                let bytes = value
                    .get(..len)
                    .ok_or(ParseError::ParseError)?
                    .try_into() // TryInto<[u8; len]>
                    .unwrap();
                Ok((<$t>::from_ne_bytes(bytes), &value[len..]))
            }
        }
    };
}

impl TryParse for bool {
    fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        let (data, remaining) = u8::try_parse(value)?;
        Ok((data != 0, remaining))
    }
}

implement_try_parse!(u8);
implement_try_parse!(i8);
implement_try_parse!(u16);
implement_try_parse!(i16);
implement_try_parse!(u32);
implement_try_parse!(i32);
implement_try_parse!(u64);
implement_try_parse!(i64);

pub struct KeyPressEvent {
    pub response_type: u8,
    pub detail: Keycode,
    pub sequence: u16,
    pub time: Timestamp,
    pub root: Window,
    pub event: Window,
    pub child: Window,
    pub root_x: i16,
    pub root_y: i16,
    pub event_x: i16,
    pub event_y: i16,
    pub state: u16,
    pub same_screen: bool,
}

impl TryParse for KeyPressEvent {
    fn try_parse(initial_value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        let remaining = initial_value;
        let (response_type, remaining) = u8::try_parse(remaining)?;
        let (detail, remaining) = Keycode::try_parse(remaining)?;
        let (sequence, remaining) = u16::try_parse(remaining)?;
        let (time, remaining) = Timestamp::try_parse(remaining)?;
        let (root, remaining) = Window::try_parse(remaining)?;
        let (event, remaining) = Window::try_parse(remaining)?;
        let (child, remaining) = Window::try_parse(remaining)?;
        let (root_x, remaining) = i16::try_parse(remaining)?;
        let (root_y, remaining) = i16::try_parse(remaining)?;
        let (event_x, remaining) = i16::try_parse(remaining)?;
        let (event_y, remaining) = i16::try_parse(remaining)?;
        let (state, remaining) = u16::try_parse(remaining)?;
        let (same_screen, remaining) = bool::try_parse(remaining)?;
        let remaining = remaining.get(1..).ok_or(ParseError::ParseError)?;
        let result = KeyPressEvent {
            response_type, detail, sequence, time, root, event, child,
            root_x, root_y, event_x, event_y, state, same_screen,
        };
        let _ = remaining;
        let remaining = initial_value.get(32..).ok_or(ParseError::ParseError)?;
        Ok((result, remaining))
    }
}
```

The resulting compiler output with `-Copt-level=3 --edition=2018` is 41 KiB of text (according to https://rust.godbolt.org).

I cannot easily see the binary size, but here is the assembly for `KeyPressEvent::try_parse`:

```asm
::try_parse:
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 1
        mov     rax, rdi
        test    rdx, rdx
        je      .LBB1_16
        cmp     rdx, 1
        je      .LBB1_16
        mov     rcx, rdx
        and     rcx, -2
        cmp     rcx, 2
        je      .LBB1_16
        mov     rdi, rdx
        and     rdi, -4
        cmp     rdi, 4
        je      .LBB1_16
        cmp     rdi, 8
        je      .LBB1_16
        cmp     rdi, 12
        je      .LBB1_16
        cmp     rdi, 16
        je      .LBB1_16
        cmp     rcx, 20
        je      .LBB1_16
        cmp     rcx, 22
        je      .LBB1_16
        cmp     rcx, 24
        je      .LBB1_16
        cmp     rcx, 26
        je      .LBB1_16
        cmp     rcx, 28
        je      .LBB1_16
        cmp     rdx, 30
        je      .LBB1_16
        cmp     byte ptr [rsi + 30], 0
        setne   cl
        cmp     rdx, 31
        je      .LBB1_16
        cmp     rdx, 32
        jae     .LBB1_15
.LBB1_16:
        mov     byte ptr [rax + 30], 2
.LBB1_17:
        add     rsp, 1
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
.LBB1_15:
        mov     r9b, byte ptr [rsi]
        mov     bl, byte ptr [rsi + 1]
        mov     byte ptr [rsp], bl
        movzx   r10d, word ptr [rsi + 2]
        mov     edi, dword ptr [rsi + 4]
        mov     r11d, dword ptr [rsi + 8]
        mov     ebx, dword ptr [rsi + 12]
        mov     ebp, dword ptr [rsi + 16]
        movzx   r14d, word ptr [rsi + 20]
        movzx   r15d, word ptr [rsi + 22]
        movzx   r12d, word ptr [rsi + 24]
        movzx   r13d, word ptr [rsi + 26]
        movzx   r8d, word ptr [rsi + 28]
        add     rsi, 32
        add     rdx, -32
        mov     dword ptr [rax], edi
        mov     dword ptr [rax + 4], r11d
        mov     dword ptr [rax + 8], ebx
        mov     dword ptr [rax + 12], ebp
        mov     word ptr [rax + 16], r10w
        mov     word ptr [rax + 18], r14w
        mov     word ptr [rax + 20], r15w
        mov     word ptr [rax + 22], r12w
        mov     word ptr [rax + 24], r13w
        mov     word ptr [rax + 26], r8w
        mov     byte ptr [rax + 28], r9b
        mov     bl, byte ptr [rsp]
        mov     byte ptr [rax + 29], bl
        mov     byte ptr [rax + 30], cl
        mov     qword ptr [rax + 32], rsi
        mov     qword ptr [rax + 40], rdx
        jmp     .LBB1_17
```

That's just 90 lines of assembly and it does not call any other code. This can't be 3 KiB of binary code.

Without optimisation, the output is a lot uglier, but I do not think that looking at that output makes sense.

One thing I notice: LLVM managed to merge all the error handling, but it does not notice that it can simplify `if length < 4 then goto error; if length < 8 then goto error;` etc. Adding `if initial_value.len() < 32 { return Err(ParseError::ParseError); }` as a new first line to `KeyPressEvent::try_parse` helps here. The assembly now only has 56 lines. There are some simplifications that I do not immediately understand, but all of this "cmp with small number, then jump" was merged into a single `cmp rdx, 31`. I guess generating something like this "everywhere" in the code generator shouldn't be too hard and should help a lot.
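
To make the suggestion concrete, here is a minimal, self-contained sketch of the pattern (a hypothetical `Point` parser, not x11rb's generated code): with the fixed-size check up front, the optimiser can prove that the later per-field bounds checks always succeed and fold their error branches away.

```rust
use std::convert::TryInto;

pub enum ParseError {
    ParseError,
}

pub struct Point {
    pub x: i16,
    pub y: i16,
}

// Reduced, hypothetical example of the idea above: the single up-front length
// check makes every later `.get(..)` provably succeed, so the "cmp with small
// number, then jump" sequences can collapse into one comparison.
pub fn try_parse_point(initial_value: &[u8]) -> Result<(Point, &[u8]), ParseError> {
    // The generator would emit this guard with the struct's fixed size
    // (4 bytes here, 32 bytes for KeyPressEvent in the real code).
    if initial_value.len() < 4 {
        return Err(ParseError::ParseError);
    }
    let x_bytes = initial_value.get(0..2).ok_or(ParseError::ParseError)?;
    let y_bytes = initial_value.get(2..4).ok_or(ParseError::ParseError)?;
    let x = i16::from_ne_bytes(x_bytes.try_into().unwrap());
    let y = i16::from_ne_bytes(y_bytes.try_into().unwrap());
    Ok((Point { x, y }, &initial_value[4..]))
}
```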

For the timeline: Optimisation just for the sake of optimisation is hard. It makes more sense to take the "size of some program" as a measurement. Thus, I suggest not merging anything for this before the release and instead proceeding carefully.

A goal for optimisation might be to take one of the examples in this repo and check its binary size. For example, `cargo build --release --example xclock_utc` results in a 7.3 MiB binary, which `strip` turns into 503 KiB. After the following patch, the numbers are 7.3 MiB and 499 KiB. That's already 4 KiB less, just by adding more code to the generated code. :-)

```diff
diff --git a/generator/src/generator/namespace.rs b/generator/src/generator/namespace.rs
index c54f996..826744b 100644
--- a/generator/src/generator/namespace.rs
+++ b/generator/src/generator/namespace.rs
@@ -2560,6 +2560,9 @@ impl<'ns, 'c> NamespaceGenerator<'ns, 'c> {
                     if parse_size_constraint != StructSizeConstraint::None {
                         outln!(out, "let remaining = initial_value;");
                     }
+                    if let StructSizeConstraint::Fixed(size) = parse_size_constraint {
+                        outln!(out, "if remaining.len() < {} {{ return Err(ParseError::ParseError); }}", size);
+                    }
                     Self::emit_let_value_for_dynamic_align(fields, out);
                     for field in fields.iter() {
                         self.emit_field_parse(
```

CC @jneem: I'd appreciate your input here. (I have never worked with cargo bloat before.) One quick question: did you use `cargo build --release`, or did I perhaps misunderstand you?

psychon commented 4 years ago
Random ramblings:

> Event::parse is more than 9kb of code (this is only with randr)

I took a look at the output of `objdump -S target/release/examples/xclock_utc` (this is without any extensions enabled). Looks like a lot was inlined into this function. For master, I get 1431 lines of assembly code. This is surely less than 9 KiB of binary code, but I do not have randr enabled. With the patch from my last comment... uhm... nothing changed (well, the number of lines did not change). Weird.

With `cargo build --release --example xclock_utc --features all-extensions`, this function turns into a 6537-line monster. Hm... When scrolling through this, a lot of the code looks like bad versions of `memcpy`, i.e. stuff like this:

```
2f87e: 8b 8c 24 97 00 00 00    mov    0x97(%rsp),%ecx
2f885: 89 8c 24 67 02 00 00    mov    %ecx,0x267(%rsp)
2f88c: 48 8b 8c 24 90 00 00    mov    0x90(%rsp),%rcx
2f893: 00
2f894: 48 89 8c 24 60 02 00    mov    %rcx,0x260(%rsp)
2f89b: 00
2f89c: 8b 8c 24 67 02 00 00    mov    0x267(%rsp),%ecx
2f8a3: 89 8c 24 87 01 00 00    mov    %ecx,0x187(%rsp)
2f8aa: 48 8b 8c 24 60 02 00    mov    0x260(%rsp),%rcx
2f8b1: 00
2f8b2: 48 89 8c 24 80 01 00    mov    %rcx,0x180(%rsp)
2f8b9: 00
2f8ba: 48 8b 8c 24 80 01 00    mov    0x180(%rsp),%rcx
2f8c1: 00
2f8c2: 48 89 8c 24 d3 01 00    mov    %rcx,0x1d3(%rsp)
2f8c9: 00
2f8ca: 8b 8c 24 87 01 00 00    mov    0x187(%rsp),%ecx
2f8d1: 89 8c 24 da 01 00 00    mov    %ecx,0x1da(%rsp)
2f8d8: 48 8b 8c 24 d0 01 00    mov    0x1d0(%rsp),%rcx
2f8df: 00
2f8e0: 49 89 4e 01             mov    %rcx,0x1(%r14)
2f8e4: 48 8b 8c 24 d6 01 00    mov    0x1d6(%rsp),%rcx
2f8eb: 00
2f8ec: 49 89 4e 07             mov    %rcx,0x7(%r14)
2f8f0: 41 88 46 0f             mov    %al,0xf(%r14)
```

Could this come from lines like `xproto::BUTTON_PRESS_EVENT => return Ok(Self::ButtonPress(event.try_into()?)),` where the compiler copies the result from `event.try_into()?` into the newly constructed enum?

Adding the following to `Cargo.toml`

```toml
[profile.release]
opt-level = 'z'
```

turns that 6537-line monster into a 3442-line monster. And there are now 44 calls to `memcpy` (the old code had 26). No idea where 44 comes from. And the assembly still contains unrolled `memcpy`s... :-/

Could it be that this big code size comes from "moving large things into an enum with many variants"?

Edit: Seems like this kind of pattern does indeed generate way too much code: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=ee3bd9d4445b7657ee94dfec319a88ee
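
For reference, here is a reduced, hypothetical example of the pattern in question (illustrative types and opcodes, not the playground code or the generated protocol code): each match arm parses into a temporary and then moves it into the enum, which can cost one memcpy-like copy per arm.

```rust
use std::convert::{TryFrom, TryInto};

// A large, fixed-size payload standing in for a real parsed event structure.
pub struct BigEvent {
    pub data: [u8; 32],
}

impl<'a> TryFrom<&'a [u8]> for BigEvent {
    type Error = ();

    fn try_from(value: &'a [u8]) -> Result<Self, ()> {
        let data = value.get(..32).ok_or(())?.try_into().map_err(|_| ())?;
        Ok(BigEvent { data })
    }
}

pub enum Event {
    KeyPress(BigEvent),
    ButtonPress(BigEvent),
    MotionNotify(BigEvent),
    // ... the real generated enum has one variant per event type
}

impl Event {
    // Each arm parses into a temporary and then moves that temporary into the
    // enum's storage; the compiler may lower every such move to its own
    // (possibly unrolled) memcpy, which adds up over many variants.
    pub fn parse(kind: u8, bytes: &[u8]) -> Result<Self, ()> {
        match kind {
            2 => Ok(Self::KeyPress(bytes.try_into()?)),
            4 => Ok(Self::ButtonPress(bytes.try_into()?)),
            6 => Ok(Self::MotionNotify(bytes.try_into()?)),
            _ => Err(()),
        }
    }
}
```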

Edit: Yup. https://stackoverflow.com/questions/39219961/how-to-get-assembly-output-from-building-with-cargo taught me to use `cargo rustc --release -- --emit asm`. That is a lot more readable than the objdump output. Plus, I can actually see the jump table that is used. So this code does indeed have this structure:

preparation
look up entry in jump table
jump
case 1: well, this calls `Error::parse` and is special.
case 2:
  shuffle registers around
  call `KeyPressEvent::try_from`
  lots of code to copy around memory
  check if it failed
  more memory is copied around
  done
  [This is 28 lines of assembly for doing almost nothing]
case 3:
  Basically the same as case 2, but duplicated. The same for all the following cases
[More code follows, e.g. error handling, handling generic events, handling the "Unknown" case]

Edit: Hm... This code does not look like it went through compiler optimisations. Nothing is inlined in `KeyPressEvent::try_parse`. Does `--release` not work for `cargo rustc`?

Edit: `cargo rustc --release -- -Copt-level=s --emit asm` does better.

Edit: `as -o /tmp/t.o the-file-generated-by-the-previous-command.s` can generate an object file. That way, one can look at the binary size (and run `size` on the file). Running `strip /tmp/t.o` still makes some progress: the file size goes from 2.4 MiB to 1.3 MiB.

jneem commented 4 years ago

Thanks for looking into this! It turns out that I wasn't using `--release` (I had assumed that cargo bloat defaults to it, because that would make sense...). My test cases have been the examples in druid: clone this and then do `cargo bloat --release --crates --example flex --features=x11`.

After adding `--release`, x11rb is taking up 5.5% of `.text` (before that, it was 12.7%). There's something puzzling, though, because total binary sizes went up 100kb after adding x11rb, but cargo bloat only shows x11rb taking up 50kb. I will look into it more.

You can see the size of individual functions with `cargo bloat --release --filter x11rb --example flex --features=x11`. After adding release, the biggest is `Event::parse` with 6.8K. The `try_parse`s have gone down by a lot (`Setup::try_parse` is the biggest, at 1.5K).

psychon commented 4 years ago

> There's something puzzling, though, because total binary sizes went up 100kb after adding x11rb,

Random data point to make me feel better: libxcb.so is 163 KiB here and libxcb-randr.so is 67 KiB. Thus, switching to x11rb is already a net reduction for you. ;-) (No, I am not serious; libxcb is already installed "everywhere" and thus does not really count towards the size of your binary.)

jneem commented 4 years ago

Actually, it's worse than that: I need XCBConnection for cairo interop, so I'm linking libxcb.so anyway :shrug:

psychon commented 4 years ago

> After adding release, the biggest is Event::parse with 6.8K. The try_parses have gone down by a lot (Setup::try_parse is the biggest, at 1.5K)

I am not sure how useful cargo bloat pointing at random functions really is for us. It seems to me this always points at whatever function LLVM chose not to inline into its caller, but whose callees were inlined.

But okay, `Setup::try_parse` it is. That function has 373 lines of assembly for me (using `objdump -S` again). Most of this code is moving bytes around from somewhere in memory to somewhere on the stack. Perhaps we should just `Box` things up to avoid so much copying around on the stack? I do not really have any good ideas about this.
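
For what it's worth, here is a minimal sketch of the "Box things up" idea (hypothetical types, not a concrete proposal for the generated API): the parser allocates once, and everything downstream only moves a pointer instead of the whole struct.

```rust
use std::convert::TryInto;

pub enum ParseError {
    ParseError,
}

// Stand-in for a large parsed structure like Setup.
pub struct BigReply {
    pub payload: [u8; 256],
}

// Returning Box<BigReply> means that moving the result around (into enums,
// through `?`, across function returns) only copies a pointer, at the price
// of one heap allocation per parsed value.
pub fn try_parse_boxed(value: &[u8]) -> Result<(Box<BigReply>, &[u8]), ParseError> {
    let payload: [u8; 256] = value
        .get(..256)
        .ok_or(ParseError::ParseError)?
        .try_into()
        .map_err(|_| ParseError::ParseError)?;
    Ok((Box::new(BigReply { payload }), &value[256..]))
}
```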

My attempts at smaller code size so far are documented in #491 and #492. Keeping each idea in its own PR perhaps leads to less of a mess than my comments above. The result is not much of a reduction.

psychon commented 4 years ago

Another random idea: All the event parsing code is "likely" only called by `Event::parse` (only "likely" because it is public, but I fail to see why anyone would want to call specific bits directly). Some well-placed `#[inline]` annotations could end up inlining all the event parsing into `Event::parse`. The resulting "monster function" could then hopefully be optimised well enough that the pointless copies disappear and the parsed events are written directly into their final position.

Edit: Google suggests that LTO is a better idea than `#[inline]`.

Edit: LTO seems to help. With `x11rb = "0.6"` and

```toml
[profile.release]
opt-level = 'z'  # Optimize for size.
lto = true
```

a simple `println!("Hello world");` results in a 175 KiB binary with `cargo build --release` and `strip` run on the binary. When parsing an event as follows, the binary is 203 KiB, so only 28 KiB extra (which is still a lot for "doing nothing"). Without LTO, the numbers are 195 KiB / 247 KiB, so 52 KiB extra for x11rb parsing an event.

```rust
use x11rb::connection::RequestConnection as _;
use x11rb::rust_connection::RustConnection;

// Conjure a value "out of thin air" so that the event-parsing code is
// referenced (and thus kept in the binary) without setting up a real
// connection; the volatile read keeps the optimiser from assuming anything
// about the value.
fn out_of_thin_air<T>() -> T {
    unsafe {
        std::ptr::read_volatile(0x1337 as _)
    }
}

fn main() {
    if false {
        println!("Hello World");
    } else {
        let conn: RustConnection = out_of_thin_air();
        let event_bytes: &[u8] = out_of_thin_air();
        println!("{}", conn.parse_event(event_bytes).is_ok());
    }
}
```

psychon commented 3 years ago

Some numbers to see how the new release is doing. I built the xclock_utc example (let's call that build "normal"). I also did a build with the following in Cargo.toml (let's call that "optimised"):

```toml
[profile.release]
opt-level = 'z'
lto = true
codegen-units = 1
panic = 'abort'
```

Results are: I need a different benchmark (well, 4 KiB less in one case)

```
$ ls -Ss | cat
total 5576
1060 xclock_utc_0.7_normal
1060 xclock_utc_0.6_normal
 840 xclock_utc_0.7_optimised
 836 xclock_utc_0.6_optimised
 512 xclock_utc.stripped_0.6_normal
 512 xclock_utc.stripped_0.7_normal
 380 xclock_utc.stripped_0.7_optimised
 376 xclock_utc.stripped_0.6_optimised
```
psychon commented 1 year ago

Random data point from https://github.com/neXromancers/shotgun/pull/40 (thanks @9ary): This is some code that doesn't do any event parsing, so most of what we already looked at in this issue doesn't apply. Only two functions from x11rb show up as being large:

```
 File  .text     Size          Crate Name
 0.1%   1.2%   6.0KiB          x11rb x11rb::rust_connection::RustConnection::connect
 0.1%   1.1%   5.8KiB x11rb_protocol x11rb_protocol::protocol::request_name
```

`request_name()` is used in parsing X11 errors (`x11rb_protocol::x11_utils::X11Error::try_parse()`) so that the `Debug` impl prints the name of the request and not just some random numbers. I'm actually surprised that this turns into so much binary code with just the randr feature enabled.
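
For illustration, a hypothetical sketch of the shape of such a function (the real `request_name` has a different signature and covers every request of every enabled extension): the match itself is simple, but it pulls a jump table plus all of the name strings into the binary.

```rust
// Hypothetical sketch, not the actual generated function: mapping opcodes to
// request names needs one arm per request, and every &'static str it returns
// ends up in the binary even if only the Debug impl ever uses it.
pub fn request_name(major_opcode: u8) -> Option<&'static str> {
    match major_opcode {
        1 => Some("CreateWindow"),
        2 => Some("ChangeWindowAttributes"),
        3 => Some("GetWindowAttributes"),
        4 => Some("DestroyWindow"),
        // ... one arm per core request, plus the requests of every enabled
        // extension (RandR alone adds several dozen arms)
        _ => None,
    }
}
```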

And `connect()` is likely the result of a lot of inlining. This would then be all the code to parse the `$DISPLAY` environment variable, to parse `~/.Xauthority`, to send the connection setup and to receive & parse the `Setup` from the X11 server. I'm not saying that this size is fine, but at least I can understand why it has the size that it has.

Edit: Poor man's cargo bloat:

```
objdump -S target/release/examples/simple_window | c++filt | awk '/^0/ { print NR - last_start, last; last=$0 ; last_start = NR }' | sort -n
```

For current master this ends with

```
2217 0000000000152d50 <miniz_oxide::inflate::core::decompress::h563e858fbd038acc>:
2317 00000000000faba0 <x11rb_protocol::protocol::get_request_name::hd6cae530618148e9>:
6627 00000000000fdc10 <x11rb_protocol::protocol::request_name::hdaabd4a928272e17>:
```

With #838 this ends with

```
2217 000000000014aff0 <miniz_oxide::inflate::core::decompress::h563e858fbd038acc>:
2592 00000000000f9af0 <x11rb_protocol::protocol::get_request_name_internal::h3d92f71a778b3242>:
```

So... that PR gets rid of the largest function and the second-largest function doesn't get much larger.

(Yes, number of lines of output is a bad proxy for binary size - I am counting the number of instructions here and not the size of the function.)