x11rb-protocol takes very long to build

kchibisov commented 1 year ago

During profiling both debug and release builds of both alacritty and winit, I've noticed that x11rb-protocol builds around 5x longer than most massive deps like (serde).

And if you happen to have 2 copies of x11rb-protocol in your tree, it'll really slow things down.

For example alacritty on my system builds in around 20seconds, and each of the x11rb-protocol takes around 12s (I have 32 threads, so it doesn't really matter, but on slow systems it could really decrease perf).

I was testing with the cargo build --timings and cargo build --release --timings on the tip of the alacritty/alacritty repo. You could do that on the winit repo alone for example.

It was used on both rustc 1.72.0 and rustc 1.74.0-nightly (0288f2e19 2023-09-25) showing the exact same compilation times.

psychon commented 1 year ago

Random guess would be "lots of XML for protocol stuff resulting in even more generated Rust code that rustc has to cope with".

For example alacritty on my system builds in around 20seconds, and each of the x11rb-protocol takes around 12s (I have 32 threads, so it doesn't really matter, but on slow systems it could really decrease perf).

I was curious about enabled features, but that is apparently not so easy to figure out just with a web browser.

According to https://github.com/alacritty/alacritty/blob/a58fb39b68caa34b073f66911c0ac6945f56eac2/Cargo.lock, a dependency on x11rb 0.12.0 comes from winit. x11-clipboard depends on x11rb 0.10.1

x11-clipboard just enables xfixes (and apparently has already been updated to 0.12.0 in git): https://github.com/quininer/x11-clipboard/blob/c7fd2a93c93a7690478953db87de849d47ce9f2a/Cargo.toml#L12

winit enables allow-unsafe-code, dl-libxcb, randr, resource_manager, xinput, xkb: https://github.com/rust-windowing/winit/blob/master/Cargo.toml#L155

I am ignoring the non-X11-extension features (since I kinda expect the generated code to be the problem since it really is huge). This leads to the following files being enabled to build in x11rb-protocol/src/protocol/:

135.214 bytes / 3.645 lines in xfixes.rs
285.344 bytes / 7.352 lines in randr.rs
584.999 bytes / 14.709 lines in xkb.rs
742.184 bytes / 19.405 lines in xinput.rs
947.122 bytes / 25.897 lines in xproto.rs For a grand total of of 2.694.863 bytes / 71.008 lines of generated code. (mod.rs in that file also has some more lines that are gates behind these features.)

And random time measurements with cargo check (which of course does less than cargo build, but I am just trying to measure the influence of the extensions). I am running cargo clean && time cargo check -p x11rb-protocol $EXTRA_STUFF:

--no-default-features: 4.9 seconds of wall clock time
default features: 5.0 seconds of wall clock time
--features resource_manager: 5.3 seconds
--features xfixes,randr,xkb,xinput: 15.3 seconds

So, to me this is clearly due to the huge generated code files.

However, that still does not provide any hint on what to do about this. Adding more feature flags doesn't sound like a workable solution.

kchibisov commented 1 year ago

I guess the only way is to reduce the amount of things being generated, since for example wayland stuff is pretty fast to generate on the other side and we use a lot of protocols.

kchibisov commented 1 year ago

I'm not familiar with X11 internals though to say whether what you generate makes sense or not in the end of the day, but the amount of code you say sounds like really a lot (it's like alarcitty + glutin + winit + some wayland crates combined)... Especially when those types have derive impls, etc, etc.

One technique to reduce code with generated source code is outlining, but I'm not sure how it's applicable.

psychon commented 1 year ago

wayland-protocols seems to use wayland-scanner to generate code for the XML from another thingie called wayland-protocols. The docs for wayland-scanner do not really say / show what it generates.

I don't know much about Wayland, but there does not seem to be anything like nested structs in there. I hacked together a quick Python script to show which XML tags exist and found ['arg', 'copyright', 'description', 'entry', 'enum', 'event', 'interface', 'protocol', 'request']. The same script run on the XML from xcb-proto finds ['allowed', 'bit', 'bitcase', 'brief', 'case', 'description', 'doc', 'enum', 'enumref', 'error', 'errorcopy', 'event', 'eventcopy', 'eventstruct', 'example', 'exprfield', 'fd', 'field', 'fieldref', 'import', 'item', 'length', 'list', 'listelement-ref', 'op', 'pad', 'paramref', 'popcount', 'reply', 'request', 'required_start_align', 'see', 'struct', 'sumof', 'switch', 'type', 'typedef', 'union', 'unop', 'value', 'xcb', 'xidtype', 'xidunion'].

Perhaps wayland was actually designed to be simple. X11 certainly was not.

One technique to reduce code with generated source code is outlining, but I'm not sure how it's applicable.

https://github.com/psychon/x11rb/blob/master/doc/generated_code.md has some introduction to the generated code and it shows how much we generate. Random example: <enum>s do not have a size in the XML (because C is so lenient about integer casts, I guess...), so for each enum, we generate code to convert to/from ever integer type that may fit, e.g. u8, u16, u32. According to the example on that page, one <enum> turned into seven <impl> blocks.

I feel like cutting down on the build time here means cutting down on features. :-(

psychon commented 1 year ago

https://fasterthanli.me/articles/why-is-my-rust-build-so-slow suggested some crimes: RUSTC_BOOTSTRAP=1 cargo rustc -p x11rb-protocol --features xfixes,randr,xkb,xinput --verbose -- -Z self-profile

summarize looks like this is just lots of code and thus taking long (first 20 lines of output):

+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| Item                                                | Self time | % of total time | Time     | Item count | Incremental result hashing time |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_module_codegen_emit_obj                        | 13.51s    | 29.391          | 13.51s   | 135        | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| typeck                                              | 5.04s     | 10.966          | 5.44s    | 17024      | 79.56ms                         |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_passes                                         | 4.72s     | 10.274          | 4.72s    | 1          | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| codegen_module                                      | 4.54s     | 9.873           | 5.24s    | 135        | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_borrowck                                        | 2.77s     | 6.028           | 5.29s    | 17024      | 6.54ms                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_module_codegen                                 | 1.50s     | 3.263           | 15.01s   | 135        | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_drops_elaborated_and_const_checked              | 1.36s     | 2.949           | 1.56s    | 17029      | 2.96ms                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| optimized_mir                                       | 973.42ms  | 2.118           | 2.53s    | 14807      | 140.88ms                        |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_built                                           | 936.56ms  | 2.038           | 1.39s    | 17024      | 235.99ms                        |

The SVG from flamegraph is not much more interesting. A third of the time is analyis, which is stuff like borrow checking and type checking. Most everything else sounds like codegen.

The chrome://tracing thingie didn't work. crox produced a 1006 MiB JSON, but chromium didn't like it. On ui.perfetto.dev, I got an out-of-memory error.

notgull commented 1 year ago

It seems that the issue is the raw amount of code generated being passed to LLVM. It would be nice if there was a feature that disabled some of the lesser used trait implementations (like the Eq and Ord ones, which are rarely used). I think that might alleviate the issue, at least partially.

kchibisov commented 1 year ago

This won't help unfortunately. The only thing that could probably help is speculating on the data and reducing the amount of matches you have. Like if you look at common code it has lots of very long matches doing basically the same thing as getting the same field from various event structs. From what I remember with X11 they sort of share the place in the struct when it comes to byte offset, because the X11 API from what I remember in Xlib uses a bunch of unions to model things.

Maybe with the help of some unsafe and speculations on how things are located in memory it could reduce at least giant matches?

Probably don't need a bunch of impls of from basic integer types for bitmasks, since you can obviously cast them and then call From.

The issue is not only about the build times, but the amount of code ending in the binary. The stuff like Eq and Ord won't help because they simply will be optimized out.

notgull commented 1 year ago

Idea similar to one I tried in winit: replace the match statement with a per-display HashMap that's keyed to take some part of the bytes and then translate that to a function pointer that parses the bytes as the relevant request/response/event/whatever. It will shift some part of the complexity from compile time to runtime (although that may be a good thing), but it should eliminate these huge matches.

psychon commented 1 year ago

I made an attempt to delete all code related to parsing requests (we normally only have to serialise them, but our xtrace example uses code for parsing; this seemed like something that could be hidden behind a feature flag). The time cargo check -p x11rb-protocol --features xfixes,randr,xkb,xinput measurement from above says 13 seconds, so this saves about 2 seconds.

The result does not properly build (e.g. x11rb is broken), but at least x11rb-protocol builds fine.

The patch for the code generator

```diff diff --git a/generator/src/generator/namespace/request.rs b/generator/src/generator/namespace/request.rs index f9d9d32..72d4d9d 100644 --- a/generator/src/generator/namespace/request.rs +++ b/generator/src/generator/namespace/request.rs @@ -193,6 +193,7 @@ pub(super) fn generate_request( header = generator.ns.header, lifetime = lifetime_block )); + /* if gathered.has_fds() { enum_cases.request_parse_cases.push(format!( "{header}::{opcode_name}_REQUEST => return \ @@ -214,6 +215,7 @@ pub(super) fn generate_request( name = name, )); } + */ emit_request_function( generator, request_def, @@ -299,7 +301,7 @@ pub(super) fn generate_request( &reply_fields, &[], false, - true, + false, StructSizeConstraint::EmbeddedLength { minimum: 32 }, true, true, @@ -316,6 +318,7 @@ pub(super) fn generate_request( )); } +/* if gathered.needs_lifetime { enum_cases.request_into_owned_cases.push(format!( "Request::{ns_prefix}{name}(req) => Request::{ns_prefix}{name}(req.into_owned()),", @@ -329,6 +332,7 @@ pub(super) fn generate_request( name = name, )); } + */ } fn generate_aux( @@ -341,7 +345,7 @@ fn generate_aux( let aux_name = format!("{}Aux", request_def.name); if switch_field.kind == xcbdefs::SwitchKind::Case { - switch::emit_switch_type(generator, switch_field, &aux_name, true, true, None, out); + switch::emit_switch_type(generator, switch_field, &aux_name, false, true, None, out); } else { let doc = format!( "Auxiliary and optional information for the `{}` function", @@ -351,7 +355,7 @@ fn generate_aux( generator, switch_field, &aux_name, - true, + false, true, Some(&doc), out, @@ -912,6 +916,7 @@ fn emit_request_struct( outln!(out, "}}"); // Parsing implementation. + /* outln!( out, "/// Parse this request given its header, its body, and any fds that go along \ @@ -1071,6 +1076,7 @@ fn emit_request_struct( }); outln!(out, "}}"); } + */ }); outln!(out, "}}"); outln!( @@ -1118,6 +1124,7 @@ fn emit_request_struct( }); outln!(out, "}}"); + /* let request_trait = if request_def.reply.is_none() { "crate::x11_utils::VoidRequest" } else if gathered.reply_has_fds { @@ -1136,6 +1143,7 @@ fn emit_request_struct( outln!(out.indent(), "type Reply = {}Reply;", name); }; outln!(out, "}}"); + */ num_slices_opt.unwrap() } diff --git a/generator/src/generator/namespace/struct_type.rs b/generator/src/generator/namespace/struct_type.rs index 73b7d86..be9f268 100644 --- a/generator/src/generator/namespace/struct_type.rs +++ b/generator/src/generator/namespace/struct_type.rs @@ -25,7 +25,7 @@ pub(super) fn emit_struct_type( out: &mut Output, ) { assert!(!generate_serialize || fields_need_serialize); - assert!(matches!(parse_size_constraint, StructSizeConstraint::None) || generate_try_parse); + //assert!(matches!(parse_size_constraint, StructSizeConstraint::None) || generate_try_parse); let deducible_fields = gather_deducible_fields(fields); diff --git a/generator/src/generator/requests_replies.rs b/generator/src/generator/requests_replies.rs index 4b29d82..6100fb1 100644 --- a/generator/src/generator/requests_replies.rs +++ b/generator/src/generator/requests_replies.rs @@ -276,6 +276,7 @@ pub(super) fn generate(out: &mut Output, module: &xcbdefs::Module, mut enum_case ); }); outln!(out, "}}"); + /* outln!( out, "/// Get the matching reply parser (if any) for this request." @@ -340,6 +341,7 @@ pub(super) fn generate(out: &mut Output, module: &xcbdefs::Module, mut enum_case outln!(out, "}}"); }); outln!(out, "}}"); + */ }); outln!(out, "}}"); outln!(out, ""); ```

Diffstat for the generated code:

$ git diff --stat generator x11rb-protocol/                                                                                                       master 37
 generator/src/generator/namespace/request.rs     |   14 +-
 generator/src/generator/namespace/struct_type.rs |    2 +-
 generator/src/generator/requests_replies.rs      |    2 +
 x11rb-protocol/src/protocol/bigreq.rs            |   30 -
 x11rb-protocol/src/protocol/composite.rs         |  180 ------
 x11rb-protocol/src/protocol/damage.rs            |  100 ---
 x11rb-protocol/src/protocol/dbe.rs               |  184 ------
 x11rb-protocol/src/protocol/dpms.rs              |  199 ------
 x11rb-protocol/src/protocol/dri2.rs              |  507 ---------------
 x11rb-protocol/src/protocol/dri3.rs              |  349 ----------
 x11rb-protocol/src/protocol/ge.rs                |   36 --
 x11rb-protocol/src/protocol/glx.rs               | 3321 -----------------------------------------------------------------------------------------------
 x11rb-protocol/src/protocol/mod.rs               | 3266 ----------------------------------------------------------------------------------------------
 x11rb-protocol/src/protocol/present.rs           |  176 ------
 x11rb-protocol/src/protocol/randr.rs             | 1591 +---------------------------------------------
 x11rb-protocol/src/protocol/record.rs            |  222 -------
 x11rb-protocol/src/protocol/render.rs            | 1222 -----------------------------------
 x11rb-protocol/src/protocol/res.rs               |  219 -------
 x11rb-protocol/src/protocol/screensaver.rs       |  298 ---------
 x11rb-protocol/src/protocol/shape.rs             |  283 ---------
 x11rb-protocol/src/protocol/shm.rs               |  242 -------
 x11rb-protocol/src/protocol/sync.rs              |  558 ----------------
 x11rb-protocol/src/protocol/xc_misc.rs           |  100 ---
 x11rb-protocol/src/protocol/xevie.rs             |  167 -----
 x11rb-protocol/src/protocol/xf86dri.rs           |  369 -----------
 x11rb-protocol/src/protocol/xf86vidmode.rs       |  839 ------------------------
 x11rb-protocol/src/protocol/xfixes.rs            |  768 ----------------------
 x11rb-protocol/src/protocol/xinerama.rs          |  198 ------
 x11rb-protocol/src/protocol/xinput.rs            | 2248 -----------------------------------------------------------------
 x11rb-protocol/src/protocol/xkb.rs               | 2383 +-------------------------------------------------------------------
 x11rb-protocol/src/protocol/xprint.rs            |  667 --------------------
 x11rb-protocol/src/protocol/xproto.rs            | 4595 ------------------------------------------------------------------------------------------------------------------------------------
 x11rb-protocol/src/protocol/xselinux.rs          |  691 --------------------
 x11rb-protocol/src/protocol/xtest.rs             |  110 ----
 x11rb-protocol/src/protocol/xv.rs                |  642 -------------------
 x11rb-protocol/src/protocol/xvmc.rs              |  267 --------
 36 files changed, 24 insertions(+), 27021 deletions(-)

One one hand, saving two seconds is not much. On the other hand, saving about 15% of time is a relative large gain. Dunno how that number evolved when the code is not simply deleted, but actually hidden behind a flag.

Besides that, not much caught my eye when looking through doc/generated_code.md. We could drop serialize_into from the Serialize trait. Or we could hide impl From<&FooEvent> for [u8; 32] behind a flag (since that is only necessary for conn.send_event(), which perhaps is not used much?). Some of the trait-business around request sending seems "too much" and might be able to be cut down a bit, but this part is complicated (there is a free function for the request calling a serialize function and in parallel there is a impl Request for FooRequest that also does stuff; I don't know where to cut).

Hm... perhaps the switch stuff can be simplified to the compiler with helper function? I am talking about the FooAux stuff that generates long chains of code for each case.

psychon commented 1 year ago

It would be nice if there was a feature that disabled some of the lesser used trait implementations (like the Eq and Ord ones, which are rarely used). I

That helped surprisingly much. Removing Debug, Copy, Clone breaks the build. Removing everything else makes time cargo check -p x11rb-protocol --features xfixes,randr,xkb,xinput take 11.3 s instead of 15 s. 30% reduction in compile time.

diff --git a/generator/src/generator/namespace/helpers.rs b/generator/src/generator/namespace/helpers.rs
index f57ebb2..2c95235 100644
--- a/generator/src/generator/namespace/helpers.rs
+++ b/generator/src/generator/namespace/helpers.rs
@@ -225,12 +225,12 @@ impl Derives {
             debug: true,
             clone: true,
             copy: true,
-            default_: true,
-            partial_eq: true,
-            eq: true,
-            partial_ord: true,
-            ord: true,
-            hash: true,
+            default_: false,
+            partial_eq: false,
+            eq: false,
+            partial_ord: false,
+            ord: false,
+            hash: false,
         }
     }

Combined with yesterday's patch, the result is 10.4 seconds. Uhm.... I would have expected more.

psychon commented 12 months ago

I was testing with the cargo build --timings and cargo build --release --timings on the tip of the alacritty/alacritty repo. You could do that on the winit repo alone for example.

Okay, so here are some measurements for winit bbeacc46d52c9f0455a376efb4d561b33b14e8a1 using Debian's cargo 1.65 and rustc 1.70. I used [patch.crates-io] to override the x11rb and x11rb-protocol versions with my local checkout (master from a week ago (commit 89a3197e6ea7704a3561427404f25e76daae4a1c) and current master (4fcd1c632b328a8c554301bcb2cbcda6cc096282)) for the second build. So, I am testing the effect that #884 had.

For non-release x11rb, I had to apply the following patch to winit for it to build

```diff diff --git a/src/platform_impl/linux/x11/window.rs b/src/platform_impl/linux/x11/window.rs index bab2f8c..e7c2f0e 100644 --- a/src/platform_impl/linux/x11/window.rs +++ b/src/platform_impl/linux/x11/window.rs @@ -1322,7 +1322,7 @@ impl UnownedWindow { self.xwindow as xproto::Window, xproto::AtomEnum::WM_NORMAL_HINTS, )? - .reply()?; + .reply()?.unwrap(); callback(&mut normal_hints); normal_hints .set( @@ -1375,6 +1375,7 @@ impl UnownedWindow { ) .ok() .and_then(|cookie| cookie.reply().ok()) + .flatten() .and_then(|hints| hints.size_increment) .map(|(width, height)| (width as u32, height as u32).into()) } @@ -1764,6 +1765,7 @@ impl UnownedWindow { WmHints::get(self.xconn.xcb_connection(), self.xwindow as xproto::Window) .ok() .and_then(|cookie| cookie.reply().ok()) + .flatten() .unwrap_or_default(); wm_hints.urgent = request_type.is_some(); ```

I used cargo build --timings --no-default-features --features x11 and the same with --release added. The time shown are Total time and the time for x11rb-protocol according to --timings

	latest x11rb release	old x11rb master	current x11rb master
debug build 1	31.5s and 24.4s	29.1s and 27.3s	23.9s and 20.9s
debug build 2	31.6s and 24.6s	30.1s and 27.0s	23.8s and 20.6s
release build 1	45s and 39.1s	38.5s and 36.9s	31.9s and 30.5s
release build 2	44.9s and 38.7s	38.8s and 37.2s	32.2s and 30.0s

So... #883 clearly helped, but we are still dominating build times. And somehow the total build time goes down when patching from latest x11rb release to my local, newer copy, but the times for x11rb-protocol goes up (for a debug build)?!

kchibisov commented 12 months ago

So... https://github.com/psychon/x11rb/issues/883 clearly helped, but we are still dominating build times. And somehow the total build time goes down when patching from latest x11rb release to my local, newer copy, but the times for x11rb-protocol goes up (for a debug build)?!

I suggest to test single threaded build, because modern CPUs schedule things weirdly and the core used for longer crates could have less clock. With massive build times like we have, general testing could be done 'normally', since any improvement will be noticeable.

I'm afraid though, that feature gating won't help, unless it's enforced, since other crates could enforce features not used by winit (x11-clipboard) which will result in basically no perf changes. Though, we could just suggest everyone to disable default features as of now.

psychon commented 12 months ago

Though, we could just suggest everyone to disable default features as of now.

The new features are not enabled by default (at least for x11rb, x11rb-protocol has a default feature, but is used by x11rb with default features disabled). So, no need to have everyone disable default features for this.

psychon / x11rb

x11rb-protocol takes very long to build #883