Open kchibisov opened 1 year ago
Random guess would be "lots of XML for protocol stuff resulting in even more generated Rust code that rustc
has to cope with".
For example alacritty on my system builds in around 20seconds, and each of the x11rb-protocol takes around 12s (I have 32 threads, so it doesn't really matter, but on slow systems it could really decrease perf).
I was curious about enabled features, but that is apparently not so easy to figure out just with a web browser.
According to https://github.com/alacritty/alacritty/blob/a58fb39b68caa34b073f66911c0ac6945f56eac2/Cargo.lock, a dependency on x11rb 0.12.0 comes from winit. x11-clipboard depends on x11rb 0.10.1
x11-clipboard just enables xfixes
(and apparently has already been updated to 0.12.0 in git): https://github.com/quininer/x11-clipboard/blob/c7fd2a93c93a7690478953db87de849d47ce9f2a/Cargo.toml#L12
winit enables allow-unsafe-code
, dl-libxcb
, randr
, resource_manager
, xinput
, xkb
: https://github.com/rust-windowing/winit/blob/master/Cargo.toml#L155
I am ignoring the non-X11-extension features (since I kinda expect the generated code to be the problem since it really is huge). This leads to the following files being enabled to build in x11rb-protocol/src/protocol/
:
xfixes.rs
randr.rs
xkb.rs
xinput.rs
xproto.rs
For a grand total of of 2.694.863 bytes / 71.008 lines of generated code. (mod.rs
in that file also has some more lines that are gates behind these features.)And random time measurements with cargo check
(which of course does less than cargo build
, but I am just trying to measure the influence of the extensions). I am running cargo clean && time cargo check -p x11rb-protocol $EXTRA_STUFF
:
--no-default-features
: 4.9 seconds of wall clock time--features resource_manager
: 5.3 seconds--features xfixes,randr,xkb,xinput
: 15.3 secondsSo, to me this is clearly due to the huge generated code files.
However, that still does not provide any hint on what to do about this. Adding more feature flags doesn't sound like a workable solution.
I guess the only way is to reduce the amount of things being generated, since for example wayland stuff is pretty fast to generate on the other side and we use a lot of protocols.
I'm not familiar with X11 internals though to say whether what you generate makes sense or not in the end of the day, but the amount of code you say sounds like really a lot (it's like alarcitty + glutin + winit + some wayland crates combined)... Especially when those types have derive
impls, etc, etc.
One technique to reduce code with generated source code is outlining
, but I'm not sure how it's applicable.
wayland-protocols
seems to use wayland-scanner
to generate code for the XML from another thingie called wayland-protocols
. The docs for wayland-scanner
do not really say / show what it generates.
I don't know much about Wayland, but there does not seem to be anything like nested structs in there. I hacked together a quick Python script to show which XML tags exist and found ['arg', 'copyright', 'description', 'entry', 'enum', 'event', 'interface', 'protocol', 'request']
. The same script run on the XML from xcb-proto finds ['allowed', 'bit', 'bitcase', 'brief', 'case', 'description', 'doc', 'enum', 'enumref', 'error', 'errorcopy', 'event', 'eventcopy', 'eventstruct', 'example', 'exprfield', 'fd', 'field', 'fieldref', 'import', 'item', 'length', 'list', 'listelement-ref', 'op', 'pad', 'paramref', 'popcount', 'reply', 'request', 'required_start_align', 'see', 'struct', 'sumof', 'switch', 'type', 'typedef', 'union', 'unop', 'value', 'xcb', 'xidtype', 'xidunion']
.
Perhaps wayland was actually designed to be simple. X11 certainly was not.
One technique to reduce code with generated source code is outlining, but I'm not sure how it's applicable.
https://github.com/psychon/x11rb/blob/master/doc/generated_code.md has some introduction to the generated code and it shows how much we generate. Random example: <enum>
s do not have a size in the XML (because C is so lenient about integer casts, I guess...), so for each enum, we generate code to convert to/from ever integer type that may fit, e.g. u8
, u16
, u32
. According to the example on that page, one <enum>
turned into seven <impl>
blocks.
I feel like cutting down on the build time here means cutting down on features. :-(
https://fasterthanli.me/articles/why-is-my-rust-build-so-slow suggested some crimes: RUSTC_BOOTSTRAP=1 cargo rustc -p x11rb-protocol --features xfixes,randr,xkb,xinput --verbose -- -Z self-profile
summarize
looks like this is just lots of code and thus taking long (first 20 lines of output):
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| Item | Self time | % of total time | Time | Item count | Incremental result hashing time |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_module_codegen_emit_obj | 13.51s | 29.391 | 13.51s | 135 | 0.00ns |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| typeck | 5.04s | 10.966 | 5.44s | 17024 | 79.56ms |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_passes | 4.72s | 10.274 | 4.72s | 1 | 0.00ns |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| codegen_module | 4.54s | 9.873 | 5.24s | 135 | 0.00ns |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_borrowck | 2.77s | 6.028 | 5.29s | 17024 | 6.54ms |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_module_codegen | 1.50s | 3.263 | 15.01s | 135 | 0.00ns |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_drops_elaborated_and_const_checked | 1.36s | 2.949 | 1.56s | 17029 | 2.96ms |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| optimized_mir | 973.42ms | 2.118 | 2.53s | 14807 | 140.88ms |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_built | 936.56ms | 2.038 | 1.39s | 17024 | 235.99ms |
The SVG from flamegraph
is not much more interesting. A third of the time is analyis
, which is stuff like borrow checking and type checking. Most everything else sounds like codegen.
The chrome://tracing
thingie didn't work. crox
produced a 1006 MiB JSON, but chromium didn't like it. On ui.perfetto.dev, I got an out-of-memory error.
It seems that the issue is the raw amount of code generated being passed to LLVM. It would be nice if there was a feature that disabled some of the lesser used trait implementations (like the Eq
and Ord
ones, which are rarely used). I think that might alleviate the issue, at least partially.
This won't help unfortunately. The only thing that could probably help is speculating on the data and reducing the amount of matches you have. Like if you look at common
code it has lots of very long matches doing basically the same thing as getting the same field from various event structs. From what I remember with X11
they sort of share the place in the struct when it comes to byte offset, because the X11 API from what I remember in Xlib uses a bunch of unions to model things.
Maybe with the help of some unsafe and speculations on how things are located in memory it could reduce at least giant matches?
Probably don't need a bunch of impls of from basic integer types for bitmasks, since you can obviously cast them and then call From
.
The issue is not only about the build times, but the amount of code ending in the binary. The stuff like Eq
and Ord
won't help because they simply will be optimized out.
Idea similar to one I tried in winit
: replace the match statement with a per-display HashMap
that's keyed to take some part of the bytes and then translate that to a function pointer that parses the bytes as the relevant request/response/event/whatever. It will shift some part of the complexity from compile time to runtime (although that may be a good thing), but it should eliminate these huge matches.
I made an attempt to delete all code related to parsing requests (we normally only have to serialise them, but our xtrace
example uses code for parsing; this seemed like something that could be hidden behind a feature flag). The time cargo check -p x11rb-protocol --features xfixes,randr,xkb,xinput
measurement from above says 13 seconds, so this saves about 2 seconds.
The result does not properly build (e.g. x11rb is broken), but at least x11rb-protocol builds fine.
Diffstat for the generated code:
$ git diff --stat generator x11rb-protocol/ master 37
generator/src/generator/namespace/request.rs | 14 +-
generator/src/generator/namespace/struct_type.rs | 2 +-
generator/src/generator/requests_replies.rs | 2 +
x11rb-protocol/src/protocol/bigreq.rs | 30 -
x11rb-protocol/src/protocol/composite.rs | 180 ------
x11rb-protocol/src/protocol/damage.rs | 100 ---
x11rb-protocol/src/protocol/dbe.rs | 184 ------
x11rb-protocol/src/protocol/dpms.rs | 199 ------
x11rb-protocol/src/protocol/dri2.rs | 507 ---------------
x11rb-protocol/src/protocol/dri3.rs | 349 ----------
x11rb-protocol/src/protocol/ge.rs | 36 --
x11rb-protocol/src/protocol/glx.rs | 3321 -----------------------------------------------------------------------------------------------
x11rb-protocol/src/protocol/mod.rs | 3266 ----------------------------------------------------------------------------------------------
x11rb-protocol/src/protocol/present.rs | 176 ------
x11rb-protocol/src/protocol/randr.rs | 1591 +---------------------------------------------
x11rb-protocol/src/protocol/record.rs | 222 -------
x11rb-protocol/src/protocol/render.rs | 1222 -----------------------------------
x11rb-protocol/src/protocol/res.rs | 219 -------
x11rb-protocol/src/protocol/screensaver.rs | 298 ---------
x11rb-protocol/src/protocol/shape.rs | 283 ---------
x11rb-protocol/src/protocol/shm.rs | 242 -------
x11rb-protocol/src/protocol/sync.rs | 558 ----------------
x11rb-protocol/src/protocol/xc_misc.rs | 100 ---
x11rb-protocol/src/protocol/xevie.rs | 167 -----
x11rb-protocol/src/protocol/xf86dri.rs | 369 -----------
x11rb-protocol/src/protocol/xf86vidmode.rs | 839 ------------------------
x11rb-protocol/src/protocol/xfixes.rs | 768 ----------------------
x11rb-protocol/src/protocol/xinerama.rs | 198 ------
x11rb-protocol/src/protocol/xinput.rs | 2248 -----------------------------------------------------------------
x11rb-protocol/src/protocol/xkb.rs | 2383 +-------------------------------------------------------------------
x11rb-protocol/src/protocol/xprint.rs | 667 --------------------
x11rb-protocol/src/protocol/xproto.rs | 4595 ------------------------------------------------------------------------------------------------------------------------------------
x11rb-protocol/src/protocol/xselinux.rs | 691 --------------------
x11rb-protocol/src/protocol/xtest.rs | 110 ----
x11rb-protocol/src/protocol/xv.rs | 642 -------------------
x11rb-protocol/src/protocol/xvmc.rs | 267 --------
36 files changed, 24 insertions(+), 27021 deletions(-)
One one hand, saving two seconds is not much. On the other hand, saving about 15% of time is a relative large gain. Dunno how that number evolved when the code is not simply deleted, but actually hidden behind a flag.
Besides that, not much caught my eye when looking through doc/generated_code.md
. We could drop serialize_into
from the Serialize
trait. Or we could hide impl From<&FooEvent> for [u8; 32]
behind a flag (since that is only necessary for conn.send_event()
, which perhaps is not used much?). Some of the trait-business around request sending seems "too much" and might be able to be cut down a bit, but this part is complicated (there is a free function for the request calling a serialize
function and in parallel there is a impl Request for FooRequest
that also does stuff; I don't know where to cut).
Hm... perhaps the switch stuff can be simplified to the compiler with helper function? I am talking about the FooAux
stuff that generates long chains of code for each case.
It would be nice if there was a feature that disabled some of the lesser used trait implementations (like the Eq and Ord ones, which are rarely used). I
That helped surprisingly much. Removing Debug
, Copy
, Clone
breaks the build. Removing everything else makes time cargo check -p x11rb-protocol --features xfixes,randr,xkb,xinput
take 11.3 s instead of 15 s. 30% reduction in compile time.
diff --git a/generator/src/generator/namespace/helpers.rs b/generator/src/generator/namespace/helpers.rs
index f57ebb2..2c95235 100644
--- a/generator/src/generator/namespace/helpers.rs
+++ b/generator/src/generator/namespace/helpers.rs
@@ -225,12 +225,12 @@ impl Derives {
debug: true,
clone: true,
copy: true,
- default_: true,
- partial_eq: true,
- eq: true,
- partial_ord: true,
- ord: true,
- hash: true,
+ default_: false,
+ partial_eq: false,
+ eq: false,
+ partial_ord: false,
+ ord: false,
+ hash: false,
}
}
Combined with yesterday's patch, the result is 10.4 seconds. Uhm.... I would have expected more.
I was testing with the cargo build --timings and cargo build --release --timings on the tip of the alacritty/alacritty repo. You could do that on the winit repo alone for example.
Okay, so here are some measurements for winit
bbeacc46d52c9f0455a376efb4d561b33b14e8a1 using Debian's cargo 1.65 and rustc 1.70. I used [patch.crates-io]
to override the x11rb and x11rb-protocol versions with my local checkout (master from a week ago (commit 89a3197e6ea7704a3561427404f25e76daae4a1c) and current master (4fcd1c632b328a8c554301bcb2cbcda6cc096282)) for the second build. So, I am testing the effect that #884 had.
I used cargo build --timings --no-default-features --features x11
and the same with --release
added. The time shown are Total time
and the time for x11rb-protocol
according to --timings
latest x11rb release | old x11rb master | current x11rb master | |
---|---|---|---|
debug build 1 | 31.5s and 24.4s | 29.1s and 27.3s | 23.9s and 20.9s |
debug build 2 | 31.6s and 24.6s | 30.1s and 27.0s | 23.8s and 20.6s |
release build 1 | 45s and 39.1s | 38.5s and 36.9s | 31.9s and 30.5s |
release build 2 | 44.9s and 38.7s | 38.8s and 37.2s | 32.2s and 30.0s |
So... #883 clearly helped, but we are still dominating build times. And somehow the total build time goes down when patching from latest x11rb release to my local, newer copy, but the times for x11rb-protocol goes up (for a debug build)?!
So... https://github.com/psychon/x11rb/issues/883 clearly helped, but we are still dominating build times. And somehow the total build time goes down when patching from latest x11rb release to my local, newer copy, but the times for x11rb-protocol goes up (for a debug build)?!
I suggest to test single threaded build, because modern CPUs schedule things weirdly and the core used for longer crates could have less clock. With massive build times like we have, general testing could be done 'normally', since any improvement will be noticeable.
I'm afraid though, that feature gating won't help, unless it's enforced, since other crates could enforce features not used by winit (x11-clipboard) which will result in basically no perf changes. Though, we could just suggest everyone to disable default features as of now.
Though, we could just suggest everyone to disable default features as of now.
The new features are not enabled by default (at least for x11rb, x11rb-protocol has a default feature, but is used by x11rb with default features disabled). So, no need to have everyone disable default features for this.
During profiling both
debug
andrelease
builds of both alacritty and winit, I've noticed thatx11rb-protocol
builds around 5x longer than most massive deps like (serde).And if you happen to have 2 copies of
x11rb-protocol
in your tree, it'll really slow things down.For example alacritty on my system builds in around 20seconds, and each of the
x11rb-protocol
takes around 12s (I have 32 threads, so it doesn't really matter, but on slow systems it could really decrease perf).I was testing with the
cargo build --timings
andcargo build --release --timings
on thetip
of thealacritty/alacritty
repo. You could do that on the winit repo alone for example.It was used on both
rustc 1.72.0
andrustc 1.74.0-nightly (0288f2e19 2023-09-25)
showing the exact same compilation times.