**Closed** · ianprime0509 closed 1 year ago
Using poop, codegen for Gtk-4.0 takes over 1.5x as long with a `ReleaseFast` build of zig-gobject:
```
Benchmark 1 (10 runs): ../zig-gobject-original/zig-out/bin/zig-gobject lib/gir-files/ extras/ zig-out/gir-out Gtk-4.0
  measurement          mean ± σ                    min … max                   outliers   delta
  wall_time            515.761ms ± 7.702ms         506.033ms … 532.834ms      0 ( 0%)    0%
  peak_rss             127M ± 78K                  127M … 127M                0 ( 0%)    0%
  cpu_cycles           1518157250 ± 14524726       1498932026 … 1544057560   0 ( 0%)    0%
  instructions         2597145399 ± 516126         2596680599 … 2598024155   0 ( 0%)    0%
  cache_references     50149282 ± 895291           48957364 … 51935356       0 ( 0%)    0%
  cache_misses         8018954 ± 46093             7932480 … 8075574         0 ( 0%)    0%
  branch_misses        3762985 ± 26402             3728587 … 3810838         0 ( 0%)    0%
Benchmark 2 (7 runs): zig-out/bin/zig-gobject lib/gir-files/ extras/ zig-out/gir-out Gtk-4.0
  measurement          mean ± σ                    min … max                   outliers   delta
  wall_time            800.576ms ± 3.461ms         796.293ms … 807.2ms        0 ( 0%)    💩+ 55.2% ±  1.3%
  peak_rss             28M ± 0B                    28M … 28M                  0 ( 0%)    ⚡- 77.2% ±  0.1%
  cpu_cycles           2848621170 ± 8349124        2841173707 … 2866109369   0 ( 0%)    💩+ 87.6% ±  0.9%
  instructions         7766704799 ± 282            7766704397 … 7766705263   0 ( 0%)    💩+199.0% ±  0.0%
  cache_references     28141336 ± 1845913          27127852 … 32309128       0 ( 0%)    ⚡- 43.9% ±  2.8%
  cache_misses         538034 ± 35820              486769 … 589172           0 ( 0%)    ⚡- 93.3% ±  0.6%
  branch_misses        5481095 ± 9824              5469430 … 5498793         0 ( 0%)    💩+ 45.7% ±  0.6%
```
This is not an entirely fair comparison, though, since the old version read the entire contents of the file before parsing it into a tree. (So, for example, the memory usage improvement is not directly attributable to zig-xml, but rather to the more efficient means of processing the file.)
My (amateur) inspection of profiling results using callgrind suggests that `TokenReader` is where most of the overhead is introduced, with notable contributions coming from UTF-8 decoding and re-encoding (as well as general `memcpy` usage which I can't narrow down further).
Also, it's worth noting that, presumably due to https://github.com/ziglang/zig/issues/15254, using callgrind without linking libc produces unusable output (no function names), and for some reason merely linking libc (with no other changes) adds an extra 60-70ms on top of the execution time from the benchmark. It's not clear to me why that is, but I doubt that performance difference has anything to do with this library.
I'm trying some benchmarking by running the `read` example on `Gtk-4.0.gir` (one of the largest XML files I know of that the current parser can handle, given its lack of doctype support), and surprisingly, it seems that even just changing the decoder in the example from `DefaultDecoder` to `Utf8Decoder` gives a 12% performance improvement (both versions compiled with `ReleaseFast`):
```
Benchmark 1 (8 runs): zig-out/bin/read-original ../zig-gobject/lib/gir-files/Gtk-4.0.gir
  measurement          mean ± σ                    min … max                   outliers   delta
  wall_time            687.219ms ± 3.384ms         683.516ms … 692.924ms      0 ( 0%)    0%
  peak_rss             1M ± 1K                     1M … 1M                    2 (25%)    0%
  cpu_cycles           1379362176 ± 2291238        1375929057 … 1382993883   0 ( 0%)    0%
  instructions         2893338168 ± 177            2893338053 … 2893338567   0 ( 0%)    0%
  cache_references     54136262 ± 602029           53572664 … 55353300       0 ( 0%)    0%
  cache_misses         756116 ± 226961             531048 … 1207533          0 ( 0%)    0%
  branch_misses        2372531 ± 7324              2362030 … 2380112         0 ( 0%)    0%
Benchmark 2 (9 runs): zig-out/bin/read ../zig-gobject/lib/gir-files/Gtk-4.0.gir
  measurement          mean ± σ                    min … max                   outliers   delta
  wall_time            603.89ms ± 2.182ms          598.962ms … 606.579ms      1 (11%)    ⚡- 12.1% ±  0.4%
  peak_rss             1M ± 0B                     1M … 1M                    0 ( 0%)    💩+ 14.6% ±  0.1%
  cpu_cycles           1055716613 ± 1352675        1053473878 … 1057272201   0 ( 0%)    ⚡- 23.5% ±  0.1%
  instructions         2639884508 ± 43             2639884476 … 2639884619   1 (11%)    ⚡-  8.8% ±  0.0%
  cache_references     51357417 ± 308444           50973223 … 51786092       0 ( 0%)    ⚡-  5.1% ±  0.9%
  cache_misses         667165 ± 101607             549794 … 831747           0 ( 0%)     - 11.8% ± 23.5%
  branch_misses        2330792 ± 5081              2325442 … 2337887         0 ( 0%)    ⚡-  1.8% ±  0.3%
```
However, this benchmark seems sus, because it says memory usage increased by over 14%, which makes no sense because `DefaultDecoder` uses a `Utf8Decoder` internally after detecting the encoding as UTF-8(???). Update: it appears this is an issue with poop, because switching the sides of the benchmark reverses all the measurements (as expected) except for `cache_misses` (which has way too much variance to be useful here) and `peak_rss`, which is consistently around +14% regardless of the benchmark order. In other words, this metric will not be useful for this benchmark right now; I'll focus on the other metrics for now.
To get a sense of the maximum performance benefit that could possibly be obtained by optimizing the decoding step, I did a benchmark using a `StupidDecoder` which always interprets a byte directly as a Unicode codepoint (in other words, an ISO 8859-1 decoder with no validation of the encoding declaration). The performance improvement was around 7% relative to the `Utf8Decoder` version.
A further experiment which also eliminated the subsequent UTF-8 encoding step (by assuming every codepoint is just a byte) yielded no measurable performance improvement relative to the `StupidDecoder` version (and was still around 7% faster than the original `Utf8Decoder` version, as a second comparison).
As another data point, removing the end-of-line and attribute value normalization logic (making `nextCodepoint` just return `nextCodepointRaw`) yielded around a 7% performance improvement on its own, so allowing this logic to be disabled would be a good candidate for an optional feature.
OK, so on top of this, I really should have listened to the people who say `GeneralPurposeAllocator` is slow, because merely swapping it out for `page_allocator` in the example program yields a nearly 20% performance improvement with the UTF-8 decoder (around 17% with the default decoder) 😂
I'm going to close this, since there is a reasonably good benchmarking setup present now, and I have improved performance considerably since this issue was opened. There is certainly more work to do on performance in general, though.
After migrating zig-gobject to this library (from libxml2), codegen got significantly slower (I didn't measure it, but it felt at least 2-3x slower). This issue is for performance analysis to figure out why it's so slow.