ianprime0509 / zig-xml

XML parser for Zig
http://ianjohnson.dev/zig-xml/
BSD Zero Clause License
14 stars 3 forks

Performance analysis and profiling #19

Closed ianprime0509 closed 1 year ago

ianprime0509 commented 1 year ago

After migrating zig-gobject to this library (from libxml2), codegen got significantly slower (I didn't measure it, but it felt at least 2-3x slower). This issue is for performance analysis to figure out why it's so slow.

ianprime0509 commented 1 year ago

Using poop, codegen for Gtk-4.0 takes about 1.55x as long using a ReleaseFast build of zig-gobject:

Benchmark 1 (10 runs): ../zig-gobject-original/zig-out/bin/zig-gobject lib/gir-files/ extras/ zig-out/gir-out Gtk-4.0
  measurement      mean ± σ               min … max                   outliers         delta
  wall_time        515.761ms ± 7.702ms    506.033ms … 532.834ms         0 ( 0%)        0%
  peak_rss         127M ± 78K             127M … 127M                   0 ( 0%)        0%
  cpu_cycles       1518157250 ± 14524726  1498932026 … 1544057560       0 ( 0%)        0%
  instructions     2597145399 ± 516126    2596680599 … 2598024155       0 ( 0%)        0%
  cache_references 50149282 ± 895291      48957364 … 51935356           0 ( 0%)        0%
  cache_misses     8018954 ± 46093        7932480 … 8075574             0 ( 0%)        0%
  branch_misses    3762985 ± 26402        3728587 … 3810838             0 ( 0%)        0%
Benchmark 2 (7 runs): zig-out/bin/zig-gobject lib/gir-files/ extras/ zig-out/gir-out Gtk-4.0
  measurement      mean ± σ               min … max                   outliers         delta
  wall_time        800.576ms ± 3.461ms    796.293ms … 807.2ms           0 ( 0%)        💩+ 55.2% ±  1.3%
  peak_rss         28M ± 0B               28M … 28M                     0 ( 0%)        ⚡- 77.2% ±  0.1%
  cpu_cycles       2848621170 ± 8349124   2841173707 … 2866109369       0 ( 0%)        💩+ 87.6% ±  0.9%
  instructions     7766704799 ± 282       7766704397 … 7766705263       0 ( 0%)        💩+199.0% ±  0.0%
  cache_references 28141336 ± 1845913     27127852 … 32309128           0 ( 0%)        ⚡- 43.9% ±  2.8%
  cache_misses     538034 ± 35820         486769 … 589172               0 ( 0%)        ⚡- 93.3% ±  0.6%
  branch_misses    5481095 ± 9824         5469430 … 5498793             0 ( 0%)        💩+ 45.7% ±  0.6%

This is not an entirely fair comparison, though, since the old version read the entire contents of the file before parsing it into a tree. (So, for example, the memory usage improvement is not directly attributable to zig-xml, but rather to the more efficient means of processing the file.)

My (amateur) inspection of profiling results using callgrind suggests that TokenReader is where most of the overhead is introduced, with notable contributions coming from UTF-8 decoding and re-encoding (as well as general memcpy usage that I can't narrow down further).
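To make the suspected hot path concrete, here is a standalone sketch (not zig-xml's actual code) of the kind of per-codepoint round trip described above: each byte sequence is decoded to a `u21` codepoint and then immediately re-encoded back to UTF-8 when content is accumulated, so the decode work is largely thrown away:

```zig
const std = @import("std");

/// Illustrative only: decode every codepoint, then re-encode it right away.
/// When input and output are both UTF-8, this round trip is pure overhead
/// compared to copying the validated bytes directly.
pub fn roundTrip(input: []const u8, out: *std.ArrayList(u8)) !void {
    var it = (try std.unicode.Utf8View.init(input)).iterator();
    while (it.nextCodepoint()) |cp| {
        var buf: [4]u8 = undefined;
        const len = try std.unicode.utf8Encode(cp, &buf);
        try out.appendSlice(buf[0..len]);
    }
}
```

A common optimization here is to validate once and then slice the original buffer instead of re-encoding codepoint by codepoint.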

ianprime0509 commented 1 year ago

Also, it's worth noting that, presumably due to https://github.com/ziglang/zig/issues/15254, using callgrind without linking libc results in unusable output (no function names). For some reason, just linking libc (with no other changes) adds an extra 60-70ms on top of the execution time from the benchmark. It's not clear to me why, but I doubt the performance difference has anything to do with this library.
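For reference, linking libc for a profiling build can be done in build.zig. This is a sketch assuming a Zig 0.11-style build script; the executable name and source path are hypothetical:

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});
    const exe = b.addExecutable(.{
        .name = "read",
        .root_source_file = .{ .path = "examples/read.zig" },
        .target = target,
        .optimize = optimize,
    });
    // Linking libc works around the missing function names in callgrind
    // output, at the cost of the extra runtime noted above.
    exe.linkLibC();
    b.installArtifact(exe);
}
```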

ianprime0509 commented 1 year ago

I'm trying some benchmarking by running the read example on Gtk-4.0.gir (one of the largest XML files I know of that the current parser can handle, given its lack of doctype support), and surprisingly, it seems that merely changing the decoder in the example from DefaultDecoder to Utf8Decoder gives a 12% performance improvement (both versions compiled with ReleaseFast):

Benchmark 1 (8 runs): zig-out/bin/read-original ../zig-gobject/lib/gir-files/Gtk-4.0.gir
  measurement      mean ± σ               min … max                   outliers         delta
  wall_time        687.219ms ± 3.384ms    683.516ms … 692.924ms         0 ( 0%)        0%
  peak_rss         1M ± 1K                1M … 1M                       2 (25%)        0%
  cpu_cycles       1379362176 ± 2291238   1375929057 … 1382993883       0 ( 0%)        0%
  instructions     2893338168 ± 177       2893338053 … 2893338567       0 ( 0%)        0%
  cache_references 54136262 ± 602029      53572664 … 55353300           0 ( 0%)        0%
  cache_misses     756116 ± 226961        531048 … 1207533              0 ( 0%)        0%
  branch_misses    2372531 ± 7324         2362030 … 2380112             0 ( 0%)        0%
Benchmark 2 (9 runs): zig-out/bin/read ../zig-gobject/lib/gir-files/Gtk-4.0.gir
  measurement      mean ± σ               min … max                   outliers         delta
  wall_time        603.89ms ± 2.182ms     598.962ms … 606.579ms         1 (11%)        ⚡- 12.1% ±  0.4%
  peak_rss         1M ± 0B                1M … 1M                       0 ( 0%)        💩+ 14.6% ±  0.1%
  cpu_cycles       1055716613 ± 1352675   1053473878 … 1057272201       0 ( 0%)        ⚡- 23.5% ±  0.1%
  instructions     2639884508 ± 43        2639884476 … 2639884619       1 (11%)        ⚡-  8.8% ±  0.0%
  cache_references 51357417 ± 308444      50973223 … 51786092           0 ( 0%)        ⚡-  5.1% ±  0.9%
  cache_misses     667165 ± 101607        549794 … 831747               0 ( 0%)          - 11.8% ± 23.5%
  branch_misses    2330792 ± 5081         2325442 … 2337887             0 ( 0%)        ⚡-  1.8% ±  0.3%

However, this benchmark seems sus, because it says memory usage increased by over 14%, which makes no sense: DefaultDecoder uses a Utf8Decoder internally after detecting the encoding as UTF-8(???). Update: this appears to be an issue with poop, because swapping the sides of the benchmark reverses all the measurements (as expected) except for cache_misses (which has way too much variance to be useful here) and peak_rss, which is consistently around +14% regardless of the benchmark order. In other words, this metric will not be useful for this benchmark right now; I'll focus on the other metrics.

ianprime0509 commented 1 year ago

To get a sense of the maximum performance benefit that could possibly be obtained by optimizing the decoding step, I did a benchmark using a StupidDecoder which always interprets a byte directly as a Unicode codepoint (so, in other words, an ISO 8859-1 decoder with no validation of the encoding declaration). The performance improvement was around 7% relative to the Utf8Decoder version.

A further experiment which eliminated the subsequent UTF-8 encoding step (by assuming every codepoint is just a byte) yielded no measurable performance improvement relative to the StupidDecoder version (and was still around 7% faster than the original Utf8Decoder version, as a second comparison).
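The two experiments above amount to something like the following standalone sketch (the real zig-xml decoder interface may differ; the names here follow the issue text). Every byte is taken directly as a codepoint, i.e. Latin-1 with no validation of the declared encoding:

```zig
const std = @import("std");

/// Sketch of the "StupidDecoder" experiment: each byte is the codepoint.
/// The follow-up experiment additionally skipped UTF-8 re-encoding by
/// assuming every codepoint fits back into a single byte.
const StupidDecoder = struct {
    pub fn nextCodepoint(self: *StupidDecoder, reader: anytype) !?u21 {
        _ = self;
        const b = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => return null,
            else => |e| return e,
        };
        return b; // u8 coerces to u21; no multi-byte sequences, no validation
    }
};
```

Since eliminating the re-encoding step on top of this gained nothing measurable, the decode side (not the encode side) appears to be where the ~7% ceiling lives.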

ianprime0509 commented 1 year ago

As another data point, removing the end-of-line and attribute value normalization logic (making nextCodepoint just return nextCodepointRaw) yielded around a 7% performance improvement on its own, so allowing this logic to be disabled would be a good candidate for an optional feature.
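For context, the normalization being measured is XML 1.0's end-of-line handling (§2.11): "\r\n" and a lone "\r" both become "\n". A hypothetical standalone model of the layering described above (nextCodepoint wrapping nextCodepointRaw; ASCII-only for brevity, the real reader is codepoint-based):

```zig
const NormalizingReader = struct {
    bytes: []const u8,
    pos: usize = 0,

    /// The raw layer: hand out input units with no normalization.
    fn nextCodepointRaw(self: *NormalizingReader) ?u21 {
        if (self.pos >= self.bytes.len) return null;
        const b = self.bytes[self.pos];
        self.pos += 1;
        return b;
    }

    /// The normalizing layer: fold CRLF and lone CR into LF.
    /// This per-codepoint branch is what the ~7% improvement removed.
    fn nextCodepoint(self: *NormalizingReader) ?u21 {
        const cp = self.nextCodepointRaw() orelse return null;
        if (cp != '\r') return cp;
        // Swallow an LF that immediately follows a CR.
        if (self.pos < self.bytes.len and self.bytes[self.pos] == '\n')
            self.pos += 1;
        return '\n';
    }
};
```

Making this layer optional (e.g. via a comptime flag) would let callers who don't need normalized line endings skip the branch entirely.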

ianprime0509 commented 1 year ago

OK, so on top of this, I really should have listened to the people who say GeneralPurposeAllocator is slow, because merely swapping it out for page_allocator in the example program yields a nearly 20% performance improvement with the UTF-8 decoder (around 17% improvement with the default decoder) 😂
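The swap in question is a one-line change in the example's setup (sketch; the example's actual wiring may differ). GeneralPurposeAllocator's safety bookkeeping is costly in allocation-heavy parsing, while page_allocator does no bookkeeping but pays a syscall per allocation, so an ArenaAllocator over it is often a better middle ground for parse-then-discard workloads:

```zig
const std = @import("std");

pub fn main() !void {
    // Before: safety-checked allocator, slow in hot allocation paths.
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const slow = gpa.allocator();
    _ = slow;

    // After: no bookkeeping, but every allocation hits the OS.
    const fast = std.heap.page_allocator;
    _ = fast;

    // Possible middle ground: batch frees for the whole parse at once.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const batched = arena.allocator();
    _ = batched;
}
```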

ianprime0509 commented 1 year ago

I'm going to close this, since there is a reasonably good benchmarking setup present now, and I have improved performance considerably since this issue was opened. There is certainly more work to do on performance in general, though.