gimli-rs / addr2line

A cross-platform `addr2line` clone written in Rust, using `gimli`
https://docs.rs/addr2line
Apache License 2.0

Promote the speed of Gimli-based addr2line #308

Open marxin opened 1 week ago

marxin commented 1 week ago

First of all, thank you for the tool!

Let me explain how I discovered your tool. I was watching one of @jonhoo's videos about Inferno and I was curious how it works. I had a fuzzy memory (from the time I used FlameGraph) that the Perf scripts are slow, so I tried Inferno on 2 projects: mold and Mozilla Firefox. I collected a small profile for both of them and quickly realized that binutils' addr2line is really slow. My experiment just runs Firefox, opens https://html5test.com/ and finishes.

If I run perf report (or perf script), the loading in perf (mostly dominated by addr2line calls) takes more than 5 minutes. If I replace binutils' addr2line with your implementation, I get down to 10 seconds, which is an incredible improvement. Based on an strace profile, it seems addr2line is called about 20K times for Firefox's biggest library, libxul.so. If I take the first 1000 address queries and run them through the various addr2line implementations, I get the following numbers:

❯ hyperfine "cat /tmp/1000.txt | time /home/marxin/Programming/addr2line/target/release/addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null" "cat /tmp/1000.txt | time llvm-addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null" "cat /tmp/1000.txt | time /usr/bin/addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null" "cat /tmp/1000.txt | eu-addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null"
Benchmark 1: cat /tmp/1000.txt | time /home/marxin/Programming/addr2line/target/release/addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null
  Time (mean ± σ):      66.2 ms ±   2.6 ms    [User: 45.9 ms, System: 21.1 ms]
  Range (min … max):    61.5 ms …  72.5 ms    42 runs

Benchmark 2: cat /tmp/1000.txt | time llvm-addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null
  Time (mean ± σ):     218.2 ms ±   4.3 ms    [User: 192.2 ms, System: 26.8 ms]
  Range (min … max):   211.4 ms … 224.1 ms    13 runs

Benchmark 3: cat /tmp/1000.txt | time /usr/bin/addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null
  Time (mean ± σ):      2.181 s ±  0.071 s    [User: 1.947 s, System: 0.235 s]
  Range (min … max):    2.111 s …  2.351 s    10 runs

Benchmark 4: cat /tmp/1000.txt | eu-addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null
  Time (mean ± σ):      5.194 s ±  0.344 s    [User: 5.175 s, System: 0.017 s]
  Range (min … max):    4.776 s …  5.744 s    10 runs

Summary
  cat /tmp/1000.txt | time /home/marxin/Programming/addr2line/target/release/addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null ran
    3.29 ± 0.14 times faster than cat /tmp/1000.txt | time llvm-addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null
   32.94 ± 1.67 times faster than cat /tmp/1000.txt | time /usr/bin/addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null
   78.43 ± 6.03 times faster than cat /tmp/1000.txt | eu-addr2line -e /usr/lib/debug/usr/lib64/firefox/libxul.so.debug -aif >/dev/null
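The ratios in the Summary can be re-derived from the means and standard deviations reported above. The sketch below assumes the uncertainty comes from first-order error propagation for a quotient (relative deviations added in quadrature); that is my reading of the numbers, not something stated in the thread. The `Timing` type and `speedup` function are hypothetical names used only for illustration.

```rust
/// Mean and standard deviation of one benchmark, in milliseconds.
#[derive(Clone, Copy)]
struct Timing {
    mean_ms: f64,
    stddev_ms: f64,
}

/// Returns (ratio, uncertainty): how many times faster `fast` is than `slow`.
/// Assumption: relative standard deviations combine in quadrature.
fn speedup(fast: Timing, slow: Timing) -> (f64, f64) {
    let ratio = slow.mean_ms / fast.mean_ms;
    let rel_fast = fast.stddev_ms / fast.mean_ms;
    let rel_slow = slow.stddev_ms / slow.mean_ms;
    let uncertainty = ratio * (rel_fast * rel_fast + rel_slow * rel_slow).sqrt();
    (ratio, uncertainty)
}

fn main() {
    // Values copied from the benchmark output above (seconds converted to ms).
    let gimli = Timing { mean_ms: 66.2, stddev_ms: 2.6 };
    let others = [
        ("llvm-addr2line", Timing { mean_ms: 218.2, stddev_ms: 4.3 }),
        ("binutils addr2line", Timing { mean_ms: 2181.0, stddev_ms: 71.0 }),
        ("eu-addr2line", Timing { mean_ms: 5194.0, stddev_ms: 344.0 }),
    ];
    for &(name, timing) in &others {
        let (ratio, err) = speedup(gimli, timing);
        println!("Gimli-based addr2line is {:.2} ± {:.2} times faster than {}", ratio, err, name);
    }
}
```

Because the inputs here are the rounded values printed in the report, the results should land close to the Summary figures but may differ in the last digit.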

From a visual inspection of the output (I've only compared Gimli and Binutils), the results are pretty much the same! That being said, I think you should boldly promote your tool in the README.md file. Good job!

Note that the slowness of Binutils' addr2line has new implications for newer perf releases, as perf now applies a timeout, which can make the situation even worse: https://bugzilla.kernel.org/show_bug.cgi?id=218996.

philipc commented 1 week ago

Thanks for the feedback.

We used to do a benchmark comparison (there's still a benchmark script in the repository). However, keeping that up to date is additional maintenance effort that I didn't want to spend, and promoting old information feels wrong.

Another reason was that the bin application isn't as polished as it could be (although I have done some work on that recently).

marxin commented 1 week ago

Regarding the benchmarking effort: Is there anything I can help you with? It seems to me the existing Rust bench infrastructure already contains a way to get a random selection of addresses: https://github.com/gimli-rs/addr2line/blob/db88c5a0ed06f58acab05888acc1738877a221f3/benches/bench.rs#L50-L66

which can be easily transformed into something that would run various binaries for a selection of non-trivial ELF binaries with debug info. If you're willing to guide me, I can help you.
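As a rough idea of what such a harness could look like, here is a minimal sketch that pipes the same address list into several addr2line implementations and prints the wall-clock time of each. It is not based on the code in `benches/bench.rs`. The `time_command` helper is a hypothetical name, the tool names are assumed to be on `PATH`, and the `-e … -aif` flags are the ones used in the hyperfine run above.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};

/// Runs `program` with `args`, feeds `input` on stdin, discards stdout,
/// and returns the wall-clock time until the process exits.
fn time_command(program: &str, args: &[&str], input: &str) -> io::Result<Duration> {
    let start = Instant::now();
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::null()) // same effect as `>/dev/null` in the shell
        .spawn()?;
    {
        // Write all addresses; dropping the handle signals end-of-input.
        let mut stdin = child.stdin.take().expect("stdin was requested as piped");
        stdin.write_all(input.as_bytes())?;
    }
    let status = child.wait()?;
    if !status.success() {
        let msg = format!("{} exited with {}", program, status);
        return Err(io::Error::new(io::ErrorKind::Other, msg));
    }
    Ok(start.elapsed())
}

fn main() -> io::Result<()> {
    let mut cli = std::env::args().skip(1);
    let (debug_file, address_file) = match (cli.next(), cli.next()) {
        (Some(d), Some(a)) => (d, a),
        _ => {
            eprintln!("usage: <debug-file> <address-file>");
            return Ok(());
        }
    };
    let addresses = std::fs::read_to_string(&address_file)?;
    // Assumed tool names; adjust to whatever is installed locally.
    for tool in ["addr2line", "llvm-addr2line", "eu-addr2line"] {
        match time_command(tool, &["-e", debug_file.as_str(), "-aif"], &addresses) {
            Ok(elapsed) => println!("{}: {:?}", tool, elapsed),
            Err(err) => eprintln!("{}: skipped ({})", tool, err),
        }
    }
    Ok(())
}
```

A single run per tool is noisy, so a real comparison would still want repeated runs and warm-up, which is what hyperfine provides.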

What kind of polishing do you mean?

Generally speaking, for non-trivial binaries (medium to big applications), the performance is so far ahead of Binutils that I would not hesitate to mention it :tada:

mstange commented 6 days ago

@marxin You can also use `samply import perf.data` to look at the profile, and it might be even faster because it doesn't use a big text file as an intermediate. It uses this crate internally. https://github.com/mstange/samply

philipc commented 5 days ago

The main problem with running the benchmarks is having a reproducible environment containing recent versions of the tools. I don't think this is a hard problem, it just hasn't been a priority.

For polishing, the main thing lacking is locating the various places where the debug info needs to be loaded from. #296 and #298 contained some improvements for this, but there's still more to be done. There may be other minor things, e.g. some unwraps in the CLI code.

There's also wholesym-addr2line (part of samply) which probably does all the debug info locating correctly, and handles pdb as well.
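To illustrate what locating separate debug info involves, here is a minimal sketch of the GDB-style search convention (build-id directory, then debuglink name next to the binary, in a `.debug` subdirectory, and under a global debug directory). It is not the logic used by this crate or by wholesym-addr2line. The function name `debug_file_candidates` is hypothetical and the global directory `/usr/lib/debug` is an assumption that matches the path used in the benchmark earlier in this thread.

```rust
use std::path::{Path, PathBuf};

/// Lists candidate paths for a separate debug file, in search order.
/// `debuglink` is the file name from `.gnu_debuglink`; `build_id` is the
/// raw bytes of the GNU build-id note, if the binary has one.
fn debug_file_candidates(binary: &Path, debuglink: &str, build_id: Option<&[u8]>) -> Vec<PathBuf> {
    let global = Path::new("/usr/lib/debug"); // assumed global debug directory
    let mut candidates = Vec::new();

    // Build-id lookup: <global>/.build-id/ab/cdef....debug
    if let Some(id) = build_id {
        if id.len() > 1 {
            let hex: String = id.iter().map(|b| format!("{:02x}", b)).collect();
            candidates.push(
                global
                    .join(".build-id")
                    .join(&hex[..2])
                    .join(format!("{}.debug", &hex[2..])),
            );
        }
    }

    // Debuglink lookup relative to the directory containing the binary.
    if let Some(dir) = binary.parent() {
        candidates.push(dir.join(debuglink));
        candidates.push(dir.join(".debug").join(debuglink));
        // The global directory mirrors the binary's absolute directory.
        let relative = dir.strip_prefix("/").unwrap_or(dir);
        candidates.push(global.join(relative).join(debuglink));
    }
    candidates
}

fn main() {
    // Hypothetical install path for the library benchmarked above.
    let binary = Path::new("/usr/lib64/firefox/libxul.so");
    for path in debug_file_candidates(binary, "libxul.so.debug", None) {
        println!("{}", path.display());
    }
}
```

A complete implementation would also check that each candidate exists and verify the debuglink CRC or build-id before using it.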