llvm / llvm-zorg


[zorg][lldb] Add new LLDB metrics bot #278

Closed Michael137 closed 1 month ago

Michael137 commented 1 month ago

This patch adds a new job to collect LLDB metrics.

This is heavily based on the debuginfo-statistics job (but currently doesn't publish data to LNT).

Currently this job would do the following:

  1. Check out the Clang 19.x release and build it
  2. Use the LLDB and Clang from the lldb-cmake-intel job (not actually sure if that job publishes the right artifacts at the moment) to run the run_lldb_metrics.sh script.
  3. Said script attaches LLDB to Clang/LLDB, runs various commands, and then dumps the output of the statistics dump command to stdout (note we don't do any kind of averaging over multiple runs, since the metrics we care about should be stable across runs). We also currently run these test scenarios through hyperfine and dump the timing data, though perhaps that isn't necessary for a first attempt; a rough sketch of such a run is below.
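To illustrate, here is a minimal sketch of driving the host LLDB against the historic Clang and emitting metrics. The paths, the breakpoint, and the inferior invocation are placeholders, not what run_lldb_metrics.sh actually does:

```bash
#!/usr/bin/env bash
# Hypothetical sketch only: paths, breakpoint, and inferior arguments are
# placeholders, not the actual contents of run_lldb_metrics.sh.
set -euo pipefail

HOST_LLDB=/path/to/host-build/bin/lldb            # LLDB under measurement
HISTORIC_CLANG=/path/to/historic-build/bin/clang  # pinned Clang we debug

# Debug the historic Clang with the host LLDB, then print the metrics
# via LLDB's `statistics dump` command.
"$HOST_LLDB" --batch \
  -o "breakpoint set --name main" \
  -o "run" \
  -o "statistics dump" \
  -- "$HISTORIC_CLANG" -c /tmp/test.c -o /tmp/test.o

# Optionally time the same scenario with hyperfine and keep the JSON output.
hyperfine --warmup 1 --export-json lldb_timings.json \
  "$HOST_LLDB --batch -o 'breakpoint set --name main' -o 'run' -- $HISTORIC_CLANG -c /tmp/test.c -o /tmp/test.o"
```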
Michael137 commented 1 month ago

First stab at this. The plan is to collect metrics that relate to type completion, so we get insight into the impact of changes in that area. E.g.:

  1. How many types did we fully resolve?
  2. How many types did we keep as forward declarations?
  3. How many definitions did ASTImporter import?
  4. How many FindTypes/FindNamespace/FindFunctions calls did we perform?
  5. How many object files did we scan to find a type?

These don't all exist yet, but the idea is to add them to the statistics dump command.
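As a rough illustration of how such counters could be consumed once they exist (the JSON field names below are invented for illustration; they are not part of today's statistics dump output):

```bash
# Hypothetical post-processing: `statistics dump` prints JSON, so once
# counters like the above are added they could be extracted with jq.
# The field names here are invented; the sed filter is a stand-in for
# however the JSON blob gets separated from the rest of LLDB's output.
"$HOST_LLDB" --batch \
  -o "breakpoint set --name main" \
  -o "run" \
  -o "statistics dump" \
  -- "$HISTORIC_CLANG" -c /tmp/test.c -o /tmp/test.o \
  | sed -n '/^{/,/^}/p' \
  | jq '{fullyResolvedTypes, forwardDeclTypes, astImportedDefinitions}'
```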

Currently I'm also timing the test scenarios. That metric is presumably much less stable, and I wouldn't be opposed to removing it in the first iteration of this bot.

Any thoughts/concerns/wishlist items?

labath commented 1 month ago

I think this could be interesting. I don't have much to add in the way of specifics, just a couple of questions/observations:

Michael137 commented 1 month ago
  • (where) will we be able to see the results of these benchmarks?

Currently I just dump them to the console. The idea in the near future is to publish the data to something like LNT (though that currently seems to be down), and plot some sort of time series out of it.
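For reference, once the metrics are packaged into an LNT-format report, publishing would roughly be a single `lnt submit` call; the server URL, database, and suite below are placeholders, since no instance is set up yet:

```bash
# Hypothetical: submit a previously generated LNT report. report.json is a
# placeholder file in LNT's report format; the URL is not a real instance.
lnt submit "http://<lnt-server>/db_default/v4/<suite>/submitRun" report.json
```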

  • When benchmarking a debugger, there are many moving parts: a) the debugger itself; b) the code being debugged (inferior); c) the compiler compiling the inferior; and possibly d) the compiler compiling the debugger. Moving all four makes it hard to interpret the results. Based on the mentions of "historic compilers" in the patch, I'm deducing that you're trying to fix some of these, but I wasn't able to figure out which ones. Can you tell me which of these are fixed?

Good point, I'll try to clarify this in the pipeline definition.

(a) The "host" compiler/LLDB is taken from whatever the LLDB incremental job built/used (I haven't checked that those artifacts are available, but it would be nice if we could re-use them). The metrics we collect are from that "host" LLDB that we fetched.

(b) The debugger/compiler that we're debugging (in the HISTORIC_BUILD_DIR) is pinned to the llvm-19.x release (which seemed like a good starting point for something stable)

(c) We use the compiler from (a) to build the "historic" Clang/LLDB.

(d) The compiler compiling the debugger in (a) is the clang produced by the clang-stage2 buildbot

We could alternatively choose not to re-use the artifacts from other buildbots and instead build a brand new LLDB/Clang from top-of-tree using a pinned version of Clang. In that case (b) and (d) would be stable, while (a) and (c) followed top-of-tree. That does seem like a more maintainable situation (at the cost of rebuilding Clang/LLDB more often)
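To make (b)/(c) concrete, here is a sketch of how the pinned "historic" build could be configured. HOST_BUILD_DIR is a placeholder for wherever the fetched host toolchain lives; the actual pipeline may look different:

```bash
# Hypothetical sketch of the "historic" build: LLVM 19.x sources compiled
# with the Clang fetched in (a). HOST_BUILD_DIR is a placeholder name.
git clone --depth 1 --branch release/19.x \
  https://github.com/llvm/llvm-project.git historic-llvm

# RelWithDebInfo so the debuggee carries debug info for LLDB to chew on.
cmake -G Ninja -S historic-llvm/llvm -B "$HISTORIC_BUILD_DIR" \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DLLVM_ENABLE_PROJECTS="clang;lldb" \
  -DCMAKE_C_COMPILER="$HOST_BUILD_DIR/bin/clang" \
  -DCMAKE_CXX_COMPILER="$HOST_BUILD_DIR/bin/clang++"
ninja -C "$HISTORIC_BUILD_DIR" clang lldb
```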

labath commented 1 month ago
  • (where) will we be able to see the results of these benchmarks?

Currently I just dump them to the console. The idea in the near future is to publish the data to something like LNT (though that currently seems to be down), and plot some sort of time series out of it.

Got it. Thanks.

  • When benchmarking a debugger, there are many moving parts: a) the debugger itself; b) the code being debugged (inferior); c) the compiler compiling the inferior; and possibly d) the compiler compiling the debugger. Moving all four makes it hard to interpret the results. Based on the mentions of "historic compilers" in the patch, I'm deducing that you're trying to fix some of these, but I wasn't able to figure out which ones. Can you tell me which of these are fixed?

Good point, I'll try to clarify this in the pipeline definition.

(a) The "host" compiler/LLDB is taken from whatever the LLDB incremental job built/used (I haven't checked that those artifacts are available, but it would be nice if we could re-use them). The metrics we collect are from that "host" LLDB that we fetched.

One argument for not fetching those is that you might want to use different build options for each. E.g., the incremental build bot might want to enable assertions and the like, whereas the benchmarking bot might not.
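A small illustration of that point, with placeholder build directory names:

```bash
# Hypothetical: same sources, different goals. An incremental/testing bot
# typically wants assertions on; a benchmarking bot wants a configuration
# closer to what users actually run.
cmake -G Ninja -S llvm-project/llvm -B build-testing \
  -DLLVM_ENABLE_PROJECTS="clang;lldb" \
  -DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_ASSERTIONS=ON

cmake -G Ninja -S llvm-project/llvm -B build-benchmark \
  -DLLVM_ENABLE_PROJECTS="clang;lldb" \
  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=OFF
```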

(b) The debugger/compiler that we're debugging (in the HISTORIC_BUILD_DIR) is pinned to the llvm-19.x release (which seemed like a good starting point for something stable)

(c) We use the compiler from (a) to build the "historic" Clang/LLDB.

(d) The compiler compiling the debugger in (a) is the clang produced by the clang-stage2 buildbot

:+1:

We could alternatively choose not to re-use the artifacts from other buildbots and instead build a brand new LLDB/Clang from top-of-tree using a pinned version of Clang. In that case (b) and (d) would be stable, while (a) and (c) followed top-of-tree. That does seem like a more maintainable situation (at the cost of rebuilding Clang/LLDB more often)

I think both of these are reasonable choices, and it's up to you to choose which one makes the most sense for your use case. I'm interested in the details just so that I know how to interpret the results.