RaptorJIT+Studio: Analyzing the evolving C heap of a JIT compiler

Let me tell you about a ~~~cute hack~~~ long story for logging and making sense of diagnostic data from the RaptorJIT virtual machine. (Note: The pretty screenshots are at the bottom.)

RaptorJIT is a high-performance Lua virtual machine (LuaJIT fork) and it has to reconcile a couple of tricky requirements for diagnostics. On the one hand we need full diagnostic data to always be available in production (of course!) On the other hand production applications need to run at maximum speed and with absolute minimum latency. So how do we support both?

The approach taken here is to split the diagnostic work into two parts. The RaptorJIT virtual machine produces raw data as efficiently as possible and then separate tooling analyzes this data.

The virtual machine is kept as simple and efficient as possible: the logging needs to be enabled at all times and there can't be any measurable overhead (and certainly not any crashes.) The logging also needs to be comprehensive. We want to capture the loaded code, the JIT compilation attempts, the intermediate representations of generated code, and so on.

The analysis tooling then has to absorb all of the complexity. This is tolerable because it runs offline, out of harms way, and can be written in a relaxed high-level style. Accepting the complexity can be beneficial too: making the tooling understand internal data structures of the virtual machine makes it possible to invent new analysis to apply to existing data. That's a lot better than asking users, "Please take this updated virtual machine into production, make it crash, and send new logs."

Let's roll up our sleeves and look at how this works.

RaptorJIT

The RaptorJIT diagnostic data production is implemented in lj_auditlog.c. It's only about 100 LOC. It opens a binary log file and writes two kinds of message in msgpack format. (Aside: msgpack rocks.)

The first kind of log message is called memory. These message snapshot the contents of a raw piece of memory in the process address space. The log message is an array of bytes, the 64-bit starting address, and an optional "hint" to help with decoding. The application is responsible for logging each block of memory that the analysis tools will need.

The second kind of log message is called event. These messages show when something interesting has happened. The log message is an event name and other free-form attributes, including references to previously logged memory.

The same piece of memory can be logged many times to track its evolution. The memory references in event log messages are understood to refer to the memory at the time the event was logged. So when the tooling wants to "peek" a byte of process memory it will need to search backwards in the log starting from the event of interest. This way we can track the evolution of the process heap and allow the VM to reuse the same memory for different purposes e.g. reusing the same JIT datastructures to compile different code at different times.

Here is what some raw log looks like when decoded from binary msgpack into json:

$ msgpack2json -d -p -c -i audit.log
{
    "type": "memory",
    "hint": "GCstr",
    "address": 139675345683296,
    "data": <bin of size 32>
}
{
    "type": "memory",
    "hint": "GCproto",
    "address": 139675345683344,
    "data": <bin of size 168>
}
{
    "type": "event",
    "event": "new_prototype",
    "GCproto": 139675345683344
}

We can read this backwards:

There is an event of type new_prototype, which means that the virtual machine defined a new bytecode function. This event references a GCproto object at address 139675345683344 (0x7f08b35d0390).
There is a 168-byte block of memory at address 139675345683344 logged. This is the address referenced in the event. The contents is the raw C datatype struct GCproto which includes the bytecode, the debug info to resolve source line numbers, etc. It also references the name of the source file that the bytecode was loaded from, which is a Lua string object stored elsewhere in memory.
Finally another object is stored: it's the 32-byte GCstr object containing the name of the source file. The address of this object is 139675345683296 (0x7f08b35d0390) and this happens to be referenced by the previous GCproto object (you can't see the address in the log because it's inside the <bin of size 168>.)

Half-mission accomplished! The RaptorJIT virtual machine is now exposing its raw state to the outside world very efficiently, and the code is so simple that we can be confident about putting it into production.

Studio

The second part of the problem is to extract high-level information from the logs. We are not interested in reading hex dumps! We want the tooling to present really high-level information about which code has been JITed, how the compiled code has been optimized, which compilation attempts failed and why, and which code is hot in the profiler, and so on.

We solve this problem using Studio, which is "an extensible debugger for the data produced by complex applications." Studio is the perfect fit for this application - as it should be, since this problem was the motivation for creating the Studio project :-).

We take the direct "brute force" approach. This is conceptually like reading a coredump into gdb and writing macros to inspect it, but Studio means the tools will be written in Pharo Smalltalk with any awkward chores offloaded with Nix scripts.

Here is the plan of attack:

Read RaptorJIT DWARF metadata to understand the memory layout of native C objects.
Decode application types (GCproto, GCstr, etc) into higher-level Smalltalk objects.
Extend the Glamorous Inspector framework to interactively browse our objects.
Use the Agile Visualization framework to visualize the more complex objects.

Let's do this!

Read RaptorJIT DWARF metadata

Looking at DWARF for the first time, several things are immediately apparent:

The DWARF format is elaborate and arcane.
We wouldn't want to touch the libdwarf C library with a ten foot pole.
None of the dwarf2foo utilities on the internet seem to really work.

This is great news: it means that we are perfectly justified in cheating. (The alternative would be to become DWARF experts, but what we are really trying to do here is develop a JIT compiler, remember?)

Cheating is easy with Nix. Nix provides "dependency heaven." We can write simple scripts, we can use arbitrary versions of random utility programs, and we can be confident that everything will work the same way every time.

We create a Nix API with an elf2json function that converts a messy ELF file (produced by clang/gcc during RaptorJIT compilation) into a simple JSON description of what we care about, which are the definitions of types and #define macros and so on.

The Nix code works in three steps:

Use readelf (a standard utility) to dump the DWARF info as text.
Use dwarf2yaml.awk (a ~20 LOC script) to convert the text into well-formed YAML.
Use yaml2json.py (a 2-line Python script) to convert the YAML into JSON.

(Why YAML in the middle? Just because it's easier than JSON to generate from awk.)

Sounds horrible, right? Wrong! Nix has stone-cold control of all of these dependencies. Each run will produce exactly the expected results, using exactly the same versions of readelf/awk/python/etc. They will even built with the exact same gcc version, linked with the exact same libc version, etc. If we decide to update our dependencies in the future we can easily debug regressions too (lukego/blog#17). Throwing in new dependencies is painless with Nix.

(Nix is a big deal. Check it out over at InfoQ if you haven't already.)

Decode application types

Now we want to read the preprocessed JSON DWARF metadata and use it to make plain old objects out of the auditlog. There is no magic here: we write Smalltalk code to do exactly that!

This is not rocket science but it does take a bit of typing. The good news is that we can reuse the DWARF support code in the future to decode other programs compiled with C toolchains.

Extend the Glamorous Toolkit

Now the fun starts. It only takes a page or two of code to teach the graphical GTInspector how to display and navigate through the C objects in the log. Here is what that looks like (excerpted from the Studio Manual):

dwarf

This is nifty: now we can clickety-click out way around to see what data we have. The representation is low-level but it does have access to all the C type definitions, typedef names, enum and #define values, and so on. It makes a nice bottom layer to build on top of.

Visualize more complex objects

Now the pressure is on: we need to actually present some useful high-level information! This turns out to be pretty fun and easy using the high-level frameworks that Pharo provides. We can whip up step-by-step multi-panel navigation flows, we can present objects visually, and we can interactively "drill down" on everything using buttons / clicks / mouseovers / etc.

Here is one example view: browsing profiler data to see which JIT code is "hot" and then visualizing the way this code was compiled. The graph shows the compiled Intermediate Representation instructions arranged using their SSA references (data dependencies.)

studio-raptorjit

The objects that we see on the screen are all backed by Smalltalk objects that are initialized from the log, each object can be viewed in multiple different ways, and we can navigate the links between objects interactively. It's really fun to click around in :-).

The end

So! We wanted the RaptorJIT VM to efficiently log raw diagnostic data, and we wanted to create convenient developer tools for out-of-harms-way offline analysis. We have done both. Problem solved!

If you find this kind of hacking interesting then consider Watching the RaptorJIT and Studio repositories on Github. The projects are new - especially Studio - so don't be shy to ask questions with Issues!

lukego / blog