jit: Transition from linear to more effective form

qwe661234 commented 1 year ago

The chained block structure used by both interpreter and tier-1 compiler is linear, with each block pointing only to the subsequent block. Enhancing a block to reference its previous block brings significant value, especially for hot spot profiling. This advancement paves the way for developing a graph-based intermediate representation (IR). In this IR, graph edges symbolize use-define chains. Rather than working on a two-tiered Control-Flow Graph (CFG) comprising basic blocks (tier 1) and instructions (tier 2), analyses and transformations will directly interact with and modify this use-def information in a streamlined, single-tiered graph structure.

The sfuzz project employs a custom intermediate representation. The initial step in the actual code generation process involves lifting the entire function into this intermediate representation. During the initialization phase, when the target is first loaded, the size of the function is determined. This is achieved by parsing the elf metadata and creating a hashmap that maps function start addresses to their respective sizes.

The IR-lifting process iterates through the original instructions and generates an IR instruction for each original instruction using a large switch statement. The following example illustrates what the intermediate representation might resemble for a very minimal function that essentially performs a branch operation based on a comparison in the first block.

Reference: A Simple Graph-Based Intermediate Representation

jserv commented 11 months ago

After the merge of tier-1 JIT compiler, it is time to revisit our IR.

jserv commented 11 months ago

Modern CPUs invest substantial effort in predicting these indirect branches, but the Branch Target Buffer (BTB) has its limitations in size. Eliminating any form of indirect call or jump, including those through dispatch tables, is greatly beneficial. This is because contemporary CPUs are equipped with large reorder buffers that can process extensive code efficiently, provided branch prediction is effective. However, in larger programs with widespread use of indirect jumps, optimal branch prediction becomes increasingly challenging.

jserv commented 9 months ago

FEX is an advanced x86 emulation frontend, crafted to facilitate the running of x86 and x86-64 binaries on Arm64 platforms, comparable to qemu-user. At the heart of FEX's emulation capability is the FEXCore, which employs an SSA-based Intermediate Representation (IR) crafted from the input x86-64 assembly. Working with SSA is particularly advantageous during the translation of x86-64 code to IR, throughout the optimization stages with custom passes, and when transferring the IR to FEX's CPU backends.

Key aspects of FEX's emulation IR include:

Precisely Defined IR Variable Sizes: It accommodates standard element sizes (1, 2, 4, 8 bytes, and certain 16-byte operations), as well as a flexible number of vector elements, distinguishing between float and integer operations based on the operation type.
Distinct Scalar and Vector IR Operations: Operations are clearly differentiated, such as scalar multiplication (MUL) vs. vector multiplication (VMUL).
Dedicated Load/Store Context IR Operations: These operations facilitate a clear distinction between guest memory and the monitored x86-64 state.
Specific CPUID IR Operation: Enables the return of complex data (data across four registers) and simplifies optimization for constant CPUID functions, allowing for further constant propagation.
Explicit Syscall Operation: Similar to the CPUID operation, this feature allows for efficient direct calls to the syscall handler by enabling constant propagation, reducing call overheads.
Branching Support within the IR: Includes conditional branching that either proceeds to the targeted branch or continues to the next block, and unconditional branching to jump directly to a specified block, aiming to align with LLVM semantics for block limitations without strict enforcement.
Debug Print Operation: For outputting values during debugging sessions.
Explicit Memory Access IR Operations: Designed for guest memory access, performing address translation into the virtual machine's memory space by adding the VM memory base to the 64-bit address. This approach allows for potential escape from the VM and is not deemed safe without JIT validation of the memory region for access correctness.

These features underscore FEX's design philosophy, emphasizing precise control, optimization flexibility, and efficient translation mechanisms within its emulation environment.

Reference: FEXCore IR

jserv commented 7 months ago

The Java HotSpot Server Compiler (C2) utilizes a Sea-of-Nodes IR form designed for high performance with minimal overhead, similar to LLVM's approach with its control flow graph (CFG). However, in textual IR presentations, the CFG is not depicted as a traditional 'graph' but rather through labels and jumps that effectively outline the graph's edges. Like C2’s IR, the Sea-of-Nodes IR can be described in a linear textual format and only visually represented as a "graph" when loaded into memory. This allows for flexibility in handling nodes without control dependencies, known as "floating nodes," which can be placed in any basic block in the textual format and reassigned in memory to maintain their floating characteristic.

While the current tier-2 JIT compiler, built with LLVM, offers aggressive optimizations, it is also resource-intensive, consuming considerable memory and prolonging compilation times. An alternative, the IR Framework, emerges as a viable option that enhances performance while minimizing memory usage. This framework not only defines an IR but also offers a streamlined API for IR construction, coupled with algorithms for optimization, scheduling, register allocation, and code generation. The code generated in-memory can be executed directly, potentially increasing efficiency.

The Ideal Graph Visualizer (IGV) is a tool designed for developers to analyze and troubleshoot performance issues by examining compilation graphs. It specifically focuses on IR graphs, which serve as a language-independent bridge between the source code and the machine code generated by compilers.

jserv commented 3 months ago

Inspired by rvdbt, we may adopt its QuickIR, a lightweight non-SSA internal representation used by the QMC compiler. QuickIR interacts with both local and global states; the former represents optimized temporaries, while the latter includes the emulated CPU state and any internal data structures attached to CPUState, a concept common to many emulators. The terms local and global also extend to control flow, where global branch instructions gbr and gbrind manage branches that escape the current translation region. If a particular instruction or its slowpath cannot be represented in QuickIR, a special hcall might be used to invoke a pre-registered guest runtime stub. These stubs are also generated from interpreter handlers, making it straightforward to extend the translated ISA without mandatory frontend support for new instructions.

QuickIR sample (1) - single basic block

00018c40:  slli   s2, s2, 8
00018c44:  or     s2, s2, s3
00018c48:  addi   s3, zr, 61
00018c4c:  jal    zr, 12

bb.0: succs[ ] preds[ ]
        #0 sll [@s2|i32] [@s2|i32] [$8|i32]
        #1 or  [@s2|i32] [@s2|i32] [@s3|i32]
        #3 mov [@s3|i32] [$3d|i32]
        #4 gbr [$18c58|i32]

QuickIR sample (2) - conditional branch representation

00018fc0:  lw     a5, a1, 0
00018fc4:  sw     zr, a1, 4
00018fc8:  addi   a6, a1, 0
00018fcc:  beq    a5, zr, 132

bb.0: succs[ 2 1 ] preds[ ]
    #0 mov [g:80|i32] [$18fc0|i32]
    #1 vmload:i32:s [@a5|i32] [@a1|i32]
        #2 mov [g:80|i32] [$18fc4|i32]
        #3 add [%32|i32] [@a1|i32] [$4|i32]
        #4 vmstore:i32:u [%32|i32] [$0|i32]
        #6 mov [@a6|i32] [@a1|i32]
        #9 brcc:eq [@a5|i32] [$0|i32]
bb.1: succs[ ] preds[ 0 ]
        #7 gbr [$18fd0|i32]
bb.2: succs[ ] preds[ 0 ]
        #8 gbr [$19050|i32]

g:80 - program counter location in global CPUState, manually flushed by frontend before translating "unsafe" vmload instruction
%32 - temporary local register, frontend may emit an arbitrary number of locals in a single region

jserv commented 2 months ago

sovietov_graph_irs_2023.pdf Slides from a talk "Graph-Based Intermediate Representations: An Overview and Perspectives". Great information starting from linear code to dataflow IR.

sysprog21 / rv32emu

jit: Transition from linear to more effective form #238