The prototype of the fast tier-1 JIT compiler (#283) introduces an advanced sampling-based profiler and a tier-up optimization strategy. The two fundamental components for this are: 1) a multi-tiered recompilation system that is well-balanced, and 2) a profiling system that is both reliable and has low overhead for enhancing performance.

Initially, all RISC-V instructions are processed by an optimizing interpreter, with each block having a counter for tracking function calls and loop iterations, starting from a preset threshold. The interpreter's detection of a loop's backward branch involves loop iteration count assessment and counter decrement adjustment based on this count. Should the iteration count be substantial, JIT compilation proceeds immediately, bypassing the need for the counter to deplete fully.

 ______     ____________      ___________      ___________      _________
|      |   |            |    |           |    |           |    |         |
| RV32 |   | Control    |    | LLVM IR   |    | Optimized |    | Machine |
| IR   |-->| Flow Graph |-*->| (initial) |-*->| LLVM IR   |-*->| code    |
|______|   |____________| |  |___________| |  |___________| |  |_________|
                          |               |                |
                          | Link          | Optimize       | JIT compile
                    ______|_______     ___|____          __|____
                   |              |   |        |        |       |
                   |              |   | LLVM   |        | LLVM  |
                   | LLVM Bitcode |   | Passes |        | MCJIT |
                   |______________|   |________|        |_______|

The tier-2 optimizing JIT compiler further enhances optimizations, including aggressive block inlining, an expanded range of dataflow optimizations, and superior code quality through an additional code generation pass. These optimizations span various techniques like copy propagation, array bound check elimination, and dead code elimination. Additional optimizations at this level encompass escape analysis, code scheduling, and DAG-based optimizations like loop versioning and pre-pass code scheduling, with an increased maximum iteration count for dataflow-based optimizations. This sophisticated approach ensures that blocks needing the most optimization receive it timely and effectively, leveraging a tiered system for optimal performance.

Quote from Design and Evaluation of Dynamic Optimizations for a Java Just-In-Time Compiler:

Our system employs a dataflow-based intra- method region selection algorithm that uses both static heuristics and dynamic profiles, and then integrates the algorithm into the inlining process to extract effective inter- procedural regions. Our empirical evaluation demonstrates that our system can improve the performance and reduce the compilation overhead significantly.
The third IR is called a directed acyclic graph (DAG). This is also a register-based representation and is in the form of static single assignment (SSA). ... This simply indicates how the DAG representation looks and ignores all of the exception checks necessary for the array accesses within the loop. The actual DAG includes the nodes for exception checking instructions and the edges representing exception dependencies.
To minimize the bottom-line overhead, this profiler focuses only on detecting hot methods and does not collect other information such as the call context of the given method. Instead, additional information is collected by a different profiler for only selected hot methods.

The development of the Tier-2 JIT compiler will utilize LLVM, specifically leveraging the ORC lazy JIT feature provided by LLVM. This approach will harness LLVM's capabilities for efficient and dynamic JIT compilation. Sample code:

Reference:

Improving V8 regular expressions

I have successfully built a simple sum program using the LLVM-C API in two ways. The first approach involves building LLVM IR through the API and utilizing LLVMLinkInMCJIT. To adopt this method, we need to translate our IR into LLVM IR, but we can save the overhead of converting IR into C file. The second approach is to build LLVM IR using some tools, then reading the LLVM IR and using LLVMLinkInMCJIT. However, I haven't found any LLVM front-end API to convert C into LLVM IR, I just used clang -S -emit-llvm currently.

RISC-V INSN IR -> LLVM IR -> JIT
RISC-V INSN IR -> C -> LLVM_IR -> JIT

In the context of tiered compilation, the tier-1 JIT compiler (T1C) focuses on rapid code generation while omitting certain extensions like RV32A and RV32F. Conversely, the tier-2 JIT compiler (T2C) aims to handle all execution paths and bolster optimizations, utilizing the LLVM compilation framework. This allows for the translation of code from src/rv32_template.c into corresponding LLVM code. As an illustration, consider the example below:

void addi(LLVMBuilderRef *builder,
          LLVMTypeRef *param_types UNUSED,
          LLVMValueRef start,
          rv_insn_t *ir)
{
    LLVMValueRef rs1_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, X) / sizeof(int) + ir->rs1, true)};
    LLVMValueRef addr_rs1 =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              rs1_offset, 1, "addr_rs1");
    LLVMValueRef rd_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, X) / sizeof(int) + ir->rd, true)};
    LLVMValueRef addr_rd =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              rd_offset, 1, "addr_rd");
    LLVMValueRef val_rs1 =
        LLVMBuildLoad2(*builder, LLVMInt32Type(), addr_rs1, "val_rs1");
    LLVMValueRef res = LLVMBuildAdd(
        *builder, val_rs1, LLVMConstInt(LLVMInt32Type(), ir->imm, true), "add");
    LLVMBuildStore(*builder, res, addr_rd);
}

The LLVM code provided can be viewed as a specialized version of the ADDI instruction semantics, which are outlined in src/rv32_template.c. This is exemplified in the RVOP macro, where the operation ADDI is defined to update a register rv->X[ir->rd] with the sum of rv->X[ir->rs1] and an immediate value ir->imm.

Then, let's check the LLVM code for JALR:

void jalr(LLVMBuilderRef *builder,
          LLVMTypeRef *param_types UNUSED,
          LLVMValueRef start,
          rv_insn_t *ir)
{
    if (ir->rd) {
        LLVMValueRef rd_offset[1] = {
            LLVMConstInt(LLVMInt32Type(),
                         offsetof(riscv_t, X) / sizeof(int) + ir->rd, true)};
        LLVMValueRef addr_rd = LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(),
                                                     LLVMGetParam(start, 0),
                                                     rd_offset, 1, "addr_rd");
        LLVMBuildStore(
            *builder, LLVMConstInt(LLVMInt32Type(), ir->pc + 4, true), addr_rd);
    }
    LLVMValueRef rs1_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, X) / sizeof(int) + ir->rs1, true)};
    LLVMValueRef addr_rs1 =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              rs1_offset, 1, "addr_rs1");
    LLVMValueRef val_rs1 =
        LLVMBuildLoad2(*builder, LLVMInt32Type(), addr_rs1, "val_rs1");
    LLVMValueRef res1 = LLVMBuildAdd(
        *builder, val_rs1, LLVMConstInt(LLVMInt32Type(), ir->imm, true), "add");
    LLVMValueRef res2 = LLVMBuildAnd(
        *builder, res1, LLVMConstInt(LLVMInt32Type(), ~1U, true), "and");
    LLVMValueRef PC_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, PC) / sizeof(int), true)};
    LLVMValueRef addr_PC =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              PC_offset, 1, "addr_PC");
    LLVMBuildStore(*builder, res2, addr_PC);
    LLVMBuildRetVoid(*builder);
}

It can also be viewed as a specialized version of the JALR instruction semantics:

RVOP(
    jalr,
    {
        const uint32_t pc = PC;
        /* jump */
        PC = (rv->X[ir->rs1] + ir->imm) & ~1U;
        /* link */
        if (ir->rd)
            rv->X[ir->rd] = pc + 4;
        rv->csr_cycle = cycle;
        rv->PC = PC;
        return true;
    },

Based on the insights gathered, I recommend using pycparser (or another appropriate C parsing package for Python) to parse the code in src/rv32_template.c. This method will facilitate the generation of LLVM-specific C code, which will serve as the foundation for the T2C. Such an approach offers the advantage of maintaining a single source for RISC-V semantics and enables the automatic generation of tiered compilation with aggressive optimizations.

Some projects using pycparser:

lc3-cc: check compile.py
Cwerg: check FrontEndC/canonicalize.py
absentee

In essence, the first Python script, tools/gen-jit-template.py, is designed to create a rapid code generator for the T1C. Meanwhile, the proposed Python script tools/gen-jit-llvm.py is tasked with generating C code utilizing LLVM-C APIs. This forms the core of the T2C, which is not only capable of advanced optimizations but also supports peephole superoptimizers (#282). The outcome is significantly improved machine code tailored to specific platforms.

Rellume is a lifter designed to convert x86-64/AArch64/RISC-V64 machine code into LLVM IR, with a strong emphasis on the performance of the resultant code. The generated LLVM IR is both recompilable and executable, offering the potential to match or even exceed the original machine code's performance. Rellume meticulously models SIMD instructions and pointers, ensuring the optimizer can generate efficient code. Capable of working with specified instructions or automatically decoding control flow, it creates LLVM-IR functions that accurately reflect the original semantics. We can consider to adopt the approach of Rellume and Instrew for converting RISC-V instructions to LLVM-IR, leveraging compiler optimizations in the process.

I have implemented a tier-2 just-in-time compiler through two approaches:

RISC-V INSN IR -> C code -> clang
RISC-V INSN IR -> LLVM IR -> LLVM JIT Based on the performance analysis (shorter is better), the performance of approach 2 is better than approach 1 in most cases. However, the performance of approach 2 is slower than approach 1 in FPEMULATION and SHA test cases.

There are several optimization strategies mentioned in Optimizing Performance Using Dynamic Code Generation that we can apply to rv32emu.

For tier-1 JIT compiler

Function Representation

A function is decoded into superblocks, which have a single entry point, but can have several exits. This is intended to allow for a code representation close to the original machine code layout and reduces the overall number of blocks. To simplify code generation, superblocks can be chained to indicate a preferred ordering and enable the omission of a terminating branch instruction by falling through to the next block. The superblocks of a function represent its CFG. Within a superblock, instructions are stored in a target-dependent format, closely following the instruction and register representation of the underlying architecture. Instructions can have a predicate for conditional execution, which is used on x86-64 for conditional jumps and moves. Drob also supports handling unmodeled instructions, for which no semantics have been specified: such an instruction is conservatively assumed to read and write all registers and modify arbitrary memory locations. Consequently, in this rather conservative rewriting mode, the optimization scope is severely limited.

Register Liveness Analysis

For several optimization passes, information about the usage of written registers is helpful, allowing Drob to detect unused effects of instructions, which can therefore be ignored, to identify entirely unused instruction sequences, which can be removed, and to find cyclic chains of unused values. Starting at the return instructions, information about used registers is propagated backwards through the instructions of all superblocks of the function until the liveness data no longer changes. For each instruction, Drob stores the set of registers which is potentially used afterwards. When further information about conditional execution or eventually written registers is available (e.g., from a previous analysis run propagating constant values), such information is incorporated as well.

For Tier2 JIT compiler

LLVM_IR based optimization

For the main optimization, the generic -O3 pass pipeline is used, extended with only three new passes specifically tailored to machine code rewriting. The generic pass sequence provided by LLVM includes passes for instruction combination, inlining, vectorization, loop unrolling (see, e.g., Figures 6.5a and 6.5b) and other (non-trivial) loop transformations, and propagating memory accesses to SSA registers where safely possible (see, e.g., Figure 6.5d). The following passes are specifically designed for DBLL.

Improved Alias Analysis
Constant Memory Regions
Folding Integer-Pointer Casts

Client–Server Architecture

This thesis provides a framework of client-server architecture, this design is similar to background compilation.

Next, we will make several further improvements:

Tier-1 JIT compiler

Implement register allocation on generated machine code .
Finding an approach to detect function.

Tier-2 JIT compiler

Trace the ELF file and generate machine code with larger range, instead of chained block collection.
Add background thread and event queue.

Based on the experimental results (see DBLL) of Optimizing Performance Using Dynamic Code Generation, the indirect jump instruction is also a target that needs to be optimized.

Ideas for Improvements.

Translation caching

Translations are cached, so that if the same block of code requires executing again, it does not need to be retranslated. The translations are stored in a 16MB cache, and pointers to them are stored in a map. Given the value of the guest instruction pointer and some flags, the cached translation to be executed (if it exists) can easily be found. This lookup is accelerated further by the use of a translation lookaside buffer (TLB), which stores pointers to the most recently used translations. If the cache becomes full, it is cleared completely.

Translations in the cache may become invalid due to the guest code on which they are based being overwritten. In order to ensure that translations are removed from the cache when this occurs, an array is maintained storing the state of each page of guest memory. If a memory write is performed to a page which contains translations, the function InvalidateCodeTranslations is called. This performs one of two actions - either it invalidates all translations from the given page, or it only invalidates those which intersect the exact memory range being written. Which occurs depends on the frequency of writes to the given page, since if many writes occur, invalidating the whole page will mean that in future, InvalidateCodeTranslations need not be called.

The translations are dependent on several states within the guest processor, for example, the default stack pointer size (16- or 32- bit). If one of these states changes, care must be taken to avoid using a translation compiled when the states were different. Since these states must be constant throughout a translation, any instructions which change these states are not added to a translation - they are emulated using interpretation.

Control transfer

Control transfer instructions necessitate a shift in execution from one translated basic block to another. When reaching the end of a basic block concluding with a control transfer instruction, the process transitions from the translated block to a helper function named JumpToNextTranslation. This function determines the next basic block for execution by consulting the current processor state and instruction pointer via a TLB. It then leaps directly to the corresponding translation if available. In cases where the translation is absent, the process reverts to the emulator to carry out the required translation.

An optimization is viable for control transfer instructions that consistently branch to a specific location, such as direct jumps and calls. Here, JumpToNextTranslationWithPatch, another helper function, is employed instead of JumpToNextTranslation. This function not only locates and jumps to the subsequent translation but also modifies the original translation to directly jump to the next translation in future executions. This modification bypasses the need for additional helper function calls. However, this patching is restricted to jumps within a single page of guest memory. This constraint exists because the destination of a direct jump can be altered to a different page by modifying the page tables. If JumpToNextTranslationWithPatch encounters a jump leading to a different page, it cannot be patched for a direct jump. Instead, it is patched to invoke JumpToNextTranslation, thus eliminating the inefficiency of a futile patching attempt in subsequent executions.

conditional jumps

Conditional jumps are processed in a manner akin to unconditional jumps. They are transformed into two separate unconditional jumps: one is executed when the condition is met, and the other when the condition is not met. These two jumps can be patched independently of each other.

Exceptions

In some cases, guest code is designed to trigger an exception, which needs to be accurately represented in the host code. This is achieved through two methods.

Firstly, certain exceptions, like page faults, are intrinsic to the guest code and must be emulated in the translations. Here, the translation includes a deliberate check for conditions that would lead to an exception. If these conditions are met, the translation logs details about the exception and returns control to the emulator immediately, without completing the rest of the translation.

Secondly, for exceptions that occur less frequently, such as divide-by-zero errors, adding a preemptive check for every division instruction would be inefficient. Instead, the division is allowed to proceed without checking the divisor. To handle any resulting divide-by-zero exceptions on the host, all translation execution is enclosed within a structured exception handling (try-catch) block.

Exceptions can interrupt the execution of a translation before it concludes. Therefore, the guest CPU state must be constantly current wherever an exception might occur. For instance, merely updating the guest’s instruction pointer at the end of a translation is inadequate, as this update might not be reached. However, updating it after every guest instruction translation would degrade performance, particularly since some guest instructions correspond to a single host instruction. A balance is struck by updating the instruction pointer only before translating a guest instruction that could potentially cause an exception. This approach still necessitates frequent updates to the instruction pointer, as any memory-accessing instruction might trigger an exception, like a page fault.

Memory access

In situations where the guest code performs memory access, the translation incorporates a call to one of many helper functions tailored to various types of memory access, sizes, and processor modes. An example of such a function is WriteDwordUserMode, which is among the more complex memory access functions.

WriteDwordUserMode is crafted in assembly to maximize speed. Its initial step is to verify that the memory access doesn't span across a page boundary. If it does, the access is divided into smaller segments. The function then utilizes a TLB to convert virtual addresses to physical addresses. Along with the physical address, the TLB entry also includes information about the status of the memory page. In scenarios where the TLB lookup is unsuccessful, or if the physical page isn't a straightforward, writable memory page (for instance, if it’s used for memory-mapped I/O or contains translations), additional functions are deployed to manage these complexities. However, if the TLB lookup is successful, the function proceeds to write the data to the guest memory, allowing the emulation to continue. WriteDwordUserMode efficiently handles this straightforward case with just a few host instructions.

It is important to note that, for efficiency, the function does not verify whether the segment limit or access rights are appropriate for the memory access being executed.

For tier-1 JIT compiler: Register Liveness Analysis For several optimization passes, information about the usage of written registers is helpful, allowing Drob to detect unused effects of instructions, which can therefore be ignored, to identify entirely unused instruction sequences, which can be removed, and to find cyclic chains of unused values. Starting at the return instructions, information about used registers is propagated backwards through the instructions of all superblocks of the function until the liveness data no longer changes. For each instruction, Drob stores the set of registers which is potentially used afterwards. When further information about conditional execution or eventually written registers is available (e.g., from a previous analysis run propagating constant values), such information is incorporated as well.

341 focuses on tracking register contents within a basic block -- a linear sequence of instructions -- allowing for the efficient reuse of variables and constants directly from registers.

After implemented register allocation for T1C and adding some LLVM opimization passes mentioned in Optimizing Performance Using Dynamic Code Generation for T2C, The performance of rv32emu with a tiered framework(T1C & T2C) surpasses that of QEMU in all scenarios except for miniz(very close).

With Chrome 114 is the start of Google beginning to roll-out Maglev as their new mid-tier compiler for further enhancing the JavaScript browser performance.

See early Maglev design document.

FEX is a sophisticated x86 emulation frontend designed to enable the execution of x86(-64) binaries on Arm64 hosts, akin to qemu-user. It introduces several innovative concepts worth exploring:

Eliminating Dead Stores: The necessity for dead store elimination arises from LLVM's occasional inability to remove unnecessary load and store operations, particularly evident in flag calculations. This becomes increasingly relevant with the introduction of the IRJIT, highlighting issues related to context-specific load and store operations.
Removing Unused Flags: The x86-64 ISA often generates numerous flags with most instructions, necessitating the removal of superfluous flag calculations that are overwritten before use. This redundancy, which often requires more effort than the operations themselves, could be mitigated by isolating flags in separate memory locations or introducing operations specifically for flag management.
Dead Code Removal: Frequently, generated code becomes obsolete immediately after its creation, especially noticeable when flag calculation is optimized away. This includes code for x86-64 instructions that output to multiple registers, only for subsequent instructions to overwrite them. Optimizations here depend on tracking the usage of data between loading and storing operations, critical for operations like multiply and divide which x86 performs with high precision.
ABI Register Elimination Pass: This optimization significantly reduces the workload for the JIT compiler by removing unnecessary stores to registers considered temporary by x86-64 ABI standards when an entire function block has been translated. Recognizing these as redundant allows for their removal, facilitating further optimizations.
Coalescing Load-Store Operations: A considerable portion of x86-64 instructions involves sequential loading or storing of registers from the context. By combining these into paired load-store operations, performance can be enhanced.
Function-Level Optimization Pass: With a complete recompilation of a function, additional optimizations become feasible, such as eliminating all final flag stores and unnecessary mid-function load-stores to the context. Optimizing for direct register mapping throughout the function eliminates the need for intermediate load-stores, streamlining execution.

These strategies underscore FEX's approach to enhancing emulation efficiency and performance through targeted optimizations.

Reference: FEXCore IR Optimization passes

After improving indirect jump for T1C, The performance of rv32emu with a tiered framework(T1C & T2C) is very close to that of QEMU and even surpasses in some cases.

QEMU vs T1 vs T2 vs T1 & T2
QEMU vs rv32emu-tiered(T1 & T2)

We still need more analysis and improvement about miniz, FP EMULATION, and bitfield benchmarks.

Here, we also compare the hardware events:

iTLB-load-misses
cache-misses
page-faults between QEMU and rv32emu.

Based on the analysis below, fast interpreter & T1C use less memory to perform dynamic binary translation than QEMU does. However, T2C uses large amount of memory to generate machine code because of the LLVM backend.