rune-rs / rune

An embeddable dynamic programming language for Rust.
https://rune-rs.github.io
Apache License 2.0

feature: support monomorphization and jitting #237

Open tgolsson opened 3 years ago

tgolsson commented 3 years ago

This is a topic I've been thinking about for a while based on the following observations:

Thus, I'd like to propose a secondary compilation stage (or maybe deferring primary compilation). In this new mode of operation, every function call becomes a generic function over the input types. Invocation leads to monomorphization over all input value types, including the receiver.

Thus, code along the lines of


fn add(a, b) {
    a + b
}

fn main() {
    let value1 = add(2, 3); // case A
    let value2 = add(2.0, 3); // case B
    let value3 = add(4, 5); // case A, reuse compilation artifacts
}

would compile the following code for Case A:

fn add<Integer, Integer>(a: Integer, b: Integer) -> Integer {
    Inst::IntOpPlus();
} 

and the call would jump straight to this function, so interpreter execution continues branch-free. For case B, the current runtime error would instead surface as a compilation/monomorphization error, and the final call would reuse the artifacts from case A. This'd apply recursively, so monomorphized functions would end up calling other monomorphized, direct-jump functions. This'd work for both native and Rune function calls.

There are some challenges here that I foresee, which overlap with (and are partly complicated by) an SSA IR. The biggest challenge by far is that not all code paths are created equal during invocation. Code that doesn't execute cannot be monomorphized, so we'll need to incrementally compile it as more code paths are explored. A function can therefore exist in three states: raw, partially monomorphized, or fully monomorphized, and whether these states are explicit or implicit we need to deal with them.

I still think this'd open the door for a lot of other optimizations when no type information/function hashes/... need to be looked up at runtime. Like SSA, it will likely be more impactful for a register machine than a stack machine, but either approach should benefit. I think that straight out of the box we'd see improvements for points 2 and 3 above, while point 1 likely requires a bit more thought.

I tried implementing this as part of the current instruction set, tracing during execution, but it became very complex to manage in-place modification. I think a more tractable approach would be a streaming compiler from IR to a modified set of runtime instructions, or streaming to a lower-level instruction set. This'd require some funky tricks to switch to interpretation when one hits an unknown branch and then insert those branches back, but it seems tractable.

This is mostly just a brain dump at the moment to get some initial thoughts while I'm reading papers on JIT-compilation.

tgolsson commented 3 years ago

There are essentially two major flavours of JIT compilers: tracing and method-based.

Tracing JIT compilers record the operations involved in a whole call chain and compile those, while method-based ones (which I'm more interested in) treat each method as a separate unit of compilation. The latter approach gives more consistent results with better code reuse, but leaves a lot of performance on the table, as tracing is /essentially/ a full inlining of the call chain. The drawback is that the worst case for tracing is... completely un-jitted code, while method-based approaches can blend much more easily.

I'll try to update this post with interesting topics:

https://carolchen.me/blog/jits-impls/ - also lots of links at the end

tgolsson commented 3 years ago

@udoprog asked me on Discord to write a bit to clarify why I believe in doing SSA during JIT and not before, and my motivation is that value-induced type changes require either code generation or sub-procedure monomorphization. As an example, this is the kind of code (that you shouldn't write) which will cause headaches for method-based monomorphization:


fn do_something(use_vecdeque, use_float) {
    let container = if use_vecdeque { 
        VecDeque::new() 
    } else { 
        Vec::new() 
    };
    for idx in 0..100 {
        container.push(if use_float { idx.to_float() } else { idx });
    }
    let v = container.iter().sum();
    let is_success = check_value(v);
    if is_success {
        println!("woop!");
    } else {
        panic!("too bad");
    }
}

In the above cursed code, container has the type OneOf<VecDeque, Vec> and v has the type OneOf<float, integer>. If one wants to avoid as many runtime lookups as possible, the function needs to be rewritten to remove these.

A fairly naive and simple method is to identify spans with type inconsistencies and fold them into the producing branch. In case no such branch exists (or maybe always?), we produce a type-checking branch. This can be seen as a marking algorithm where we accumulate "types" onto the interpreter data structure, and then modify it to ensure each item in the structure has a single color.

fn do_something(use_vecdeque, use_float) {
    // unpack Union<Vec, VecDeque>
    let v = if use_vecdeque { 
        let container = VecDeque::new();

        for idx in 0..100 {
            container.push(if use_float { idx.to_float() } else { idx });
        }
        container.iter().sum()
    } else { 
        let container = Vec::new();

        for idx in 0..100 {
            container.push(if use_float { idx.to_float() } else { idx });
        } 
        container.iter().sum()
    };
    // unpack Union<integer, float>
    let is_success = if v.type == integer {
        check_value::<integer>(v.force_cast::<integer>())
    } else {
        check_value::<float>(v.force_cast::<float>())
    };
    };
    if is_success {
        println!("woop!");
    } else {
        panic!("too bad");
    }
}

I deliberately chose the second example here so there is no clear branch point where we can do rewrites. I'd argue that the two main ingredients for efficiently doing these kinds of transformations are liveness analysis (to track when type mismatches occur) and graph rewrites (to do the transformations necessary to resolve the mismatch).

Another approach which might be viable is instead to lean more heavily into the monomorphization:

fn __anonymous_x123123(container, use_float) {
    for idx in 0..100 {
        container.push(if use_float { idx.to_float() } else { idx });
    }
    container.iter().sum()
}

fn do_something(use_vecdeque, use_float) {
    let container = if use_vecdeque { 
        VecDeque::new() 
    } else { 
        Vec::new() 
    };
    let v = __anonymous_x123123(container, use_float); // monomorphize over lifetime of container
    let is_success = check_value(v); // monomorphize over lifetime of v
    if is_success {
        println!("woop!");
    } else {
        panic!("too bad");
    }
}

In this case I've converted a piece of code into a function, which again seems like a much higher-level operation than what SSA cares about.

This'll rely on function calls being fast, but that is... hopefully one of the outcomes of this work. If we can make them really fast, one could just split the function at each type mismatch and monomorphize the rest.

Of course, we could also just explode the moment we hit a type inconsistency. I'm... not sure that's a nice way to do it in a language without compile-time type checks, but it's the easiest option. In that case, however, any IR is going to work.