Bytecodes - Githubissues

schungx commented 4 years ago

I was bored yesterday so toyed around implementing a bytecodes compiler for Rhai.

This is a second attempt taking experience from @gmorenz's PR https://github.com/jonathandturner/rhai/pull/147

Those interested can surf to this branch: https://github.com/schungx/rhai/blob/bytecodes2/src/byte_codes.rs

It contains the compiler part only. No evaluation engine, which shouldn't be difficult once the op codes are frozen.

As I discover, in a similar manner to @gmorenz, assignments to deep object/index chains are particularly hairy. That's because the need to have some way to, sort of, push a mutable pointer to data onto the evaluation stack, which is going to make the borrow checker rather unhappy if the data pointed to resides on the stack...

On the other hand, my trials seem to point out that, even for non-trivial expressions or scripts, they compile to bytecodes (sans the interned strings dictionary) that easily fit into one or two cache lines. Most bytecodes I've tried are <128. Quite complex expressions actually compile into very short bytecode sequences (easily fits into one single cache line).

@GopherJ I'm not sure whether this may cause Rhai to execute faster for expressions-heavy usage.

On the other hand, I'm not quite sure whether statements are worth it, as they need to handle things like pushing and poping stack frames (e.g. for return, throw), unrolling loops (e.g. for break, continue), etc.

Maybe we can have a hybrid approach -- expressions are compiled into bytecodes, while statements are interpreted.

jhwgh1968 commented 4 years ago

Since I see some work is being done on this, I thought I would throw in my 2 cents.

Before I settled on what contribution to Rhai I should focus on, I considered doing this. My approach was similar to the former Psyco project in Python:

Create a transform from a compiled script to Bytecode, which was JSON. (Python already had this.)
Create an abstraction for a "call" that could surround a compiled script or other object. (Python already had this.)
When bytecode was loaded, JIT it into a block of assembly with a library like cranelift.
Create a new implementation of "call" for "JITed object".

The place I got stuck was not actually in the bytecode instruction definition. If you can think about memory management in assembly (a skill my C background gave me, but I know not everyone has), that is actually pretty straightforward.

Instead, it was how to deal with external functions and Dynamics containing Box<T> where T was opaque to Rhai. I understood the code less well, and it was somewhat different then.

schungx commented 4 years ago

Your idea of JIT-ing a Rhai function as a compilation unit is sound, especially since all Rhai functions are pure. That should make it real easy to JIT.

Actually building the bytecodes is not difficult... took just around one afternoon. I also toyed with the idea of compiling to an existing bytecode definition and using an existing VM instead. So that's not the tough part. It is probably tougher to decide on which public bytecode/VM to compile to (some people may say WASM).

As I find out, the hairy part is working out the indexing and property chains, as Rhai doesn't have a GC. In C, you can simply do array indexing math (since C arrays are simply offsets from pointers); it is the same in Rust, but unfortunately, while in C it is very easy to push a memory pointer (which is a simple machine word) onto a custom stack, in Rust you have to push a reference and the borrow checker fights you every step of the way. I basically can't make it work without going massively unsafe.

Now I think I understand why people are not writing full O/S in Rust yet.

As for dealing with Box<T> where T can be any type, it is actually less complicated that you think. In Rust, a Box<T> is *T in C, i.e. a simple pointer to a heap-allocated T. And you can treat it as such (or even cast it into an unsafe pointer *const T).

Box<dyn Any> or Box<dyn Variant> takes up two pointers, one to the heap location and one to a vtable. In reality, the JIT-ed code doesn't need to know about them. It can merrily just pass them along to any Rust function expecting them. I don't think the bytecodes will deal with anything that is an external Rust type... all those will be dealt with by registered functions.

jhwgh1968 commented 4 years ago

I basically can't make it work without going massively unsafe.

Since any JIT like this is (from Rust's perspective) massively unsafe, I presume there will be some of that no matter what.

I was going to avoid a GC by basically designing the JIT to be like Forth, with an "infinite" stack for data that can be rearranged as desired, and no heap. All operations and function calls use the Top N operands, and return any value they produce to the top of it.

Everything that is not a simple data type Cranelift can manipulate with direct machine instructions (e.g. ints and floats) would be pointers round-tripped through Box<T>. When data was given to the interpreter, it would "own" it, and it would always transfer ownership to Rust and back again.

The actual trouble I had with opaque types was my JSON formatted byte code. Because when it was loaded, how would I know whether the T type that was referred to as "already being in memory" matched what the interpreter currently has? That turned out to be quite messy.

schungx commented 4 years ago

Everything that is not a simple data type Cranelift can manipulate with direct machine instructions (e.g. ints and floats) would be pointers round-tripped through Box<T>. When data was given to the interpreter, it would "own" it, and it would always transfer ownership to Rust and back again.

That's not the difficult part. It is indexing that is the problem.

Try to think about how the following should be translated into bytecodes without GC and pointers:

let x = [ [ [ [ 1 ], 2 ], 3 ], 4 ];

x[0][0][0][0] = 42;

How is this assignment going to be represented as bytecodes without copying all intermediate data structures over and over again. And do it without the borrow check complaining.

A language without arrays/hashes should translate trivially. A language where nested arrays/objects are pointers to separate GC-collected arrays/objects also translate trivially.

jhwgh1968 commented 4 years ago

Here is how my VM would do that:

native call args=0 fn=array_create
native call args=0 fn=array_create
native call args=0 fn=array_create
native call args=0 fn=array_create
native call args=0 fn=array_create
push int 1
native call args=2 fn=array_push
push int 2
native call args=2 fn=array_push
push int 3
native call args=2 fn=array_push
push int 4
native call args=2 fn=array_push
# x is now the top of the stack

push int 0
native call args=2 fn=array_get # returns a pointer to the value
push int 0
native call args=2 fn=array_get
push int 0
native call args=2 fn=array_get
push int 0
push int 42
native call args=3 fn=array_set

How is this assignment going to be represented as bytecodes without copying all intermediate data structures over and over again? And do it without the borrow check complaining?

By copying heap pointers over and over again, if a complex value is being stored. The actual data is not moved from the heap, where it was originally boxed up on the Rust side.

Namely, the stack would consist of:

union StackItem {
    heap_or_stored: u8,
    stored: Dynamic,
    heap: *mut Dynamic
}

If we assume that it is an enum rather than a union (just because I don't remember the syntax for unions), the code for array_create would be something like this:

fn array_create(stack: Vec<StackItem>, _args: usize) -> Result<(), EvalAltResult> {
    let new_array = Box::new(Dynamic::new_list());
    stack.push(StackItem::Heap(Box::into_raw(new_array));
    Ok(())
}

And the code for array_push would look something like this:

fn array_push(stack: Vec<StackItem>, args: usize) -> Result<(), EvalAltResult> {
    let mut stack_frame = stack.drain(stack.len()-args..)
        .map(|item_or_ptr| {
            match item_or_ptr {
                StackItem::Stored(dynamic_value) => dynamic_value,
                StackItem::Heap(ptr) => Dynamic::from_boxed(unsafe { Box::from_raw(ptr) }),
            }
        });
    let array: Box<rhai::Array> = stack_frame.next().unwrap().into_boxed::<rhai::Array>();
    for arg in stack_frame {
        (*array).push(arg);
    } 
    stack.push(StackItem::Heap(Box::into_raw(array)))
    Ok(())
}

Only pointers and integers are moved, and the guarantees of Box -- that it is uniquely the holder of the data -- are maintained.

EDIT: fixed an extra re-boxing that would have been more data copying than intended.

EDIT 2: clarity on StackItem

EDIT 3: forgot about the last line

jhwgh1968 commented 4 years ago

And to be clear: there are hazards in this code, from Rust's point of view, because there are multiple pointers to the same data.

But if each data structure owns its data, then those extra pointers should only appear on the stack. That means the hazard is one the JIT compiler can simply avoid through never generating instructions that would misuse them. A "mini borrow checker", perhaps, that always enforces move semantics.

jhwgh1968 commented 4 years ago

And, finally: yes, the drain iterator in array_push does move the Dynamics off of the stack and into the collection. However, those Dynamics should themselves contain a Box<T> for complex values, so the data is not actually moved.

schungx commented 4 years ago

And to be clear: there are hazards in this code, from Rust's point of view, because there are multiple pointers to the same data.

But if each data structure owns its data, then those extra pointers should only appear on the stack. That means the hazard is one the JIT compiler can simply avoid through never generating instructions that would misuse them. A "mini borrow checker", perhaps, that always enforces move semantics.

Yes, I an in sync with you. My point about the need to be massively unsafe also.

Question: Do you think we should just compile to cranelift bytecodes? Or should we look into some other bytecodes format? Or even WASM?

jhwgh1968 commented 4 years ago

Question: Do you think we should just compile to cranelift bytecodes? Or should we look into some other bytecodes format? Or even WASM?

I think it is worth having a custom bytecode/IR format (like Python does) as a target for AST translation.

My mind went to JIT simply because there are some really good JITs these days, and the last thing I'd want is Rhai to end up significantly slower because it became an interpreted language.

When thinking about an eventual JIT, I thought of cranelift first for two basic reasons:

I think Cranelift would be easier to integrate than many other JITs. That includes LLVM, which would be the easiest path to something like WASM, to my knowledge.
I think designing a bytecode format with Cranelift as the intended target -- whether it is ever used or not -- would guide development toward good and performant design.

To expand on bullet 2 a bit:

I have only played with Cranelift a little, but I am impressed with its API. It's both ergonomic, and makes you design what you intend to feed into it in some interesting ways.

Not only is it simply and strongly typed (rather like LLVM), but every basic block can define mandatory input parameters. Because of this, every exit from a block must be explicit, point only to the beginning of another block, and pass them in.

Aside from encouraging easy-to-read patterns and making things simpler, this makes it easy to JIT code that looks like an interpreter's input: a big chain of "macro instructions", each one basic block with similar APIs, that tail call each other until the final exit.

At the same time, Cranelift's IR is not too high level. It provides "IR instructions" that are useful for things like math and array access, but also very easily turn into a small handful of machine instructions. This encourages pushing such things very close to the machine, and getting the performance you would expect a JIT to provide.

Anyway, that's my sales pitch. :smile: The point is, I think having an independent bytecode format -- containing whatever "macro instructions" make sense for the language -- comes first.

schungx commented 4 years ago

I have no experience with Cranelift, but I can give it a read...

If the Cranelift IR format is not too far off from our AST structure, then it might be simpler just to compile to that. Otherwise, we'll have yet another pass later on to compile Customer Bytecode -> Cranelift IR.

schungx commented 4 years ago

After reading into Cranelift a bit, I agree that we cannot use its bytecodes natively. It is more for a JIT. Rhai is much more high level.

Since Rhai integrates so tightly with Rust functions, if we use Cranelift, we'll need to compile those function also, otherwise the JIT-ted code needs to constantly call out to Rust (not sure if that's easy to do yet)...

jhwgh1968 commented 4 years ago

I was presuming that some of the core Rust functions could be converted into cranelift IR as well, but I agree that would tie it to Cranelift specifically.

Perhaps it's not worth it at this time. It was just an idea to keep in mind for later.

schungx commented 4 years ago

I'm thinking of a way to do this in Rust without being massively unsafe...

It seems that there is only one data structure in Rust that allows multiple mutable references to it plus self references -- and that's a stack frame. So to satisfy Rust's borrow checker, some bytecode execution needs to be recursive.

schungx commented 4 years ago

Revisiting this recently...

I am thinking, really, is there a need or even benefit to implement a bytecode VM for Rhai? Seems like Rhai currently runs OK fast (alright, not LuaJIT-fast, but benchmarking somewhere between Wren and Python)... for most purposes, as long as you don't run Rhai in a tight loop like in a game engine, it is plenty fast enough.

That is because of some design decisions in the architecture of Rhai:

1) Scope variables are kept together in a Vec instead of allocated on the heap and, importantly, not reference counted or locked (unless it is shared, that is). Reading and writing is extremely fast. 2) Most variable access is via pre-calculated offsets instead of lookup by name, making it extremely fast. 3) Most function calls are via pre-calculated hashes instead of lookup by name, making it extremely fast. 4) Functions are deliberately detached and kept pure, so there is no scope chain to search through, making variable access extremely fast. 5) Closures are implemented as automatic currying so there is no additional scope (heap allocated) to carry around, making them fast.

To me, a bytecode VM will bring the most benefit by flattening the current AST, which is allocated on the heap, into a tight byte stream which fits in a few cache lines. However, I am not sure how much speed up that's going to bring to a typical script.

Other than this, I am not seeing how bytecodes will majorly benefit the execution speed of Rhai.

I have an idea of compiling expressions into bytecode fragments (because right now the deepest part of an AST tree is to encode expressions). That may flatten most of the AST into a bunch of short byte streams. Whether this will yield significant speed improvements to make it worthwhile, however, I am not quire sure.

rhaiscript / rhai

Bytecodes #161