soundandform / m3

A high performance WebAssembly interpreter in C. Moved here:
https://github.com/wasm3/wasm3
MIT License

This repository is deprecated. Development has moved here:

https://github.com/wasm3/wasm3

M3/Wasm

This is a work-in-progress WebAssembly interpreter written in C using a novel, high performance interpreter topology. The interpreter strategy (named M3) was developed prior to this particular Wasm project and is described in more detail below.

Purpose

I don't know. I just woke up one day and started hacking this out after realizing my interpreter was well suited for the Wasm bytecode structure. Some ideas:

Current Status

Its foundation is solid, but the edges are still quite rough. Many of the WebAssembly opcodes still lack implementations, the compilation logic is a tad unfinished, and most execution trap cases are unhandled.

Benchmarks

It's at least fleshed out enough to run some simple benchmarks.

Tested on a 4 GHz Intel Core i7 iMac (Retina 5K, 27-inch, Late 2014). M3 was compiled with Clang at -Os; the C benchmarks were compiled with GCC at -O3.

Mandelbrot

C code lifted from: https://github.com/ColinEberhardt/wasm-mandelbrot

Interpreter    Execution Time    Relative to GCC    Notes
Life           547 s             133 x              https://github.com/perlin-network/life
Lua            122 s             30 x               see note below
M3             17.9 s            4.4 x              3.7 x on my 2016 MacBook Pro
GCC            4.1 s             1 x

The Lua row isn't Lua running some weird Wasm transcoding; it's a manual Lua conversion of the C benchmark, included as an additional reference point.

CRC32

Interpreter    Execution Time    Relative to GCC
Life           153 s             254 x
M3             5.1 s             8.5 x
GCC            600 ms            1 x

In general, the M3 strategy seems to execute code around 4-15X slower than natively compiled code on a modern x86 processor. (Older CPUs don't fare as well; I suspect branch predictor differences.) I have yet to test it on anything ARM.

Building

There's only an Xcode project file currently.

M3: Massey Meta Machine

Over the years, I've mucked around with creating my own personal programming language. It's called Gestalt. The as-yet-unreleased repository will live here: https://github.com/soundandform/gestalt

Early on I decided I needed an efficient interpreter to achieve the instant-feedback, live-coding environment I desire. Deep traditional compilation is too slow and totally unnecessary during development. And, most importantly, compilation latency destroys creative flow.

I briefly considered retooling something extant. The Lua virtual machine, one of the faster interpreters, is too Lua-centric. And everything else is too slow.

I've also always felt that the "spin in a loop around a giant switch statement" thing most interpreters do is clumsy and boring. My intuition said there was something more elegant to be found.
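
For contrast, a conventional dispatch loop looks roughly like this (an illustrative sketch, not code from any particular interpreter; the opcodes are made up):

#include <stdint.h>
#include <stddef.h>

enum { OP_ADD, OP_HALT };                            // a couple of made-up opcodes

void Interpret (const uint8_t * code, int64_t * stack, size_t sp)
{
    for (size_t pc = 0;;)
    {
        switch (code [pc++])                         // decode one opcode per loop iteration
        {
            case OP_ADD:                             // pop two operands, push their sum
                stack [sp - 2] += stack [sp - 1];
                --sp;
                break;

            case OP_HALT:
                return;

            // ... and so on, one case per opcode ...
        }
    }
}

Every opcode executed goes back through the same loop and the same hard-to-predict indirect branch at the switch.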

The structure that emerged I named a "meta machine" since it mirrors the execution of compiled code much more closely than the switch-based virtual machine.

I set out with a goal of approximating Lua's performance. But this interpreter appears to beat Lua by 3X and more on basic benchmarks -- whatever they're worth!

How it works

This rough information might not be immediately intelligible without referencing the source code.

Reduce bytecode decoding overhead

Tightly Chained Operations

The End Result

Since operations all share a standardized signature and their arguments are passed along to the next operation via tail calls, the M3 "virtual" machine registers end up mapping directly to real CPU registers. It's not virtual! Instead, it's a meta machine with very low execution impedance.

M3 Register                      x86 Register
program counter (pc)             rdi
stack pointer (sp)               rsi
linear memory (mem)              rdx
integer register (r0)            rcx
floating-point register (fp0)    xmm0
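
A rough C sketch of such an operation might look like the following (the helper types and the NEXT_OP macro are illustrative, not taken from the actual source). Each operation pre-decodes its immediates from the code stream at pc, does its small piece of work, and then tail-calls the operation pointer stored in the next slot; because every operation shares one signature, an optimizing compiler can turn the whole chain of calls into plain jumps.

#include <stdint.h>

typedef int64_t   reg_t;
typedef uint64_t  slot_t;      // one 64-bit slot of compiled M3 code: an operation pointer or an immediate

typedef reg_t (* op_t) (const slot_t * pc, reg_t * sp, uint8_t * mem, reg_t r0, double fp0);

// every operation shares this exact signature, so pc, sp, mem, r0 and fp0
// stay pinned in the same CPU registers across the whole chain of tail calls
#define NEXT_OP(pc, sp, mem, r0, fp0) \
    return ((op_t) (pc) [1]) ((pc) + 2, (sp), (mem), (r0), (fp0))

reg_t op_u64_Or_sr (const slot_t * pc, reg_t * sp, uint8_t * mem, reg_t r0, double fp0)
{
    int32_t offset = (int32_t) pc [0];    // immediate: the operand's stack slot, decoded once at compile time
    r0 |= sp [offset];                    // the actual work: or the stack operand into r0

    NEXT_OP (pc, sp, mem, r0, fp0);       // tail call the next operation in the code stream
}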

For example, here's a bitwise-or operation in M3, as compiled on x86.

m3`op_u64_Or_sr:
    0x1000062c0 <+0>:  movslq (%rdi), %rax             ; load operand stack offset  
    0x1000062c3 <+3>:  orq    (%rsi,%rax,8), %rcx      ; or r0 with stack operand
    0x1000062c7 <+7>:  movq   0x8(%rdi), %rax          ; fetch next operation
    0x1000062cb <+11>: addq   $0x10, %rdi              ; increment program counter
    0x1000062cf <+15>: jmpq   *%rax                    ; jump to next operation

Registers and Operational Complexity

Stack Usage

Thoughts & Questions about WebAssembly

There are some really fundamental things about WebAssembly that I don't understand yet. There are a bunch of mysteries about this thing that don't seem sufficiently explained in plain English anywhere. Can someone orient me?

Some examples:

The M3 strategy for other languages (is rad)

The Gestalt M3 interpreter works slightly differently than this Wasm version. With Gestalt, blocks of all kinds (if/else/try), not just loops, unwind the native stack. (This somewhat degrades raw x86 performance.)

But this adds a really beautiful property to the interpreter: the lexical scoping of a block in the language source code maps directly into the interpreter. All opcodes/operations end up having an optional prologue/epilogue structure. This made things like reference-counting objects in Gestalt effortless. Without this property, the compiler itself would have to track scope and insert dereference opcodes explicitly. Instead, the "CreateObject" operation is also the "DestroyObject" operation on the exit/return pathway.

Here's some pseudocode to make this more concrete:

return_t Operation_NewObject (registers...)
{
   Object o = runtime->CreateObject (registers...);

   * stack [index] = o;

   return_t r = CallNextOperation (registers...);  // executes to the end of the scope/block/curly-brace & returns

   if (o->ReferenceCount () == 0)
       runtime->DestroyObject (registers..., o);   // calls o's destructor and frees memory

   return r;
}

Likewise, a "defer" function (like in Go) becomes absolutely effortless to implement. Exceptions (try/catch) as well.
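
For example, a deferred call could be expressed in the same pseudocode style (Operation_Defer, Closure, and CallClosure are made-up names for illustration):

return_t Operation_Defer (registers...)
{
   Closure deferred = * stack [index];              // the closure the user handed to "defer"

   return_t r = CallNextOperation (registers...);   // run the rest of the enclosing scope/block

   runtime->CallClosure (registers..., deferred);   // fires on the way back out, even on an early return

   return r;
}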