remko / waforth

Small but complete dynamic Forth Interpreter/Compiler for and in WebAssembly
https://mko.re/waforth
MIT License

TOS caching #49

Closed · jacereda closed this 2 years ago

jacereda commented 2 years ago

Hi,

I was quite surprised to see such a difference with gforth. My intuition is that it is due to the lack of TOS caching. Have you considered that approach?

remko commented 2 years ago

Hi @jacereda

Where did you see the difference? Did you do your own measurements, or are you referring to the toy tests I once did? If the latter, I wouldn't trust those results any more.

That said, my gut feeling is that Gforth should be faster. The subroutine threading with indirect calls used in WAForth is probably the biggest overhead, although Gforth may also have more control over other things, such as where to keep the TOS.

I don't know what you mean concretely by TOS caching. At the time of those benchmarks, the TOS was stored as an (unexported) mutable global variable. This meant that changing the TOS from within a compiled word required a call into the main module, which is obviously expensive.

Switching to exported mutable globals (so the global TOS variable could be manipulated from within a compiled word) didn't help much, and even caused significant slowdowns in some browsers (maybe due to some protection mechanism). I then changed the design to replace the TOS global variable with a first parameter that is threaded through every call and returned by every call. This also allowed me to inline all pushes and pops, which gave a total speedup of 43% in my very simple benchmark.

I haven't had the time to figure out how to inspect the machine code generated by the WASM engines. Ideally, the TOS would be in a register most of the time, but I would be surprised if that were the case. I think that by threading the TOS as a first parameter, I put it as close to the execution path as possible without having any control over where variables are located, but I can only guess.

I'm planning to have an optional (external) postprocessor that combines all WAForth-compiled WASM modules into one big module, and replaces indirect calls with direct calls. This would get rid of all the overhead due to indirect calls, but it will probably still not be as efficient as direct threading.

jacereda commented 2 years ago

The benchmark I saw was reposted today on HN: https://news.ycombinator.com/item?id=32517274

By TOS caching I mean something like this:

OVER:             /* ( a b -- a b a ) */
  *--dsp = tos;   /* spill the cached TOS to the in-memory stack */
  tos = dsp[1];   /* the new TOS is the item below it */
  NEXT;

DUP:              /* ( a -- a a ) */
  *--dsp = tos;   /* push a copy; the cached TOS is unchanged */
  NEXT;

That is, instead of just using a data stack pointer, you also keep a top-of-stack register holding the topmost item.

Also, have you considered subroutine threading? I guess WASM would probably prefer that to direct/indirect threading.

remko commented 2 years ago

Yeah, those benchmarks are outdated.

In WebAssembly, you don't have any control over registers.

As mentioned in the Design document, subroutine threading is the only alternative in WebAssembly; direct/indirect threading aren't possible due to WebAssembly's design. You could implement static dispatch to built-in words using something like br_table, but even if that turned out to be more efficient (which would depend on the JIT compiler), it still wouldn't work for compiled words, and you'd end up with two execution mechanisms (which is too much complexity for me).

jacereda commented 2 years ago

You don't need control over the registers to get something like that. Each word can receive a pair of values (TOS and DSP) and return both as well. With the C ABI this usually works quite well on most architectures except x86/x86-64, where it involves some register moves because the register calling scheme is quite stupid. That is, on most RISC machines, if your word takes a struct containing TOS/DSP and returns a struct with the updated values, those will probably end up in %r0 and %r1. On x64, the input struct will probably end up in %ecx/%edx and must be returned in %eax/%ebx IIRC. That leads to some register shuffling, but if WASM doesn't adhere to the C ABI and has sane calling conventions, this approach could be really fast.

remko commented 2 years ago

I believe that's more or less what happens today (not at the time of the benchmarks).

For example, this is the implementation of the DROP word:

(func $DROP (param $tos i32) (result i32)
    ;; pop one 4-byte cell by moving the stack-top pointer down
    (i32.sub (local.get $tos) (i32.const 4)))

The TOS is passed as a parameter, and the new TOS is returned as a return value. All words have the same signature.

A word defined in terms of other words looks like this:

(func $2@  (param $tos i32) (result i32)  ;; ( a-addr -- x1 x2 )
    (local.get $tos)
    (call $DUP)
    (call $CELL+)
    (call $@)
    (call $SWAP)
    (call $@))

(Because WebAssembly's computational model is built around an implicit operand stack, the TOS can be implicitly threaded through all calls). Note that the above is a word defined in the core. Compiled words will use call_indirect, which is the cause of much overhead.