go-interpreter / wagon

wagon, a WebAssembly-based Go interpreter, for Go.

BSD 3-Clause "New" or "Revised" License

amd64 compiler optimizations: Improves speed by 30%. #120

Closed twitchyliquid64 closed 5 years ago

twitchyliquid64 commented 5 years ago

Pairs of instructions that can be expressed as a single amd64 instruction (e.g. basic operations that accept an immediate value as an operand) are rewritten into that form. For example, an i64.Const followed by an i64.Add produces a single instruction: ADDQ <register>, <immediate value>.
Opportunistically keep track of the size of the stack in a register rather than always reading it from memory (the value is flushed back to memory in the postamble if necessary).
Opportunistically cache a pointer to the stack and/or local backing array in a register rather than always reading it from memory. As this value is immutable, we don't need to flush it back in the postamble.
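
The first optimization above can be sketched as a small peephole pass. This is only an illustration under stated assumptions: the `Op`, `Instr`, and `Lower` names are hypothetical and are not wagon's real types, the output is pseudo-assembly text rather than machine code, and the 32-bit range check reflects the fact that the immediate form of ADDQ/SUBQ takes a sign-extended 32-bit value (whether the PR performs the same check is not stated in the thread).

```go
package main

import (
	"fmt"
	"math"
)

// Op is a hypothetical, simplified opcode set used only for this sketch.
type Op int

const (
	I64Const Op = iota
	I64Add
	I64Sub
	I64Mul
)

// Instr is one stack-machine instruction; Imm is only meaningful for I64Const.
type Instr struct {
	Op  Op
	Imm int64
}

// immForm lists the operations this sketch assumes can take an immediate operand.
var immForm = map[Op]string{I64Add: "ADDQ", I64Sub: "SUBQ"}

// fitsInt32 reports whether v can be encoded as a sign-extended 32-bit immediate.
func fitsInt32(v int64) bool {
	return v >= math.MinInt32 && v <= math.MaxInt32
}

// lowerOne describes the unfused lowering of a single instruction.
func lowerOne(in Instr) string {
	switch in.Op {
	case I64Const:
		return fmt.Sprintf("push %d", in.Imm)
	case I64Add:
		return "pop b; pop a; push a+b"
	case I64Sub:
		return "pop b; pop a; push a-b"
	default:
		return "pop b; pop a; push a*b"
	}
}

// Lower walks the instruction stream and fuses a constant followed by a
// fusable operation into one immediate-form pseudo-instruction.
func Lower(code []Instr) []string {
	out := make([]string, 0, len(code))
	for i := 0; i < len(code); i++ {
		in := code[i]
		if in.Op == I64Const && fitsInt32(in.Imm) && i+1 < len(code) {
			if mnemonic, ok := immForm[code[i+1].Op]; ok {
				// The constant is the right-hand operand, so it becomes the immediate.
				out = append(out, fmt.Sprintf("%s <top of stack>, %d", mnemonic, in.Imm))
				i++ // skip the operation that was folded into this instruction
				continue
			}
		}
		out = append(out, lowerOne(in))
	}
	return out
}

func main() {
	code := []Instr{
		{I64Const, 7},
		{I64Const, 5}, {I64Add, 0}, // fused
		{I64Const, 1 << 40}, {I64Add, 0}, // too wide for an immediate, not fused
	}
	for _, line := range Lower(code) {
		fmt.Println(line)
	}
}
```

In this model the fused pair avoids one push and one pop of the value stack, which is where a saving of this kind would come from when intermediates live in memory.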

codecov-io commented 5 years ago

Codecov Report

Merging #120 into master will increase coverage by 0.04%. The diff coverage is 75.75%.

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   65.98%   66.02%   +0.04%     
==========================================
  Files          41       41              
  Lines        4066     4130      +64     
==========================================
+ Hits         2683     2727      +44     
- Misses       1116     1133      +17     
- Partials      267      270       +3

| Impacted Files | Coverage Δ |
|---|---|
| exec/internal/compile/backend_amd64.go | `77.9% <75.75%> (-1.57%)` ↓ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update eda5438...64664a9.

sbinet commented 5 years ago

ah, also: could you put the before/after speed improvements in one of the commit messages? (probably using the benchmarks and golang.org/x/perf/cmd/benchstat)
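
As a rough illustration of the workflow suggested above, a Go benchmark might look like the sketch below. `sumLoop` and `BenchmarkSumLoop` are hypothetical stand-ins, not wagon's actual benchmarks. In a real package the benchmark would live in a `_test.go` file; one would run `go test -bench=. -count=10` before and after the change, save each output to a file, and compare the two with `benchstat old.txt new.txt`. Here `testing.Benchmark` is called from `main` only so that the sketch is self-contained.

```go
package main

import (
	"fmt"
	"testing"
)

// sumLoop is a hypothetical workload standing in for the code under test.
func sumLoop(n int64) int64 {
	var acc int64
	for i := int64(0); i < n; i++ {
		acc += i
	}
	return acc
}

// sink keeps the result alive so the compiler cannot discard the call.
var sink int64

// BenchmarkSumLoop times b.N runs of the workload.
func BenchmarkSumLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = sumLoop(1000)
	}
}

func main() {
	// testing.Benchmark runs a single benchmark outside of "go test".
	res := testing.Benchmark(BenchmarkSumLoop)
	fmt.Println("BenchmarkSumLoop", res)
}
```

Running each variant several times (the `-count` flag) matters because benchstat reports variation across runs, which helps separate a real speedup from noise.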

twitchyliquid64 commented 5 years ago

I've put the benchmark in the commit message, but I will note that benchmarks will vary a lot depending on the cache size and processor generation (especially here, where we are emitting raw assembly).

For instance, the interpreted benchmark is roughly the same between my laptop (8th gen, 8 MB cache) and desktop (3rd gen, 6 MB cache), but the native execution benchmarks are all ~15% faster on the laptop. This is largely attributable to a better uop cache, instruction fusion, larger instruction-fetch blocks, a larger reorder buffer, etc. These differences will grow the further we move from reading and writing memory (as in the current approach, where intermediate values are written into the stack slice) towards storing intermediates in registers.