go-interpreter / wagon

wagon, a WebAssembly-based Go interpreter, for Go.
BSD 3-Clause "New" or "Revised" License

amd64 compiler optimizations: Improves speed by 30%. #120

Closed twitchyliquid64 closed 5 years ago

twitchyliquid64 commented 5 years ago
codecov-io commented 5 years ago

Codecov Report

Merging #120 into master will increase coverage by 0.04%. The diff coverage is 75.75%.

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   65.98%   66.02%   +0.04%     
==========================================
  Files          41       41              
  Lines        4066     4130      +64     
==========================================
+ Hits         2683     2727      +44     
- Misses       1116     1133      +17     
- Partials      267      270       +3

Impacted Files Coverage Δ
exec/internal/compile/backend_amd64.go 77.9% <75.75%> (-1.57%) ↓

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update eda5438...64664a9.

sbinet commented 5 years ago

ah, also: could you put the before/after speed improvements in one of the commit messages? (probably using the benchmarks and golang.org/x/perf/cmd/benchstat)

twitchyliquid64 commented 5 years ago

I've put the benchmark in the commit message, but I will note that benchmarks will vary a lot depending on the cache size and generation of the processor (especially here, where we are emitting raw assembly).

For instance, the interpreted benchmark is roughly the same between my laptop (8th gen, 8 MB cache) and desktop (3rd gen, 6 MB cache), but the native execution benchmarks are all ~15% faster on the newer laptop. This is largely attributable to a better uop cache, instruction fusion, larger instruction-fetch blocks, a larger reorder buffer, etc. These differences will get larger the further we move from reading/writing to memory (as with the current approach, where intermediate values are written into the stack slice) to storing intermediates in registers.