Open GavinRay97 opened 4 years ago
Interesting. Btw I compared fastexp
and native Math.exp
which builtin in AssemblyScript and builtin it seems faster on latest Firefox & Chrome:
native wasm exp: 152.708984375ms
fastexp: 201.573974609375ms
See benchmark fiddle: https://webassembly.studio/?f=nd524yx5pt
@MaxGraey
Huh, interesting. I get 60ms/70ms Native/fastexp on Firefox Nightly with my machine.
The original, very naive AssemblyScript code I wrote for benchmarking against JS is below:
(x
being an array of 1,000,000 floats between -1,000,000 and 1,000,000 with precision of 5 decimal places, and derivate
being a random true/false value)
@GavinRay97 Also I suggest preallocate result array:
let results = new Array<f64>(x.lenght);
instead
let results: f64[] = [];
Btw performace for exp
/log
and pow
will be increased even more soon. I found new implementations which use lookup tables. But it will be support only for -O3
opt level due to size concerns
@GavinRay97 Also I suggest preallocate result array:
let results = new Array<f64>(x.lenght);
instead
let results: f64[] = [];
Definitely. Whatever AssemblyScript or Rust code I write for benchmarking will undoubtedly be poor and sub-optimal. I have limited experience writing typed languages, and zero experience working with memory and low-level stuff like WASM.
I have no doubt that better equipped people (like yourself and @postspectacular) will be able to give feedback on how awful it is and provide much, much better implementations :joy:
Btw performace for exp/log and pow will be increased even more soon. I found new implementations which use lookup tables. But it will be support only for -O3 opt level
This is exciting! :eyes: :tada:
Also plz don't use Math.pow(x, 2)
. Currently we can't optimize this to x * x
but it will be in future. You could absolutely safe replace x ** 2
to x * x
and this increase speed even more
hey @GavinRay97 @MaxGraey - someone jumped the gun here a little bit, since I posted that gist earlier in our chat without much further comments (which were meant to be posted later).
I'm somewhat surprised to learn that the "native" exp is faster than the approximation now, since in my earlier tests on node v12.10 (MBP2010) the fastexp
version of the large sigmoid loop with 1 million iterations was almost 2x faster than the Math.exp
version of that same loop... but i also have to point out (and hoped that was clear) that this approximation was not my own impl, but taken from musicdsp.org and the point of this gist was merely to show that there're lots of possibilities the team could explore further here (incl. some of the ones Max just pointed out above :) )
I'd still recommend using raw pointers over arrays for these hottest/tightest loops, alone to avoid bounds checking and any form of indirection in these places. But again, this is my intuition from working w/ C and Max will be much better equipped to give perf tips for AssemblyScript!
@postspectacular
My apologies if you had not intended for that to be linked. The past days have done a lot of experimenting with performance and benchmarking, and I wanted to start collecting all of my notes, debug logs, code snippets, and screenshots in one place as to not forget anything.
Did not expect to get such immediate feedback :sweat_smile:
More than anything I just wanted to make sure to have saved + catalogued what you mentioned so that it did not slip my mind later.
The idea being that @christianechevarria and @luiscarbonell could look over everything and develop a concrete plan or follow-up questions.
I can remove the link to the gist.
@postspectacular fastexp
could be faster on [-1, 1] range:
native wasm exp: 306.030029296875ms
fastexp: 236.511962890625ms
@GavinRay97 - no worries, i just felt like I needed to contextualize these snippets some more... all good!
@MaxGraey - thanks for reminding me of wasmstudio :) - I've uploaded some of my code from that gist there too and the results there are still similar to what i've found in node earlier (and contradictory to yours). But the only difference between the two benchmarked fn's is their version of exp()
:
https://webassembly.studio/?f=kg1rea6qvbn
(@GavinRay97 et al - once the project sandbox has initialized, first open the browser console, then press "build & run", timing info will be shown in console only... tests take ~12 sec altogether)
The tests are currently setup to work on normalized input data, executing 500 iterations on 1 million values each. The JS glue code is in /src/main.js
. The AssemblyScript in /assembly/main.ts
.
On my laptop (MBP2015, Chrome 78) I'm getting:
bench("sigmoidApproxPtr"); // 811.406005859375ms (100 iters) // 4046.489013671875ms (500 iters)
bench("sigmoidNatPtr"); // 1504.90380859375ms (100 iters) // 7569.575927734375ms (500 iters)
Getting consistent results after multiple runs... food for thought, I think! And please, this is NO upstaging attempt, whatsoever!
@MaxGraey as mentioned above, i did actually use normalized values for this, since I assumed the values for this ML example (activation function) would be in that range, but also because the approximation has the least error in that interval (reference).
yeah, if you confident that argument always stay in [-1, 1] bounds you could use faster approximation
This document will be a work-in-progress while figuring out a game plan for how and where efforts can be best-focused for performance increases.
I have experimented with benchmarking the
squash()
andsigmoid()
/sigmoidDerivative()
functions re-written in both Rust and AssemblyScript compiled to WASM.Rust -> WASM showed very large (~x4.5) performance increases versus JS on Chrome with
squash()
given 1,000,000 inputs. On Firefox, the performance was roughly the same. Still need to benchmark Rust WASM on Node.AssemblyScript -> WASM was ~x2.5 as performant with
squash()
given 1,000,000 inputs on Node. Still need to benchmark AssemblyScript WASM on browsers.I will create a repo for the test functions and post the source code here, as well as upload screenshots from console output that have benchmark measurements soon.
@postspectacular gave some feedback and insane performance tuning on the basic implementation I had in AssemblyScript as well:
https://gist.github.com/postspectacular/3dccbfed1b753edadf1b6fee8add4808#file-03-sigmoid-simd-ts-L1
He dropped down to low-level memory/pointers, and even to SIMD instructions (yet from his comment, it seems as though there are limitations in WASM with V8 engine that dont support it quite yet). So there is obviously much room to improve when considering these benchmarks -- I am not familiar with Rust, AssemblyScript, or even low-level memory concepts on the whole. Perhaps some other users might be able to give feedback or advice here?
The best way to approach this is probably: