rreusser opened 8 years ago
The difference, illustrated:
The straightforward recursive algorithm:
```
1   2   3   4   5   6   7   8   9   10
 \ /     \   \ /     \ /     \   \ /
  3       3   9       13      \   19
   \       \ /         \       \ /
    \       12          \       27
     \     /             \     /
      15                  40
        \                 /
         \               /
                55
```
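The recursive scheme above can be sketched like so (a minimal illustration assuming a floor-of-half split, with hypothetical names, not the repo's actual implementation):

```javascript
// Recursive pairwise summation, as in the tree above: split the range
// at the floor of the midpoint, sum each half recursively, and add the
// two partial sums.
function pairwiseRecursive (x, lo, hi) {
  if (lo === undefined) lo = 0;
  if (hi === undefined) hi = x.length;
  var n = hi - lo;
  if (n === 1) return x[lo];
  var mid = lo + (n >> 1);
  return pairwiseRecursive(x, lo, mid) + pairwiseRecursive(x, mid, hi);
}
```

With this split rule, ten elements divide into [1..5] and [6..10], then [1,2] + [3,4,5] and so on, reproducing the tree above.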
My non-recursive algorithm:
```
1   2   3   4   5   6   7   8   9   10
 \ /     \ /     \ /     \ /     \ /
  3       7       11      15      19
   \     /         \     /        /
    10              26           19
      \            /            /
       36                      19
         \                    /
              55
```
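And a sketch of the non-recursive, power-of-two variant (again an illustration of the idea with hypothetical names, not the repo's code):

```javascript
// Non-recursive pairwise summation, as in the second tree: sum adjacent
// pairs into a work array, level by level; an odd trailing element
// falls through unchanged to the next level.
function pairwiseNonRecursive (x) {
  var work = Array.prototype.slice.call(x);
  var n = work.length;
  while (n > 1) {
    var half = n >> 1;
    for (var i = 0; i < half; i++) {
      work[i] = work[2 * i] + work[2 * i + 1];
    }
    if (n & 1) {
      work[half] = work[n - 1]; // odd element falls through
      n = half + 1;
    } else {
      n = half;
    }
  }
  return work[0];
}
```

For ten elements the levels are [3, 7, 11, 15, 19], then [10, 26, 19], then [36, 19], then [55], matching the tree above.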
Can rewrite, but it's nontrivial. 😞
@tab58 curious to hear your thoughts on whether a factor of ~9 speedup (62 -> 540) is negated by this shortcoming.
I've run the `block1PairwiseSum` summation algorithm with `--trace-deopt` on, and it's getting deoptimized for some reason. The reason is: `Insufficient type feedback for keyed load`. I think it means there's not enough information for V8 to do type inference on the array being passed in, but I'm not sure about that or how to fix it. Here's the trace information:
https://gist.github.com/tab58/84174d1123629bc61b9a6859939bfce6
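One plausible mitigation for that message (a hypothesis, not something verified against this trace; the function names here are made up) is to make sure the hot loop only ever sees one array kind, so the keyed load `x[i]` gets monomorphic type feedback:

```javascript
// Hot summation loop: ideally only ever called with a Float64Array so
// the keyed load x[i] sees a single element kind.
function sumBlock1 (x) {
  var s = 0;
  for (var i = 0; i < x.length; i++) s += x[i];
  return s;
}

// Normalize plain arrays once at the boundary instead of letting mixed
// element kinds reach the hot loop.
function sumAny (x) {
  var typed = x instanceof Float64Array ? x : new Float64Array(x);
  return sumBlock1(typed);
}
```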
Good call. Not sure of the cause, but that's basically the answer, at least.
More generally, I think the right answer is going to be recursive calls in blocks of four or eight, which will probably more than compensate for the function calls, although trees with power-of-two sizes did end up being pretty straightforward.
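That "blocks of four or eight" idea could look something like this (a hypothetical sketch, not code from the repo): recurse until the range fits in one block, then sum the block with a plain loop, cutting the number of function calls by roughly the block size while keeping the coarse pairwise tree shape.

```javascript
// Pairwise summation with a blocked base case: below blockSize elements,
// fall back to a serial loop instead of recursing further.
function pairwiseBlocked (x, lo, hi, blockSize) {
  if (lo === undefined) lo = 0;
  if (hi === undefined) hi = x.length;
  if (blockSize === undefined) blockSize = 8;
  var n = hi - lo;
  if (n <= blockSize) {
    var s = 0;
    for (var i = lo; i < hi; i++) s += x[i];
    return s;
  }
  var mid = lo + (n >> 1);
  return pairwiseBlocked(x, lo, mid, blockSize) +
         pairwiseBlocked(x, mid, hi, blockSize);
}
```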
Sorry that the last post was a bit of a non sequitur. I don't think it's the tree structure that slows it down but the number of function calls; I think the recursive function is hitting a wall on the cost of each call. I've heard for a while now that function calls are expensive, but I've never really seen that in my own code. I think the non-recursive function would yield a good amount of speedup, but I want to see if the optimizing JIT compiler flags it and deoptimizes it. If that happens, then there won't be much gain from it, IMO.
Hmm… working through the different versions and trying to determine if and why they're optimized/deoptimized… I'm not particularly good at reading the output…
Same here. I'm trying to find anything I can on how to parse through the output.
See new branch: https://github.com/rreusser/summation-algorithms/tree/trace-optimization
Run with `npm run trace`
I used the `optimized` module to directly detect whether a function is optimized. They both seem to be. I'm not totally sure how to read the raw `--trace-deopt` output, but it seems like they should both be optimized, since I'm not aware of any direct violations of the known optimization killers. (But the raw output is suspicious, in addition to being cryptic and opaque.)
Oh, and I should add that I did the basic try/catch sanity check to verify that the answer to "optimized?" is at least meaningful. (Answer: yes; try/catch kills optimization.)
So I ran the new branch and increased the number of elements in the array to 200,000. What's funny is that the same deoptimization message from the gist comes up right before the `optimized` output reports that the function is optimized.
Yeah, that makes me a little suspicious that the raw output is trickier to read than it looks. For one thing, V8 doesn't optimize until the function has been run a few times and marked hot, right? Note that the module version runs the function, marks it for optimization, and then runs it again: https://github.com/node-modules/optimized/blob/master/optimized.js#L69-L91
@tab58 I removed the `%OptimizeFunctionOnNextCall` directive in `optimized` and ran the algorithms repeatedly with a short Float64Array input to see how long it takes before they get optimized. I think one answer here is that the algorithms don't get optimized until they actually take up a significant amount of time, so `--trace-opt` after a couple of runs may not be telling the full story. The plot below shows that none of the seven algorithms gets automatically optimized until around run 700 (again, this is for a short input; a million-length input may get optimized sooner).
I modified the benchmark in the `trace-optimization` branch to optimize the algorithms right away. The results suggest that the original benchmarks reflect optimized functions, since they don't differ from what we saw before:
```
╰─± npm run benchmark

> summation-algorithms@1.0.0 benchmark /Users/rreusser/projects/node/scijs/kahan
> node --allow-natives-syntax benchmark/index.js

Function sumPairwiseRecursive({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock1({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock2({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock4({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock8({"0":1,"1":2,"2":3}) is optimized
Function sumKahan({"0":1,"1":2,"2":3}) is optimized
Function sumSerial({"0":1,"1":2,"2":3}) is optimized

Pairwise summation x 61.20 ops/sec ±1.03% (62 runs sampled)
Non-recursive pairwise summation (blocksize = 1) x 85.32 ops/sec ±1.01% (71 runs sampled)
Non-recursive pairwise summation (blocksize = 2) x 171 ops/sec ±0.82% (84 runs sampled)
Non-recursive pairwise summation (blocksize = 4) x 320 ops/sec ±0.84% (87 runs sampled)
Non-recursive pairwise summation (blocksize = 8) x 502 ops/sec ±1.03% (87 runs sampled)
Kahan summation x 228 ops/sec ±1.00% (86 runs sampled)
Naive summation x 456 ops/sec ±1.65% (87 runs sampled)
Fastest function is Non-recursive pairwise summation (blocksize = 8).

True value: 3.523541885276552e+21
Serial error: 22020096
Pairwise error: 1048576
Non-recursive pairwise blocksize=1 error: 0
Non-recursive pairwise blocksize=2 error: 0
Non-recursive pairwise blocksize=4 error: 0
Non-recursive pairwise blocksize=8 error: 0
Kahan error: 0

Function sumPairwiseRecursive({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock1({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock2({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock4({"0":1,"1":2,"2":3}) is optimized
Function sumPairwiseBlock8({"0":1,"1":2,"2":3}) is optimized
Function sumKahan({"0":1,"1":2,"2":3}) is optimized
Function sumSerial({"0":1,"1":2,"2":3}) is optimized
```
Long story short, it still appears pairwise summation is kinda slow (though again, I suspect that blocking the sums could speed it up quite a bit).
The trees are slightly different for the recursive and non-recursive algorithms. The non-recursive algorithm works in terms of powers of two, while the recursive algorithm splits the range roughly in half. It's not clear to me whether this has a meaningful effect on the stability of the computed result. Presumably the elements that fall through a level have slightly different numerical properties than the elements that percolate through the whole tree, which is an unpleasant and maybe prohibitive asymmetry.
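To make that asymmetry concrete, here's a hypothetical helper (not from the repo) that counts how many additions each element passes through on its way to the root of the recursive, floor-of-half-split tree:

```javascript
// For an n-element recursive pairwise tree, return the number of
// additions (roundings) each element participates in before reaching
// the root.
function recursiveDepths (n) {
  var depths = new Array(n).fill(0);
  function walk (lo, hi, depth) {
    if (hi - lo === 1) {
      depths[lo] = depth;
      return;
    }
    var mid = lo + ((hi - lo) >> 1);
    walk(lo, mid, depth + 1);
    walk(mid, hi, depth + 1);
  }
  walk(0, n, 0);
  return depths;
}
```

For n = 10 this gives [3, 3, 3, 4, 4, 3, 3, 3, 4, 4]: the elements that fall through a level are rounded one fewer time than their neighbors.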