Open · 414owen opened this pull request 4 years ago
The main problem is of course the performance impact. We go from 3 words per KBN to 5 words per KBN, plus pointer chasing. The question is then what the actual impact of this change is. Benchmark results (with -O2) are exactly the same, so I suppose GHC is smart enough to just unbox everything in the main loop and not allocate Kahan/KBN/KB2 objects at all.
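For reference, a minimal sketch of the layout change in question (the specialised constructor mirrors `Numeric.Sum`'s current shape; the generalised one is hypothetical):

```haskell
-- Specialised: both fields are unpacked into the constructor, so one
-- KBNSum is a single heap object: a header word plus two raw Doubles.
data KBNSum = KBNSum {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- Generalised: a polymorphic field cannot be unpacked, so each field
-- becomes a pointer to a separately boxed value that must be chased.
data KBNSumG a = KBNSumG !a !a
```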
Do benchmarks with -O1 (this could be changed in the cabal file) continue to show no difference?
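For example, in the benchmark stanza of the cabal file (the stanza name here is a guess):

```cabal
benchmark bench
  ghc-options: -O1
```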
P.S. Out of curiosity: how exactly do you use compensated summation with `ad`?
Okay, here are some ~~more complete benchmarks~~
I also dumped the assembly for some minimal examples (old, new).
The new ASM doesn't look great, but I'm not really in a position to evaluate it. The benchmarks don't seem to care too much...
Here are some more easily comparable benchmarks: https://gist.github.com/414owen/ea366fc110a4e416ae9ceea035689a03
The new generalized version with -O2 is slightly faster than the current version in every test. I don't know why this is the case.
I looked into the Core and yes, GHC unboxes the KBN/etc. accumulators, so the inner loop has type `Double# -> Double# -> Int# -> Double`. It means it doesn't matter whether the accumulator is boxed or not. On one hand this is pretty much the ideal case for the optimizer: a simple loop where everything is inlined. On the other hand, it's more or less how the accumulator is intended to be used.
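A source-level sketch of what that unboxed worker amounts to (hand-written here, not actual Core output; `U` stands for `Data.Vector.Unboxed`, and the fold body follows the KBN step):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

-- The two KBNSum fields (running sum and compensation) travel as
-- separate strict loop parameters; with -O2 GHC unboxes them, giving
-- the Double# -> Double# -> Int# -> Double worker mentioned above.
sumKBN :: U.Vector Double -> Double
sumKBN v = go 0 0 0
  where
    n = U.length v
    go !s !c !i
      | i >= n    = s + c
      | otherwise =
          let x  = U.unsafeIndex v i
              s' = s + x
              c' = c + if abs s >= abs x then (s - s') + x else (x - s') + s
          in  go s' c' (i + 1)
```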
The question turns out to be: how easy is it to break the optimization above? I can't come up with a counterexample on the spot, so I need to think about it a bit.
> The new ASM doesn't look great, but I'm not really in a position to evaluate it.
AFAIK the NCG was never among the best code generators, so it could well be producing suboptimal assembly.
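One cheap cross-check (a suggestion only, not something measured here) would be to try the LLVM backend and see whether the assembly improves:

```cabal
ghc-options: -O2 -fllvm
```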
> I'm passing `grad`, and optimizing a function ...
And how do KBN & friends enter the picture?
I've added a summation benchmark which inhibits inlining and requires the compiler to actually allocate KBN objects on the heap:
```haskell
-- Step function with inlining disabled, forcing a real KBNSum
-- allocation on every step instead of an unboxed loop.
kbnStep :: Sum.KBNSum Double -> Double -> Sum.KBNSum Double
kbnStep = Sum.add
{-# NOINLINE kbnStep #-}

...

, bench "kbn.Noinline" $ whnf (Sum.kbn . U.foldl' kbnStep Sum.zero) v

...
```
The results are, to say the least, surprising:
```
Sum/kbn                     3.704 ms
Sum/kbn.Noinline [unboxed]  8.467 ms
Sum/kbn.Noinline [boxed]    5.871 ms  (!)
```
For some reason the boxed version outperforms the unboxed one. I didn't try to find out why. It looks like making the accumulator types boxed doesn't carry a large performance penalty, and for good performance everything must be inlined anyway.
Another possibly related puzzle is kb2 outperforming everything else, including naive summation. And this happens despite kb2 doing much more work.
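For context, here is a sketch of the second-order Kahan-Babuška (KB2) step, following the textbook algorithm (Numeric.Sum's exact code may differ in details); the point is that it performs two compensation steps per element:

```haskell
-- Running sum, first-order compensation, second-order compensation.
data KB2Sum = KB2Sum !Double !Double !Double

addKB2 :: KB2Sum -> Double -> KB2Sum
addKB2 (KB2Sum s c cc) x = KB2Sum s' c' cc'
  where
    s' = s + x
    -- Rounding error of the outer addition ...
    k | abs s >= abs x = (s - s') + x
      | otherwise      = (x - s') + s
    c' = c + k
    -- ... and the rounding error of folding that error into c.
    cc' | abs c >= abs k = cc + ((c - c') + k)
        | otherwise      = cc + ((k - c') + c)

kb2 :: KB2Sum -> Double
kb2 (KB2Sum s c cc) = s + c + cc
```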
This PR is turning into an exercise in benchmarking weirdness, but it's difficult to make performance-sensitive choices when benchmarks lie to you. It turns out that the results depend on the order in which the benchmarks are run:
```
Sum/naive  mean 3.632 ms  ( +- 30.21 μs )
Sum/kahan  mean 5.437 ms  ( +- 19.12 μs )
Sum/kb2    mean 2.310 ms  ( +- 17.17 μs )
Sum/kbn    mean 1.566 ms  ( +- 8.775 μs )
```

```
Sum/naive  mean 3.621 ms  ( +- 24.39 μs )
Sum/kahan  mean 5.476 ms  ( +- 20.11 μs )
Sum/kbn    mean 3.712 ms  ( +- 28.19 μs )
Sum/kb2    mean 2.284 ms  ( +- 19.84 μs )
```
When kbn is run last, its run time drops from 3.7 ms to 1.6 ms. That is more than a 2x speedup! Something weird is going on here.
I also attempted to measure the run time of kbn/kb2 summation using perf tools (gist here). The results are very boring and in line with expectations: a 1.9x slowdown for kbn and a 2.7x slowdown for kb2.
I'm trying to gauge interest in upstreaming some more generalized floating-point compensated summations. This is useful to me for use with the `ad` library, where `grad` takes a function `(Traversable f, Num a) => f (Reverse s a) -> Reverse s a`. I use the `RealFloat` instance of `Reverse s a` to implement a loss function, which includes compensated floating-point arithmetic (a sketch of this follows below).

Currently all tests pass apart from GHCJS 8.4. I'll look into it if there's enough interest in merging this.
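Here is that sketch: a hypothetical, self-contained version of the workflow, where the generalised KBN helpers are illustrative names rather than the final API:

```haskell
import Data.List (foldl')
import Numeric.AD (grad)

-- Generalised KBN accumulator: Num/Ord instead of hard-wired Double.
data KBNSum a = KBNSum !a !a

zeroS :: Num a => KBNSum a
zeroS = KBNSum 0 0

addS :: (Num a, Ord a) => KBNSum a -> a -> KBNSum a
addS (KBNSum s c) x = KBNSum s' c'
  where
    s' = s + x
    c' = c + if abs s >= abs x then (s - s') + x else (x - s') + s

kbnS :: Num a => KBNSum a -> a
kbnS (KBNSum s c) = s + c

-- A loss with a compensated inner sum; since it is polymorphic in
-- (Num a, Ord a), it also runs at ad's Reverse s Double.
loss :: (Num a, Ord a) => [a] -> a
loss = kbnS . foldl' addS zeroS . map (\x -> x * x)

main :: IO ()
main = print (grad loss [1.0, 2.0, 3.0 :: Double])  -- [2.0,4.0,6.0]
```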
I've run some preliminary benchmarks, which seem very promising, and which I think establish that any data boxing introduced by this generalization won't do too much harm to users.
edit:
<removed obsolete benchmarks>