Open · 414owen opened this pull request 4 years ago
The main problem is of course the performance impact. We go from 3 words per KBN to 5 words per KBN, plus pointer chasing. The question is then what the actual impact of this change is. Benchmark results (with -O2) are exactly the same, so I suppose GHC is smart enough to just unbox everything in the main loop and not allocate Kahan/KBN/KB2 objects at all.
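For reference, a minimal sketch of the layout change in question (the specialised constructor mirrors `Numeric.Sum`'s current shape; the generalised one is hypothetical):

```haskell
-- Specialised: both fields are unpacked into the constructor, so one
-- KBNSum is a single heap object: a header word plus two raw Doubles.
data KBNSum = KBNSum {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- Generalised: a polymorphic field cannot be unpacked, so each field
-- becomes a pointer to a separately boxed value that must be chased.
data KBNSumG a = KBNSumG !a !a
```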
Do benchmarks with -O1 (this could be changed in the cabal file) continue to show no difference?
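For example, in the benchmark stanza of the cabal file (the stanza name here is a guess):

```cabal
benchmark bench
  ghc-options: -O1
```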
P.S. Out of curiosity: how exactly do you use compensated summation with `ad`?
Okay, here are some ~~more complete benchmarks~~
I also dumped the assembly for some minimal examples (old, new).
The new ASM doesn't look great, but I'm not really in a position to evaluate it. The benchmarks don't seem to care too much...
Here are some more easily comparable benchmarks: https://gist.github.com/414owen/ea366fc110a4e416ae9ceea035689a03
The new generalized version with -O2 is slightly faster than the current version in every test. I don't know why this is the case.
I looked into the Core and yes, GHC unboxes the KBN/etc. accumulators, so the inner loop has type `Double# -> Double# -> Int# -> Double`. It means it doesn't matter whether the accumulator is boxed or not. On one hand this is pretty much the ideal case for the optimizer: a simple loop where everything is inlined. On the other hand, it's more or less how the accumulator is intended to be used.
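A source-level sketch of what that unboxed worker amounts to (hand-written here, not actual Core output; `U` stands for `Data.Vector.Unboxed`, and the fold body follows the KBN step):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

-- The two KBNSum fields (running sum and compensation) travel as
-- separate strict loop parameters; with -O2 GHC unboxes them, giving
-- the Double# -> Double# -> Int# -> Double worker mentioned above.
sumKBN :: U.Vector Double -> Double
sumKBN v = go 0 0 0
  where
    n = U.length v
    go !s !c !i
      | i >= n    = s + c
      | otherwise =
          let x  = U.unsafeIndex v i
              s' = s + x
              c' = c + if abs s >= abs x then (s - s') + x else (x - s') + s
          in  go s' c' (i + 1)
```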
The question turns out to be: how easy is it to break the optimization above? I can't come up with a counterexample on the spot, so I need to think about it a bit.
> The new ASM doesn't look great, but I'm not really in a position to evaluate it.
AFAIK the NCG was never among the best code generators, so it could well be producing suboptimal assembly.
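One cheap cross-check (a suggestion only, not something measured here) would be to try the LLVM backend and see whether the assembly improves:

```cabal
ghc-options: -O2 -fllvm
```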
> I'm passing `grad`, and optimizing a function ...
And how do KBN & friends enter the picture?
I've added a summation benchmark which inhibits inlining and requires the compiler to actually allocate KBN objects on the heap:
```haskell
-- Step function with inlining disabled, forcing a real KBNSum
-- allocation on every step instead of an unboxed loop.
kbnStep :: Sum.KBNSum Double -> Double -> Sum.KBNSum Double
kbnStep = Sum.add
{-# NOINLINE kbnStep #-}

...

, bench "kbn.Noinline" $ whnf (Sum.kbn . U.foldl' kbnStep Sum.zero) v

...
```
The results are, to say the least, surprising:
```
Sum/kbn                     3.704 ms
Sum/kbn.Noinline [unboxed]  8.467 ms
Sum/kbn.Noinline [boxed]    5.871 ms  (!)
```
For some reason the boxed version outperforms the unboxed one. I didn't try to find out why. It looks like making the accumulator types boxed doesn't carry a large performance penalty, and for good performance everything must be inlined anyway.
Another possibly related puzzle is kb2 outperforming everything else, including naive summation. And this happens despite kb2 doing much more work.
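For context, here is a sketch of the second-order Kahan-Babuška (KB2) step, following the textbook algorithm (Numeric.Sum's exact code may differ in details); the point is that it performs two compensation steps per element:

```haskell
-- Running sum, first-order compensation, second-order compensation.
data KB2Sum = KB2Sum !Double !Double !Double

addKB2 :: KB2Sum -> Double -> KB2Sum
addKB2 (KB2Sum s c cc) x = KB2Sum s' c' cc'
  where
    s' = s + x
    -- Rounding error of the outer addition ...
    k | abs s >= abs x = (s - s') + x
      | otherwise      = (x - s') + s
    c' = c + k
    -- ... and the rounding error of folding that error into c.
    cc' | abs c >= abs k = cc + ((c - c') + k)
        | otherwise      = cc + ((k - c') + c)

kb2 :: KB2Sum -> Double
kb2 (KB2Sum s c cc) = s + c + cc
```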
This PR is turning into an exercise in benchmarking weirdness, but it's difficult to make performance-sensitive choices when benchmarks lie to you. It turns out that the results depend on the order in which the benchmarks are run:
```
Sum/naive  mean 3.632 ms  ( +- 30.21 μs )
Sum/kahan  mean 5.437 ms  ( +- 19.12 μs )
Sum/kb2    mean 2.310 ms  ( +- 17.17 μs )
Sum/kbn    mean 1.566 ms  ( +- 8.775 μs )
```

```
Sum/naive  mean 3.621 ms  ( +- 24.39 μs )
Sum/kahan  mean 5.476 ms  ( +- 20.11 μs )
Sum/kbn    mean 3.712 ms  ( +- 28.19 μs )
Sum/kb2    mean 2.284 ms  ( +- 19.84 μs )
```
When kbn is run last, its run time drops from 3.7 ms to 1.6 ms. That is more than a 2x speedup! Something weird is going on here.
I also attempted to measure the run time of kbn/kb2 summation using perf tools (gist here). The results are very boring and in line with expectations: a 1.9x slowdown for kbn and a 2.7x slowdown for kb2.
I'm trying to gauge interest in upstreaming some more generalized floating-point compensated summations. This is useful to me for use with the `ad` library, where `grad` takes a function `(Traversable f, Num a) => f (Reverse s a) -> Reverse s a`. I use the `RealFloat` instance of `Reverse s a` to implement a loss function, which includes compensated floating-point arithmetic (a sketch of this follows below).

Currently all tests pass apart from GHCJS 8.4. I'll look into it if there's enough interest in merging this.
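Here is that sketch: a hypothetical, self-contained version of the workflow, where the generalised KBN helpers are illustrative names rather than the final API:

```haskell
import Data.List (foldl')
import Numeric.AD (grad)

-- Generalised KBN accumulator: Num/Ord instead of hard-wired Double.
data KBNSum a = KBNSum !a !a

zeroS :: Num a => KBNSum a
zeroS = KBNSum 0 0

addS :: (Num a, Ord a) => KBNSum a -> a -> KBNSum a
addS (KBNSum s c) x = KBNSum s' c'
  where
    s' = s + x
    c' = c + if abs s >= abs x then (s - s') + x else (x - s') + s

kbnS :: Num a => KBNSum a -> a
kbnS (KBNSum s c) = s + c

-- A loss with a compensated inner sum; since it is polymorphic in
-- (Num a, Ord a), it also runs at ad's Reverse s Double.
loss :: (Num a, Ord a) => [a] -> a
loss = kbnS . foldl' addS zeroS . map (\x -> x * x)

main :: IO ()
main = print (grad loss [1.0, 2.0, 3.0 :: Double])  -- [2.0,4.0,6.0]
```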
I've run some preliminary benchmarks, which seem very promising, and which I think establish that any data boxing introduced by this generalization won't do too much harm to users.
edit:
<removed obsolete benchmarks>