love-haskell / coercible-utils

Utility functions for Coercible types
https://hackage.haskell.org/package/coercible-utils
BSD 3-Clause "New" or "Revised" License

add initial benchmarks #10

Closed · chessai closed 6 years ago

chessai commented 6 years ago
benchmarked coercible-utils/CoercibleUtils.sum: 10
time                 34.44 ns   (32.94 ns .. 36.22 ns)
                     0.974 R²   (0.958 R² .. 0.989 R²)
mean                 37.09 ns   (35.99 ns .. 40.36 ns)
std dev              5.781 ns   (2.496 ns .. 10.93 ns)
variance introduced by outliers: 80% (severely inflated)

benchmarked coercible-utils/Data.Foldable.sum:  10
time                 66.80 ns   (61.23 ns .. 70.78 ns)
                     0.973 R²   (0.952 R² .. 0.991 R²)
mean                 71.33 ns   (69.68 ns .. 72.63 ns)
std dev              5.377 ns   (3.834 ns .. 7.405 ns)
variance introduced by outliers: 51% (severely inflated)

benchmarked coercible-utils/CoercibleUtils.sum: 100
time                 309.8 ns   (304.0 ns .. 317.2 ns)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 308.9 ns   (307.3 ns .. 310.5 ns)
std dev              5.713 ns   (3.830 ns .. 7.962 ns)

benchmarked coercible-utils/Data.Foldable.sum:  100
time                 746.9 ns   (713.4 ns .. 776.1 ns)
                     0.990 R²   (0.984 R² .. 0.994 R²)
mean                 709.1 ns   (686.8 ns .. 735.2 ns)
std dev              89.66 ns   (66.11 ns .. 121.9 ns)
variance introduced by outliers: 71% (severely inflated)

benchmarked coercible-utils/CoercibleUtils.sum: 1000
time                 2.982 μs   (2.948 μs .. 3.027 μs)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 2.985 μs   (2.971 μs .. 3.007 μs)
std dev              60.24 ns   (44.72 ns .. 86.23 ns)

benchmarked coercible-utils/Data.Foldable.sum:  1000
time                 7.491 μs   (7.324 μs .. 7.709 μs)
                     0.993 R²   (0.987 R² .. 0.996 R²)
mean                 6.859 μs   (6.604 μs .. 7.052 μs)
std dev              735.6 ns   (589.1 ns .. 860.8 ns)
variance introduced by outliers: 64% (severely inflated)

benchmarked coercible-utils/CoercibleUtils.sum: 10000
time                 115.1 μs   (110.7 μs .. 119.8 μs)
                     0.992 R²   (0.978 R² .. 0.999 R²)
mean                 115.9 μs   (113.5 μs .. 121.6 μs)
std dev              12.70 μs   (3.878 μs .. 27.69 μs)
variance introduced by outliers: 64% (severely inflated)

benchmarked coercible-utils/Data.Foldable.sum:  10000
time                 73.38 μs   (72.23 μs .. 74.14 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 73.93 μs   (73.53 μs .. 74.78 μs)
std dev              1.820 μs   (1.312 μs .. 2.399 μs)

benchmarked coercible-utils/CoercibleUtils.sum: 100000
time                 1.651 ms   (1.600 ms .. 1.701 ms)
                     0.995 R²   (0.990 R² .. 0.998 R²)
mean                 1.642 ms   (1.621 ms .. 1.723 ms)
std dev              111.4 μs   (49.73 μs .. 228.6 μs)
variance introduced by outliers: 42% (moderately inflated)

benchmarked coercible-utils/Data.Foldable.sum:  100000
time                 777.4 μs   (738.1 μs .. 800.6 μs)
                     0.988 R²   (0.970 R² .. 0.997 R²)
mean                 797.0 μs   (782.6 μs .. 819.0 μs)
std dev              53.23 μs   (32.48 μs .. 80.07 μs)
variance introduced by outliers: 40% (moderately inflated)

It looks like we do better for sufficiently small lists. Somewhere between sizes of 10^3 and 10^4 we start losing out.
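
As context for reading the numbers, a harness producing output like the above might look roughly as follows, assuming a criterion/gauge-style API (defaultMain, bgroup, bench, nf) and a CoercibleUtils.sum export; the module layout and names here are illustrative, not the actual benchmark file:

import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified CoercibleUtils
import qualified Data.Foldable

main :: IO ()
main = defaultMain
  [ bgroup "coercible-utils" $ concat
      [ [ -- Build the input inside the benchmark so each size is
          -- measured from scratch.
          bench ("CoercibleUtils.sum: " ++ show n) $
            nf (\k -> CoercibleUtils.sum [1 .. k]) n
        , bench ("Data.Foldable.sum: " ++ show n) $
            nf (\k -> Data.Foldable.sum [1 .. k]) n
        ]
      | n <- [10, 100, 1000, 10000, 100000 :: Int]
      ]
  ]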

chessai commented 6 years ago

Note that we are actually using Data.List.sum, which is defined as foldl (+) 0. That is a lazy left fold, but the strictness analyser should be able to determine, in most cases, that we are consuming the list strictly and avoid building up thunks. However, I am really confused about why we are faster for smaller lists and slower for larger ones. Could it be this strictness analysis kicking in?
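
For reference, the two implementations being compared are presumably along these lines (a sketch; the coercible-utils side is reconstructed from the ala Sum foldMap idiom mentioned later in this thread):

import Data.Monoid (Sum(..))

-- Data.List.sum: a lazy left fold. The strictness analyser can
-- usually make the accumulator strict when the result is demanded.
listSum :: Num a => [a] -> a
listSum = foldl (+) 0

-- The foldMap-based version benchmarked above, spelled without 'ala':
-- wrap each element in Sum, combine monoidally, unwrap at the end.
foldMapSum :: Num a => [a] -> a
foldMapSum = getSum . foldMap Sum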

chessai commented 6 years ago

Ah, I figured it out; I was on the right track. The Core for the 'sum' function I defined looks like this:

 -- RHS size: {terms: 25, types: 24, coercions: 8, joins: 1/1}
 sum1_r96e
 sum1_r96e
   = \ @ a_a4MN $dNum_a4MP eta_B1 ->
       joinrec {
         go_a6mW
         go_a6mW ds_a6mX eta1_X40
           = case ds_a6mX of {
               [] -> eta1_X40;
               : y_a6n2 ys_a6n3 ->
                 case eta1_X40 `cast` <Co:2> of nt_s6nQ { __DEFAULT ->
                 jump go_a6mW ys_a6n3 ((+ $dNum_a4MP nt_s6nQ y_a6n2) `cast` <Co:3>)
                 }
             }; } in
       jump go_a6mW
         eta_B1 ((fromInteger $dNum_a4MP $fMonoidSum1) `cast` <Co:3>)

Specifically, the line where it says:

jump go_a6mW ys_a6n3 ((+ $dNum_a4MP nt_s6nQ y_a6n2) `cast` <Co:3>)

This is building up thunks; in other words, it's lazy in the accumulator. The definition of Data.List.sum lets GHC easily apply strictness optimisations to the left fold, so it has predictable performance. GHC cannot perform the same strictness optimisations for our foldMap-based implementation, perhaps because it cannot make the same guarantees (I don't know why).

We can fix this with a strict definition of foldMap, like so:

import qualified Data.Foldable as Foldable

-- foldl' forces the accumulator at every step, so no chain of thunks builds up.
foldMap' :: (Foldable t, Monoid m) => (a -> m) -> t a -> m
{-# INLINE foldMap' #-}
foldMap' f = Foldable.foldl' (\acc x -> acc `mappend` f x) mempty

chessai commented 6 years ago

Also, I changed the definition of 'sum' to be polymorphic in 'Num', to match the behaviour of Foldable precisely. This really shouldn't make a performance difference, but it wasn't hard to do and is the more correct type anyway.
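
Put together, the updated definition presumably looks something like this (a sketch under those assumptions, not the exact committed code):

import Prelude hiding (sum)
import Data.Monoid (Sum(..))
import qualified Data.Foldable as Foldable

-- Polymorphic in both the container and the numeric type, matching
-- Data.Foldable.sum's signature, but strict thanks to foldMap'.
sum :: (Foldable t, Num a) => t a -> a
sum = getSum . foldMap' Sum
  where
    foldMap' f = Foldable.foldl' (\acc x -> acc `mappend` f x) mempty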

chessai commented 6 years ago

Fixed the benchmarks. It seems that for lazy sums, we do better for smaller lists (the size cutoff falls consistently between 10^3 and 10^4).

For strict sums, we always perform better by a noticeable (but not overwhelmingly large) margin.

Note: Although we perform better in this respect, Data.List.sum might be more amenable to list fusion.
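
As a sketch of what that could mean in practice (illustrative, not benchmarked here): when the input list is itself produced by a good producer, GHC's foldr/build fusion may eliminate it entirely for the base-library sum, whereas a foldl'-based foldMap' pipeline may still have to materialise it.

-- With fusion, the intermediate lists from 'map' and the enumeration
-- need never be allocated.
fusedSum :: Int -> Int
fusedSum n = Prelude.sum (map (* 2) [1 .. n])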

sjakobi commented 6 years ago

Thanks. I'm not worrying too much about the performance differences between ala Sum foldMap and sum – this seems to be mostly about foldMap vs sum, not so much about ala.

But I do wonder if we should give so many examples with foldMap – maybe we should use foldMap' there and add a note regarding the performance implications.
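
For instance, such an example might look like this (hypothetical; it assumes ala follows the usual newtype-package convention and that foldMap' is exported):

>>> ala Sum foldMap' [1, 2, 3, 4]
10

together with a sentence noting that foldMap' is strict in the accumulator and so avoids the thunk build-up measured above.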