Performance of Generic-based instances.

As perhaps obvious from my other recent tickets I've been looking at how Binary gets compiled as a matter of investigating GHC performance.

While looking at the resulting code I found that generic-based instances generally don't fully optimize away the overhead of generics.

In particular I've looked at slight variations of this code:

{-# LANGUAGE DeriveGeneric #-}

{-# OPTIONS_GHC #-}
module Main
  ( -- PathComponent(..)
   main
  ) where

import GHC.Generics
import Data.Typeable
import Data.Binary
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as BS

data PathTemplateVariable =

       Var0
     | Var1
     | Var2
     | Var3
     | Var4
     | Var5
     | Var6
     | Var7
     | Var8
     | Var9
  deriving (Generic,Enum)

instance Binary PathTemplateVariable

main :: IO ()
main = do
  let lists = replicate 5000000 Var0
      lbs = encode lists
  print $ BS.length $ BS.toStrict lbs

I found that for an expression like this (and having split the deriving into it's own module):

foo :: PathTemplateVariable -> BL.ByteString
foo x = encode  x

It results in this kind of core on ghcs master branch currently:

Main.foo
  = \ (x_a8bn :: PathTemplateVariable) ->
      B.toLazyByteString
        (case x_a8bn of {
           Var0 ->
             case Derive.$fBinaryPathTemplateVariable79 `cast` <Co:2> :: .. of
             { Data.Binary.Put.PairS ds_a8q5 b_a8q6 ->
             b_a8q6
             };
           Var1 ->
             case Derive.$fBinaryPathTemplateVariable75 `cast` <Co:2> :: .. of
             { Data.Binary.Put.PairS ds_a8q5 b_a8q6 ->
             b_a8q6
             };
        ....

Which looks fine assuming the code in the Derive module we call is just the "put" method for each constructor. But sadly instead these methods all end up calling the generic put method (but at least with a statically computed generic representation of the individual constructor).

-- RHS size: {terms: 6, types: 98, coercions: 100, joins: 0/0}
Derive.$fBinaryPathTemplateVariable79 :: Put
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 60 0}]
Derive.$fBinaryPathTemplateVariable79
  = binary-0.8.9.0-inplace:Data.Binary.Generic.$w$cgput
      @((C1 ('MetaCons "Var0" 'PrefixI 'False) U1
         :+: C1 ('MetaCons "Var1" 'PrefixI 'False) U1)
        :+: (C1 ('MetaCons "Var2" 'PrefixI 'False) U1
             :+: (C1 ('MetaCons "Var3" 'PrefixI 'False) U1
                  :+: C1 ('MetaCons "Var4" 'PrefixI 'False) U1)))
      @((C1 ('MetaCons "Var5" 'PrefixI 'False) U1
         :+: C1 ('MetaCons "Var6" 'PrefixI 'False) U1)
        :+: (C1 ('MetaCons "Var7" 'PrefixI 'False) U1
             :+: (C1 ('MetaCons "Var8" 'PrefixI 'False) U1
                  :+: C1 ('MetaCons "Var9" 'PrefixI 'False) U1)))
      (Derive.$fBinaryPathTemplateVariable35 `cast` <Co:50> :: ..)
      (Derive.$fBinaryPathTemplateVariable26 `cast` <Co:50> :: ..)
      5##64
      5##64
      @ghc-prim:GHC.Types.Any
      Derive.$fBinaryPathTemplateVariable80 -- This is the generic representation of Var0

For runtime performance the issue here is that $w$cgput doesn't get inlined. And indeed for a regular function it's rather large so it not being inlined is not unexpected. But we could force it to inline trivially by adding INLINE pragmas on the methods in inData.Binary.Generic.

And indeed I tried this and for encoding the example data type above via it's generic instance allocations at runtime went down by around a third and runtime similarly improved significantly (but I didn't take exact measurements for runtime).

This isn't, sadly, enough for complete elimination of overhead. The resulting partial specialized in pseudo code looks something like:

put con =
  let put_con x =
      case x of
        L x' -> case x' of
          L x'' -> putWord 0
          R x'' -> putWord 1
        R x' -> case x' of
          L x'' -> ...
  in case x of
    Var0 -> put_con generic_var0_rep
    Var1 -> put_con generic_var1_rep
    ...

If put_con would inline it would cancel out with generic_var0_rep, same if it where to get specialized by SpecConstr but since it's non-recursive this doesn't happen either.

Not that nothing about this is allocating so this is a good win over the current behaviour. But there are a lot of conditional branches taken in order to compute the encoding and the overhead is also larger with larger types.

Maybe this too can be fixed with with some well place INLINE pragmas or inline applications.

Marking gput for :+: as INLINE[0] with the rest marked as INLINE seems to fix this. I assume this is since this way we end up inlining what amounts to put_con since it's small early on. In phase zero we then inline the gput instance for product representations and this shortens out with the generic representation.

That is we use

instance ( GSumPut  a, GSumPut  b
         , SumSize    a, SumSize    b) => GBinaryPut (a :+: b) where
    {-# INLINE[0] gput #-}

with all other generic put instances being INLINE. The only question is then how big the impact of this would be on compile time.

Here is the full fast version of the Data.Binary.Generic code for future reference. I might or might not go through the motions of putting up a patch seeing how there are some very old ones with no activity.

{-# LANGUAGE BangPatterns, CPP, FlexibleInstances, KindSignatures,
    ScopedTypeVariables, TypeOperators, TypeSynonymInstances #-}
{-# LANGUAGE Safe #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}

#if __GLASGOW_HASKELL__ >= 800
#define HAS_DATA_KIND
#endif

-----------------------------------------------------------------------------
-- |
-- Module      : Data.Binary.Generic
-- Copyright   : Bryan O'Sullivan
-- License     : BSD3-style (see LICENSE)
--
-- Maintainer  : Bryan O'Sullivan <bos@serpentine.com>
-- Stability   : unstable
-- Portability : Only works with GHC 7.2 and newer
--
-- Instances for supporting GHC generics.
--
-----------------------------------------------------------------------------
module Data.Binary.Generic
    (
    ) where

import Control.Applicative
import Data.Binary.Class
import Data.Binary.Get
import Data.Binary.Put
import Data.Bits
import Data.Word
#if !MIN_VERSION_base(4,11,0)
import Data.Monoid ((<>))
#endif
#ifdef HAS_DATA_KIND
import Data.Kind
#endif
import GHC.Generics
import Prelude -- Silence AMP warning.

-- Type without constructors
instance GBinaryPut V1 where
    {-# INLINE gput #-}
    gput _ = pure ()

instance GBinaryGet V1 where
    gget   = return undefined

-- Constructor without arguments
instance GBinaryPut U1 where
    {-# INLINE gput #-}
    gput U1 = pure ()

instance GBinaryGet U1 where
    gget    = return U1

-- Product: constructor with parameters
instance (GBinaryPut a, GBinaryPut b) => GBinaryPut (a :*: b) where
    {-# INLINE gput #-}
    gput (x :*: y) = gput x <> gput y

instance (GBinaryGet a, GBinaryGet b) => GBinaryGet (a :*: b) where
    gget = (:*:) <$> gget <*> gget

-- Metadata (constructor name, etc)
instance GBinaryPut a => GBinaryPut (M1 i c a) where
    {-# INLINE gput #-}
    gput = gput . unM1

instance GBinaryGet a => GBinaryGet (M1 i c a) where
    gget = M1 <$> gget

-- Constants, additional parameters, and rank-1 recursion
instance Binary a => GBinaryPut (K1 i a) where
    {-# INLINE gput #-}
    gput = put . unK1

instance Binary a => GBinaryGet (K1 i a) where
    gget = K1 <$> get

-- Borrowed from the cereal package.

-- The following GBinary instance for sums has support for serializing
-- types with up to 2^64-1 constructors. It will use the minimal
-- number of bytes needed to encode the constructor. For example when
-- a type has 2^8 constructors or less it will use a single byte to
-- encode the constructor. If it has 2^16 constructors or less it will
-- use two bytes, and so on till 2^64-1.

#define GUARD(WORD) (size - 1) <= fromIntegral (maxBound :: WORD)
#define PUTSUM(WORD) GUARD(WORD) = putSum (0 :: WORD) (fromIntegral size)
#define GETSUM(WORD) GUARD(WORD) = (get :: Get WORD) >>= checkGetSum (fromIntegral size)

instance ( GSumPut  a, GSumPut  b
         , SumSize    a, SumSize    b) => GBinaryPut (a :+: b) where
    {-# INLINE[0] gput #-}
    gput | PUTSUM(Word8) | PUTSUM(Word16) | PUTSUM(Word32) | PUTSUM(Word64)
         | otherwise = sizeError "encode" size
      where
        size = unTagged (sumSize :: Tagged (a :+: b) Word64)

instance ( GSumGet  a, GSumGet  b
         , SumSize    a, SumSize    b) => GBinaryGet (a :+: b) where
    gget | GETSUM(Word8) | GETSUM(Word16) | GETSUM(Word32) | GETSUM(Word64)
         | otherwise = sizeError "decode" size
      where
        size = unTagged (sumSize :: Tagged (a :+: b) Word64)

sizeError :: Show size => String -> size -> error
sizeError s size =
    error $ "Can't " ++ s ++ " a type with " ++ show size ++ " constructors"

------------------------------------------------------------------------

checkGetSum :: (Ord word, Num word, Bits word, GSumGet f)
            => word -> word -> Get (f a)
checkGetSum size code | code < size = getSum code size
                      | otherwise   = fail "Unknown encoding for constructor"
{-# INLINE checkGetSum #-}

class GSumGet f where
    getSum :: (Ord word, Num word, Bits word) => word -> word -> Get (f a)

class GSumPut f where
    putSum :: (Num w, Bits w, Binary w) => w -> w -> f a -> Put

instance (GSumGet a, GSumGet b) => GSumGet (a :+: b) where
    getSum !code !size | code < sizeL = L1 <$> getSum code           sizeL
                       | otherwise    = R1 <$> getSum (code - sizeL) sizeR
        where
          sizeL = size `shiftR` 1
          sizeR = size - sizeL

instance (GSumPut a, GSumPut b) => GSumPut (a :+: b) where
    {-# INLINE putSum #-}
    putSum !code !size s = case s of
                             L1 x -> putSum code           sizeL x
                             R1 x -> putSum (code + sizeL) sizeR x
        where
          sizeL = size `shiftR` 1
          sizeR = size - sizeL

instance GBinaryGet a => GSumGet (C1 c a) where
    getSum _ _ = gget

instance GBinaryPut a => GSumPut (C1 c a) where
    {-# INLINE putSum #-}
    putSum !code _ x = put code <> gput x

------------------------------------------------------------------------

class SumSize f where
    sumSize :: Tagged f Word64

#ifdef HAS_DATA_KIND
newtype Tagged (s :: Type -> Type) b = Tagged {unTagged :: b}
#else
newtype Tagged (s :: * -> *)       b = Tagged {unTagged :: b}
#endif

instance (SumSize a, SumSize b) => SumSize (a :+: b) where
    sumSize = Tagged $ unTagged (sumSize :: Tagged a Word64) +
                       unTagged (sumSize :: Tagged b Word64)

instance SumSize (C1 c a) where
    sumSize = Tagged 1

Encoding speed increase based on the variantions I tested:

binary-master: 1.0 binary INLINE: 1.2 (20% more throughput) binary INLINE[0]-trick: 1.9 (+90% throughput).

This was collected using this cursed benchmark which while pretty unrealistic gives at least an idea about the potential impact of full specializaton.

You need to compile with -ffull-laziness for these results to make sense or you will just measure allocation performance.

main = do
  let lists = replicate 50000 Var0
      lbs = encode $ lists
  lengths <- (sum <$>) $ forM [0..2000] $ \x -> do
    return $! (BS.length $ BS.toStrict $ encode (lists)) + x
  print lengths

@AndreasPK did this change make it into the binary library?

You also used this as a motivating example for improving the inliner. Can you offer a repro case, so I can see if GHC MR !11579 gets it?

@simonpj I don't believe there have been any INLINE pragmas added in Data.Binary.Generic.

@AndreasPK @kolmodin what is the status here?

So far as I can tell, @AndreasPK has identified that the binary library (maintainer: @kolmodin ) could run with 90% more throughput if we do a few simple things.

I'm not close enough to the problem or the solution, but it certainly sounds attractive -- binary is widely used. Can I help? If someone wants to run with this, I'd be happy to have a call to figure out what to do.

haskell / binary

Performance of Generic-based instances. #200