Open AndreasPK opened 2 years ago
Marking gput for :+:
as INLINE[0] with the rest marked as INLINE seems to fix this. I assume this is since this way we end up inlining what amounts to put_con
since it's small early on. In phase zero we then inline the gput instance for product representations and this shortens out with the generic representation.
That is we use
instance ( GSumPut a, GSumPut b
, SumSize a, SumSize b) => GBinaryPut (a :+: b) where
{-# INLINE[0] gput #-}
with all other generic put instances being INLINE. The only question is then how big the impact of this would be on compile time.
Here is the full fast version of the Data.Binary.Generic
code for future reference. I might or might not go through the motions of putting up a patch seeing how there are some very old ones with no activity.
{-# LANGUAGE BangPatterns, CPP, FlexibleInstances, KindSignatures,
ScopedTypeVariables, TypeOperators, TypeSynonymInstances #-}
{-# LANGUAGE Safe #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}
#if __GLASGOW_HASKELL__ >= 800
#define HAS_DATA_KIND
#endif
-----------------------------------------------------------------------------
-- |
-- Module : Data.Binary.Generic
-- Copyright : Bryan O'Sullivan
-- License : BSD3-style (see LICENSE)
--
-- Maintainer : Bryan O'Sullivan <bos@serpentine.com>
-- Stability : unstable
-- Portability : Only works with GHC 7.2 and newer
--
-- Instances for supporting GHC generics.
--
-----------------------------------------------------------------------------
module Data.Binary.Generic
(
) where
import Control.Applicative
import Data.Binary.Class
import Data.Binary.Get
import Data.Binary.Put
import Data.Bits
import Data.Word
#if !MIN_VERSION_base(4,11,0)
import Data.Monoid ((<>))
#endif
#ifdef HAS_DATA_KIND
import Data.Kind
#endif
import GHC.Generics
import Prelude -- Silence AMP warning.
-- Type without constructors
instance GBinaryPut V1 where
{-# INLINE gput #-}
gput _ = pure ()
instance GBinaryGet V1 where
gget = return undefined
-- Constructor without arguments
instance GBinaryPut U1 where
{-# INLINE gput #-}
gput U1 = pure ()
instance GBinaryGet U1 where
gget = return U1
-- Product: constructor with parameters
instance (GBinaryPut a, GBinaryPut b) => GBinaryPut (a :*: b) where
{-# INLINE gput #-}
gput (x :*: y) = gput x <> gput y
instance (GBinaryGet a, GBinaryGet b) => GBinaryGet (a :*: b) where
gget = (:*:) <$> gget <*> gget
-- Metadata (constructor name, etc)
instance GBinaryPut a => GBinaryPut (M1 i c a) where
{-# INLINE gput #-}
gput = gput . unM1
instance GBinaryGet a => GBinaryGet (M1 i c a) where
gget = M1 <$> gget
-- Constants, additional parameters, and rank-1 recursion
instance Binary a => GBinaryPut (K1 i a) where
{-# INLINE gput #-}
gput = put . unK1
instance Binary a => GBinaryGet (K1 i a) where
gget = K1 <$> get
-- Borrowed from the cereal package.
-- The following GBinary instance for sums has support for serializing
-- types with up to 2^64-1 constructors. It will use the minimal
-- number of bytes needed to encode the constructor. For example when
-- a type has 2^8 constructors or less it will use a single byte to
-- encode the constructor. If it has 2^16 constructors or less it will
-- use two bytes, and so on till 2^64-1.
#define GUARD(WORD) (size - 1) <= fromIntegral (maxBound :: WORD)
#define PUTSUM(WORD) GUARD(WORD) = putSum (0 :: WORD) (fromIntegral size)
#define GETSUM(WORD) GUARD(WORD) = (get :: Get WORD) >>= checkGetSum (fromIntegral size)
instance ( GSumPut a, GSumPut b
, SumSize a, SumSize b) => GBinaryPut (a :+: b) where
{-# INLINE[0] gput #-}
gput | PUTSUM(Word8) | PUTSUM(Word16) | PUTSUM(Word32) | PUTSUM(Word64)
| otherwise = sizeError "encode" size
where
size = unTagged (sumSize :: Tagged (a :+: b) Word64)
instance ( GSumGet a, GSumGet b
, SumSize a, SumSize b) => GBinaryGet (a :+: b) where
gget | GETSUM(Word8) | GETSUM(Word16) | GETSUM(Word32) | GETSUM(Word64)
| otherwise = sizeError "decode" size
where
size = unTagged (sumSize :: Tagged (a :+: b) Word64)
sizeError :: Show size => String -> size -> error
sizeError s size =
error $ "Can't " ++ s ++ " a type with " ++ show size ++ " constructors"
------------------------------------------------------------------------
checkGetSum :: (Ord word, Num word, Bits word, GSumGet f)
=> word -> word -> Get (f a)
checkGetSum size code | code < size = getSum code size
| otherwise = fail "Unknown encoding for constructor"
{-# INLINE checkGetSum #-}
class GSumGet f where
getSum :: (Ord word, Num word, Bits word) => word -> word -> Get (f a)
class GSumPut f where
putSum :: (Num w, Bits w, Binary w) => w -> w -> f a -> Put
instance (GSumGet a, GSumGet b) => GSumGet (a :+: b) where
getSum !code !size | code < sizeL = L1 <$> getSum code sizeL
| otherwise = R1 <$> getSum (code - sizeL) sizeR
where
sizeL = size `shiftR` 1
sizeR = size - sizeL
instance (GSumPut a, GSumPut b) => GSumPut (a :+: b) where
{-# INLINE putSum #-}
putSum !code !size s = case s of
L1 x -> putSum code sizeL x
R1 x -> putSum (code + sizeL) sizeR x
where
sizeL = size `shiftR` 1
sizeR = size - sizeL
instance GBinaryGet a => GSumGet (C1 c a) where
getSum _ _ = gget
instance GBinaryPut a => GSumPut (C1 c a) where
{-# INLINE putSum #-}
putSum !code _ x = put code <> gput x
------------------------------------------------------------------------
class SumSize f where
sumSize :: Tagged f Word64
#ifdef HAS_DATA_KIND
newtype Tagged (s :: Type -> Type) b = Tagged {unTagged :: b}
#else
newtype Tagged (s :: * -> *) b = Tagged {unTagged :: b}
#endif
instance (SumSize a, SumSize b) => SumSize (a :+: b) where
sumSize = Tagged $ unTagged (sumSize :: Tagged a Word64) +
unTagged (sumSize :: Tagged b Word64)
instance SumSize (C1 c a) where
sumSize = Tagged 1
Encoding speed increase based on the variantions I tested:
binary-master: 1.0 binary INLINE: 1.2 (20% more throughput) binary INLINE[0]-trick: 1.9 (+90% throughput).
This was collected using this cursed benchmark which while pretty unrealistic gives at least an idea about the potential impact of full specializaton.
You need to compile with -ffull-laziness for these results to make sense or you will just measure allocation performance.
main = do
let lists = replicate 50000 Var0
lbs = encode $ lists
lengths <- (sum <$>) $ forM [0..2000] $ \x -> do
return $! (BS.length $ BS.toStrict $ encode (lists)) + x
print lengths
@AndreasPK did this change make it into the binary library?
You also used this as a motivating example for improving the inliner. Can you offer a repro case, so I can see if GHC MR !11579 gets it?
@simonpj I don't believe there have been any INLINE
pragmas added in Data.Binary.Generic
.
@AndreasPK @kolmodin what is the status here?
So far as I can tell, @AndreasPK has identified that the binary
library (maintainer: @kolmodin ) could run with 90% more throughput if we do a few simple things.
I'm not close enough to the problem or the solution, but it certainly sounds attractive -- binary
is widely used. Can I help? If someone wants to run with this, I'd be happy to have a call to figure out what to do.
As perhaps obvious from my other recent tickets I've been looking at how Binary gets compiled as a matter of investigating GHC performance.
While looking at the resulting code I found that generic-based instances generally don't fully optimize away the overhead of generics.
In particular I've looked at slight variations of this code:
I found that for an expression like this (and having split the deriving into it's own module):
It results in this kind of core on ghcs master branch currently:
Which looks fine assuming the code in the Derive module we call is just the "put" method for each constructor. But sadly instead these methods all end up calling the generic put method (but at least with a statically computed generic representation of the individual constructor).
For runtime performance the issue here is that $w$cgput doesn't get inlined. And indeed for a regular function it's rather large so it not being inlined is not unexpected. But we could force it to inline trivially by adding INLINE pragmas on the methods in in
Data.Binary.Generic
.And indeed I tried this and for encoding the example data type above via it's generic instance allocations at runtime went down by around a third and runtime similarly improved significantly (but I didn't take exact measurements for runtime).
This isn't, sadly, enough for complete elimination of overhead. The resulting partial specialized in pseudo code looks something like:
If
put_con
would inline it would cancel out withgeneric_var0_rep
, same if it where to get specialized by SpecConstr but since it's non-recursive this doesn't happen either.Not that nothing about this is allocating so this is a good win over the current behaviour. But there are a lot of conditional branches taken in order to compute the encoding and the overhead is also larger with larger types.
Maybe this too can be fixed with with some well place INLINE pragmas or
inline
applications.