haskell-unordered-containers / unordered-containers

Efficient hashing-based container types
BSD 3-Clause "New" or "Revised" License
221 stars 97 forks source link

Efficiently serializing HashMaps and HashSets #25

Open basvandijk opened 12 years ago

basvandijk commented 12 years ago

Hi Johan,

I would like to have an efficient way of serializing (with cereal) HashMaps and HashSets. I think going through lists is inefficient because the hashes have to be recomputed (although I have not benchmarked this yet, so I could be wrong):

instance (Serialize a, Hashable a, Eq a) => Serialize (HS.HashSet a) where
    get = HS.fromList <$> get
    put = put . HS.toList

instance (Serialize k, Serialize v, Eq k, Hashable k) => Serialize (HMS.HashMap k v) where
    get = HMS.fromList <$> get
    put = put . HMS.toList

So ideally I would like the following instances:

instance (Serialize k, Serialize v) => Serialize (FullList k v) where
    put (FL k v fl) = put k >> put v >> put fl
    get = liftM3 FL get get get

instance (Serialize k, Serialize v) => Serialize (List k v) where
    put Nil          = putWord8 0
    put (Cons k v l) = putWord8 1 >> put k >> put v >> put l
    get = do tag <- getWord8
             case tag of
               0 -> return Nil 
               1 -> liftM3 Cons get get get
               _ -> fail "Data.FullList.Lazy: Decoding error: unknown tag"

instance (Serialize k, Serialize v) => Serialize (HashMap k v) where
    put (Bin sm l r) = putWord8 0 >> put sm >> put l >> put r
    put (Tip h fl)   = putWord8 1 >> put h >> put fl
    put Nil          = putWord8 2

    get = do tag <- getWord8
             case tag of
               0 -> liftM3 Bin get get get
               1 -> liftM2 Tip get get
               2 -> return Nil
               _ -> fail "Data.HashMap.Common: Decoding error: unknown tag"

However this requires access to the constructors which is only possible if:

  1. You add these instances to unordered-containers. Which unfortunately adds a dependency on cereal.
  2. Export the constructors from a new Data.HashMap.Unsafe module. This module can be marked as {-# LANGUAGE Unsafe #-} so that users of Safe modules can't use this module. Then users like me can write the (orphan) instances in their own applications.

Do you have a preference?

Regards,

Bas

tibbe commented 12 years ago

Sorry for not getting back to you earlier. I wasn't sure how to handle this request. I don't think (1) is an option. We can either expose the internals (through Data.HashMap.Strict.Internal say) or export two functions, fromListWithHash and toListWithHash. I don't think we can make the latter two fuse so I don't know if they're sufficient for your needs.

fegu commented 11 years ago

Hi, I can't find Data.HashSet.Internal in the main code, was this not merged in? This extension seems promising to solve the following: building a 300MB [HashSet] in RAM, then using cereal encode by means of fromList (see above). When cereal decode rebuilds this in a fresh run, the datastructure suddenly occupies 1GB RAM.

jaspervdj commented 9 years ago

It seems like this was merged in, but then the constructors were hidden again in a later release. I will send a new pull request which exposes just the minimal set of constructors needed to write these instances.

Even then the instances look a little nasty, because we need to write one for GHC's array, but it causes a huge speedup which is definitely worth the nasty orphans.

{-# LANGUAGE DeriveGeneric      #-}
{-# LANGUAGE MagicHash          #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE UnboxedTuples      #-}

import           Control.Monad         (replicateM)
import           Data.Binary
import           Data.HashMap.Internal
import           GHC.Exts              (Int (..), indexArray#, newArray#,
                                        sizeofArray#, unsafeFreezeArray#,
                                        writeArray#, (+#), (<#))
import           GHC.Generics          (Generic)
import           GHC.ST                (ST (..), runST)

deriving instance Generic (Leaf k v)
deriving instance Generic (HashMap k v)

instance Binary a => Binary (Array a) where
    put (Array arr#) = do
        put (I# (sizeofArray# arr#))
        go 0#
      where
        go i# = case i# <# sizeofArray# arr# of
            0# -> return ()
            _  -> case indexArray# arr# i# of
                (# x #) -> put x >> go (i# +# 1#)

    get = do
        I# len# <- get
        list    <- replicateM (I# len#) get
        return $ runST $ ST $ \s# ->
            case newArray# len# undefined s# of
                (# s1#, marr# #) -> case go 0# marr# list s1# of
                    s2# -> case unsafeFreezeArray# marr# s2# of
                        (# s3#, arr# #) -> (# s3#, Array arr# #)
      where
        go _  _     []       s# = s#
        go i# marr# (x : xs) s# =
            case writeArray# marr# i# x s# of
                s'# -> go (i# +# 1#) marr# xs s'#

instance (Binary k, Binary v) => Binary (Leaf k v)
instance (Binary k, Binary v) => Binary (HashMap k v)
tibbe commented 9 years ago

I'm completely changing the internal representation so if we expose the current one, everything will just break in the next release.

Do we have a good understanding why toList doesn't work well? It's writing in terms of build/foldr fusion so it should fuse with a consumer like mapM_.

What's the speed-up of the above?

tibbe commented 9 years ago

I chatted a bit with @dcoutts about this. He thinks we could get the traversals to fuse, but we need the API to expose a way to not recomputing the hashes e.g. by exporting fromListWithHash and toListWithHash (perhaps with an unsafe prefix).

dcoutts commented 9 years ago

Seems to me that a toListWithHash should be able to fuse fine. It also seems plausible that we could write a fusable fromListWithHash.

sergei-mironov commented 8 years ago

Hi. I badly needed Binary instance for HashMap, so I incorporated all files from unordered-containers-0.2.7.0 into my project and added the following modules, based on @jaspervdj post

module Data.HashMap.Internal
  (
    HashMap(..)
  , Array(..)
  , Leaf(..)
  )
  where

import Data.HashMap.Base
import Data.HashMap.Array
{-# LANGUAGE DeriveGeneric      #-}
{-# LANGUAGE MagicHash          #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE UnboxedTuples      #-}

module Data.HashMap.Binary where

import           Control.Monad         (replicateM)
import           Data.Binary
import           Data.HashMap.Internal
import           GHC.Exts              (Int (..), indexArray#, newArray#,
                                        sizeofArray#, unsafeFreezeArray#,
                                        writeArray#, (+#), (<#))
import           GHC.Generics          (Generic)
import           GHC.ST                (ST (..), runST)

deriving instance Generic (Leaf k v)
deriving instance Generic (HashMap k v)

instance Binary a => Binary (Array a) where
    put (Array arr#) = do
        put (I# (sizeofArray# arr#))
        go 0#
      where
        go i# = case i# <# sizeofArray# arr# of
            0# -> return ()
            _  -> case indexArray# arr# i# of
                (# x #) -> put x >> go (i# +# 1#)

    get = do
        I# len# <- get
        list    <- replicateM (I# len#) get
        return $ runST $ ST $ \s# ->
            case newArray# len# undefined s# of
                (# s1#, marr# #) -> case go 0# marr# list s1# of
                    s2# -> case unsafeFreezeArray# marr# s2# of
                        (# s3#, arr# #) -> (# s3#, Array arr# #)
      where
        go _  _     []       s# = s#
        go i# marr# (x : xs) s# =
            case writeArray# marr# i# x s# of
                s'# -> go (i# +# 1#) marr# xs s'#

instance (Binary k, Binary v) => Binary (Leaf k v)
instance (Binary k, Binary v) => Binary (HashMap k v)

Let it be there as a ready-to-go temporal solution.

treeowl commented 4 years ago

My main concern with this concept is that an adversary could very easily give us a bogus serialization, with hashes that don't match, etc. Is cereal intended to be used only with trusted data? If this is for a function to be used in a trusted context only, that's fine.

clkamp commented 4 years ago

We would also need this for a project. At least here, the binary data are trusted, so it would be fine and much faster than the current solution.

sjakobi commented 4 years ago

My PR https://github.com/haskell-unordered-containers/unordered-containers/pull/283 seems useful in the context of this issue since it exposes the internal modules. Reviews and feedback are very welcome! :)