Open basvandijk opened 12 years ago
Sorry for not getting back to you earlier. I wasn't sure how to handle this request. I don't think (1) is an option. We can either expose the internals (through Data.HashMap.Strict.Internal
say) or export two functions, fromListWithHash
and toListWithHash
. I don't think we can make the latter two fuse so I don't know if they're sufficient for your needs.
Hi, I can't find Data.HashSet.Internal in the main code, was this not merged in? This extension seems promising to solve the following: building a 300MB [HashSet] in RAM, then using cereal encode by means of fromList (see above). When cereal decode rebuilds this in a fresh run, the datastructure suddenly occupies 1GB RAM.
It seems like this was merged in, but then the constructors were hidden again in a later release. I will send a new pull request which exposes just the minimal set of constructors needed to write these instances.
Even then the instances look a little nasty, because we need to write one for GHC's array, but it causes a huge speedup which is definitely worth the nasty orphans.
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE UnboxedTuples #-}
import Control.Monad (replicateM)
import Data.Binary
import Data.HashMap.Internal
import GHC.Exts (Int (..), indexArray#, newArray#,
sizeofArray#, unsafeFreezeArray#,
writeArray#, (+#), (<#))
import GHC.Generics (Generic)
import GHC.ST (ST (..), runST)
deriving instance Generic (Leaf k v)
deriving instance Generic (HashMap k v)
instance Binary a => Binary (Array a) where
put (Array arr#) = do
put (I# (sizeofArray# arr#))
go 0#
where
go i# = case i# <# sizeofArray# arr# of
0# -> return ()
_ -> case indexArray# arr# i# of
(# x #) -> put x >> go (i# +# 1#)
get = do
I# len# <- get
list <- replicateM (I# len#) get
return $ runST $ ST $ \s# ->
case newArray# len# undefined s# of
(# s1#, marr# #) -> case go 0# marr# list s1# of
s2# -> case unsafeFreezeArray# marr# s2# of
(# s3#, arr# #) -> (# s3#, Array arr# #)
where
go _ _ [] s# = s#
go i# marr# (x : xs) s# =
case writeArray# marr# i# x s# of
s'# -> go (i# +# 1#) marr# xs s'#
instance (Binary k, Binary v) => Binary (Leaf k v)
instance (Binary k, Binary v) => Binary (HashMap k v)
I'm completely changing the internal representation so if we expose the current one, everything will just break in the next release.
Do we have a good understanding why toList
doesn't work well? It's writing in terms of build/foldr fusion so it should fuse with a consumer like mapM_
.
What's the speed-up of the above?
I chatted a bit with @dcoutts about this. He thinks we could get the traversals to fuse, but we need the API to expose a way to not recomputing the hashes e.g. by exporting fromListWithHash
and toListWithHash
(perhaps with an unsafe
prefix).
Seems to me that a toListWithHash
should be able to fuse fine. It also seems plausible that we could write a fusable fromListWithHash
.
Hi. I badly needed Binary instance for HashMap, so I incorporated all files from unordered-containers-0.2.7.0 into my project and added the following modules, based on @jaspervdj post
module Data.HashMap.Internal
(
HashMap(..)
, Array(..)
, Leaf(..)
)
where
import Data.HashMap.Base
import Data.HashMap.Array
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE UnboxedTuples #-}
module Data.HashMap.Binary where
import Control.Monad (replicateM)
import Data.Binary
import Data.HashMap.Internal
import GHC.Exts (Int (..), indexArray#, newArray#,
sizeofArray#, unsafeFreezeArray#,
writeArray#, (+#), (<#))
import GHC.Generics (Generic)
import GHC.ST (ST (..), runST)
deriving instance Generic (Leaf k v)
deriving instance Generic (HashMap k v)
instance Binary a => Binary (Array a) where
put (Array arr#) = do
put (I# (sizeofArray# arr#))
go 0#
where
go i# = case i# <# sizeofArray# arr# of
0# -> return ()
_ -> case indexArray# arr# i# of
(# x #) -> put x >> go (i# +# 1#)
get = do
I# len# <- get
list <- replicateM (I# len#) get
return $ runST $ ST $ \s# ->
case newArray# len# undefined s# of
(# s1#, marr# #) -> case go 0# marr# list s1# of
s2# -> case unsafeFreezeArray# marr# s2# of
(# s3#, arr# #) -> (# s3#, Array arr# #)
where
go _ _ [] s# = s#
go i# marr# (x : xs) s# =
case writeArray# marr# i# x s# of
s'# -> go (i# +# 1#) marr# xs s'#
instance (Binary k, Binary v) => Binary (Leaf k v)
instance (Binary k, Binary v) => Binary (HashMap k v)
Let it be there as a ready-to-go temporal solution.
My main concern with this concept is that an adversary could very easily give us a bogus serialization, with hashes that don't match, etc. Is cereal
intended to be used only with trusted data? If this is for a function to be used in a trusted context only, that's fine.
We would also need this for a project. At least here, the binary data are trusted, so it would be fine and much faster than the current solution.
My PR https://github.com/haskell-unordered-containers/unordered-containers/pull/283 seems useful in the context of this issue since it exposes the internal modules. Reviews and feedback are very welcome! :)
Hi Johan,
I would like to have an efficient way of serializing (with cereal) HashMaps and HashSets. I think going through lists is inefficient because the hashes have to be recomputed (although I have not benchmarked this yet, so I could be wrong):
So ideally I would like the following instances:
However this requires access to the constructors which is only possible if:
unordered-containers
. Which unfortunately adds a dependency oncereal
.Data.HashMap.Unsafe
module. This module can be marked as{-# LANGUAGE Unsafe #-}
so that users ofSafe
modules can't use this module. Then users like me can write the (orphan) instances in their own applications.Do you have a preference?
Regards,
Bas