haskell-unordered-containers / hashable

A class for types that can be converted to a hash value
BSD 3-Clause "New" or "Revised" License
103 stars 86 forks source link

Caching hash value sometimes leads to equal hashes for different inputs #275

Closed sergv closed 4 months ago

sergv commented 1 year ago

I believe this issue is related to #270 but I didn’t want to hijack the discussion there.

Problem

I wrote a program that constructs ASTs where each layer stores hash of the underlying subtree. With explicit sharing the tree can get exponentially large so hash must be cached, recomputing it will take too long.

The problem is that hash for tree Const 1 and Negate (Negate (Const 1)) is the same. It reproduces both with my hand-written wrapper that stores hash value (HashFix) and for the wrapper that uses the Hashed type (HashedFix).

I’ve reproduced the hash computations that take place when hashable runs. According to my analysis the problem comes from the hashInt function that combines hashes of subtrees to produce hash of whole node. It’s defined as

hashInt salt bytes = (salt * magicPrime) `xor` bytes

For recomputing hashes from scratch it serves well because salt gets passed around but when hash value is cached it breaks the dependency between salt and the hash value (hash is computed with default salt). Through Hashed the hash value goes into bytes argument, thus successive Hashed applications on different AST levels will lead to a computation like

saltLevel1 `xor` saltLevel2 `xor` base hash

If both AST levels are the same constructor then saltLevel1 will be equal to `saltLevel2 and they will cancel out due to xor.

I reproduced detailed computations in the program, please take a look.

The full program:

#!/usr/bin/env cabal
{- cabal:
build-depends:
  , base
  , hashable
-}

{-# LANGUAGE DeriveGeneric              #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE StandaloneDeriving         #-}
{-# LANGUAGE UndecidableInstances       #-}

module Main where

import Data.Bits
import Data.Hashable
import GHC.Generics

import Data.Hashable

data ExprF a
  = Add a a
  | Negate a
  | Increment a
  | Const !Int
  deriving (Eq, Ord, Show, Generic)

instance Hashable a => Hashable (ExprF a)

data HashFix f = HashFix
  { hfVal  :: f (HashFix f)
  , hfHash :: !Int
  }

mkFix :: Hashable (f (HashFix f)) => f (HashFix f) -> HashFix f
mkFix x = HashFix
  { hfVal  = x
  , hfHash = hash x
  }

deriving instance Eq  (f (HashFix f)) => Eq  (HashFix f)
deriving instance Ord (f (HashFix f)) => Ord (HashFix f)

instance Hashable (f (HashFix f)) => Hashable (HashFix f) where
  hashWithSalt salt = hashWithSalt salt . hash
  hash = hfHash

newtype HashedFix f = HashedFix
  { unHashedFix :: Hashed (f (HashedFix f))
  }

deriving instance Hashable (f (HashedFix f)) => Hashable (HashedFix f)

mkHashedFix :: Hashable (f (HashedFix f)) => f (HashedFix f) -> HashedFix f
mkHashedFix = HashedFix . hashed

deriving instance Eq  (f (HashedFix f)) => Eq  (HashedFix f)
deriving instance Ord (f (HashedFix f)) => Ord (HashedFix f)

main :: IO ()
main = do
  do
    let base = mkFix $ Const 1
        e1   = mkFix $ Negate base
        e2   = mkFix $ Negate e1

    putStrLn $ "hash base            = " ++ show (hash base)
    putStrLn $ "hash e1              = " ++ show (hash e1)
    putStrLn $ "hash e2              = " ++ show (hash e2)
    putStrLn $ "hash base == hash e2 = " ++ show (hash base == hash e2)

  do
    let base = mkHashedFix $ Const 1
        e1   = mkHashedFix $ Negate base
        e2   = mkHashedFix $ Negate e1

    putStrLn $ "hash base            = " ++ show (hash base)
    putStrLn $ "hash e1              = " ++ show (hash e1)
    putStrLn $ "hash e2              = " ++ show (hash e2)
    putStrLn $ "hash base == hash e2 = " ++ show (hash base == hash e2)

  let defaultSalt :: Int
      defaultSalt = -3750763034362895579

      -- Magic prime
      p :: Int
      p = 1099511628211

      constHash :: Int -> Int
      constHash x = (xor ((*) (xor (defaultSalt * p) 3) p) x)

      baseHash :: Int
      baseHash = constHash 1

      negateHash1 :: Int
      negateHash1 = hashWithSalt (xor (defaultSalt * p) 1) baseHash

      negateHash2 :: Int
      negateHash2 = hashWithSalt (xor (defaultSalt * p) 1) negateHash1

      hashInt s x = (s * p) `xor` x -- (x * p) `xor` x <- suggested improvement

      negateHash1Detailed :: Int
      negateHash1Detailed = hashInt (xor (defaultSalt * p) 1) baseHash

      negateHash2Detailed :: Int
      negateHash2Detailed = hashInt (xor (defaultSalt * p) 1) negateHash1

      negateHash1Detailed' :: Int
      negateHash1Detailed' = ((xor (defaultSalt * p) 1) * p) `xor` baseHash

      negateHash2Detailed' :: Int
      negateHash2Detailed' = ((xor (defaultSalt * p) 1) * p) `xor` negateHash1

      negateHash2Detailed'' :: Int
      negateHash2Detailed'' =
        ((xor (defaultSalt * p) 1) * p) `xor` ((xor (defaultSalt * p) 1) * p) `xor` baseHash

  putStrLn $ "baseHash              = " ++ show baseHash
  putStrLn $ "negateHash1           = " ++ show negateHash1
  putStrLn $ "negateHash2           = " ++ show negateHash2

  putStrLn $ "negateHash1Detailed   = " ++ show negateHash1Detailed
  putStrLn $ "negateHash2Detailed   = " ++ show negateHash2Detailed

  putStrLn $ "negateHash1Detailed'  = " ++ show negateHash1Detailed'
  putStrLn $ "negateHash2Detailed'  = " ++ show negateHash2Detailed'

  putStrLn $ "negateHash2Detailed'' = " ++ show negateHash2Detailed''

  putStrLn $ "baseHash == negateHash2Detailed'' = " ++ show (baseHash == negateHash2Detailed'')

Output:

hash base            = 590680769285548757
hash e1              = 2199023256815
hash e2              = 590680769285548757
hash base == hash e2 = True
hash base            = 590680769285548757
hash e1              = 2199023256815
hash e2              = 590680769285548757
hash base == hash e2 = True
baseHash              = 590680769285548757
negateHash1           = 2199023256815
negateHash2           = 590680769285548757
negateHash1Detailed   = 2199023256815
negateHash2Detailed   = 590680769285548757
negateHash1Detailed'  = 2199023256815
negateHash2Detailed'  = 590680769285548757
negateHash2Detailed'' = 590680769285548757
baseHash == negateHash2Detailed'' = True

Proposed solution

Switcth from FNV-1, which does this

hashInt s x = (s * p) `xor` x

to FNV-1a which swaps salt and bytes to be hashed:

hashInt s x = (x * p) `xor` s

Thus nested applications of Hashed won’t go into the xor part.

phadej commented 1 year ago

In a branch of #273, I get

hash base == hash e2 = False
hash base == hash e2 = False

So AFAICS it will resolve this issue as well.


Also, we don't need to change anything in hashable to fix this particular problem. The issue is in instance of HashFix, so you can write different one, e.g.

instance Hashable (f (HashFix f)) => Hashable (HashFix f) where
  hashWithSalt salt x = (salt * hash x) `xor` somePrime

and it will work better.


Hashed wasn't considered for nested hashing (nor FNV-1 to begin with AFAIK), so I'd say it's a missing feature instead of a concrete bug.

Something similar happens in in PRNG, most PRNGs are not splittable and you cannot just turn ordinary PRNG into splittable one (but if you have very good one e.g. cryptographically strong, then you can quite easily in fact). So here we'd need a hash algorithm which explicitly considers usage in Merkle-tree like applications. I'm aware only of cryptographic hashes being good for that. I'm unaware of "fast" hashes designed with Merkle-tree usage in mind.

As you cache the hash anyway, I'd consider using e.g. cryptohash-sha256 (with maybe wide-words Word256 for a digest) for merkle tree of your AST, and return e.g. lowest word for Hashable instance. There's also other benefits which you can rip from that (as SHA256 is collision resistance, you can use it for fast equality, you cannot ever use hashable for that).