haskell-crypto / cryptonite

lowlevel set of cryptographic primitives for haskell
Other
226 stars 139 forks source link

make Digest a strict datatype? #70

Open joeyh opened 8 years ago

joeyh commented 8 years ago

Digest can be partially evaluated, which presents several gotchas.

I had some simple code using cryptonite to get a hash:

md5 :: L.ByteString -> Digest MD5
md5 = hashLazy

md5hasher :: L.ByteString -> String
md5hasher = show . md5

md5hasher <$> L.readFile foo

This worked ok, running in constant space. But, it kept the file handle open until some later point when the hash was used (or possibly open past use until garbage collected). That turned out to prevent eg, deleting the file under windows (which doesn't allow open files to be deleted), which broke my test suite. So, I made what seemed like an innocuous change to "fix" that:

!hash <- md5hasher <$> L.readFile foo
return hash

But this version sometimes buffers the whole file content in memory. The bang pattern doesn't fully force the String, and so the Digest isn't fully forced either, and the hash can end up partially evaluated. In one case, hashing a 30 gb file resulted in a 30 gb malloc!

So, I had to do this (or could have used the NFData instance for Digest for same effect):

!hash <- md5hasher <$> L.readFile foo
return (length hash `seq` hash)

Which I think will avoid all problems, but was not easy to arrive at.

From my brief review of the API, I don't see any benefits to letting Digest be partially evaluated. So why not make the data type internally strict, or have hashLazy fully force it.

vincenthz commented 8 years ago

Yes, I think that's a sensible change. I wonder if Context should have the same treatment

vincenthz commented 8 years ago

I'm a bit puzzled actually. Just was looking at the code, and trying couple of things to try to reproduce but just couldn't; is the problem not in the part that transform a Digest x -> String and lack of evaluation of String ?

joeyh commented 8 years ago

Vincent Hanquez wrote:

I'm a bit puzzled actually. Just was looking at the code, and trying couple of things to try to reproduce but just couldn't; is the problem not in the code that transform a Digest x -> String ?

I had some difficulty reproducing it too, only saw the problem on some systems and not others, and only in some use patterns (comparison with known checksums).

I don't see how Digest x -> String could need 3 terabytes of memory, so I assume the memory use is in a partially evaluated Digest keeping references to parts of the lazy ByteString, so that the whole input data remains in memory.

see shy jo

vincenthz commented 8 years ago

sorry for being unclear, but when I say the problem is in the Digest x -> String part, I meant it's actually not forcing any evaluation of the Digest anyway unless you're actually looking at least at the first char which ! is not going to do.

i.e. the problem is:

 !hashAsString <- md5hasher <$> L.readline foo
 return hashAsString

Whereas if you write:

!hash <- md5 <$> L.readFile foo
return $ show hash

I think the ! is doing the right thing in the second case, and only the digest -> String would be done lazily. Whereas in the first case, the ! on a String would do nothing.

vincenthz commented 8 years ago

although, it doesn't quite explain needing so much memory, maybe some context are not folded properly to the next one..

joeyh commented 8 years ago

Yes, part of the problem was indeed that bang doesn't fully force String. It forced it enough to fix my FD-remaining-open problem, but not enough to avoid buffering the file in ram. While that was my mistake to make, it wouldn't have mattered that I made that mistake if Digest was itself strict.

Also, it's not really clear from the public API whether bang fully forces a Digest. newtype Digest a = Digest Bytes, and the implementation of Bytes is not exposed so it's difficult to reason about how to force it, other than using the handy NFData instance of course.

see shy jo