Open joeyh opened 8 years ago
Yes, I think that's a sensible change. I wonder if Context should have the same treatment
I'm a bit puzzled actually. Just was looking at the code, and trying couple of things to try to reproduce but just couldn't; is the problem not in the part that transform a Digest x -> String
and lack of evaluation of String
?
Vincent Hanquez wrote:
I'm a bit puzzled actually. Just was looking at the code, and trying couple of things to try to reproduce but just couldn't; is the problem not in the code that transform a Digest x -> String ?
I had some difficulty reproducing it too, only saw the problem on some systems and not others, and only in some use patterns (comparison with known checksums).
I don't see how Digest x -> String could need 3 terabytes of memory, so I assume the memory use is in a partially evaluated Digest keeping references to parts of the lazy ByteString, so that the whole input data remains in memory.
see shy jo
sorry for being unclear, but when I say the problem is in the Digest x -> String
part, I meant it's actually not forcing any evaluation of the Digest anyway unless you're actually looking at least at the first char which ! is not going to do.
i.e. the problem is:
!hashAsString <- md5hasher <$> L.readline foo
return hashAsString
Whereas if you write:
!hash <- md5 <$> L.readFile foo
return $ show hash
I think the !
is doing the right thing in the second case, and only the digest -> String
would be done lazily. Whereas in the first case, the ! on a String would do nothing.
although, it doesn't quite explain needing so much memory, maybe some context are not folded properly to the next one..
Yes, part of the problem was indeed that bang doesn't fully force String. It forced it enough to fix my FD-remaining-open problem, but not enough to avoid buffering the file in ram. While that was my mistake to make, it wouldn't have mattered that I made that mistake if Digest was itself strict.
Also, it's not really clear from the public API whether bang fully forces a Digest. newtype Digest a = Digest Bytes, and the implementation of Bytes is not exposed so it's difficult to reason about how to force it, other than using the handy NFData instance of course.
see shy jo
Digest can be partially evaluated, which presents several gotchas.
I had some simple code using cryptonite to get a hash:
This worked ok, running in constant space. But, it kept the file handle open until some later point when the hash was used (or possibly open past use until garbage collected). That turned out to prevent eg, deleting the file under windows (which doesn't allow open files to be deleted), which broke my test suite. So, I made what seemed like an innocuous change to "fix" that:
But this version sometimes buffers the whole file content in memory. The bang pattern doesn't fully force the String, and so the Digest isn't fully forced either, and the hash can end up partially evaluated. In one case, hashing a 30 gb file resulted in a 30 gb malloc!
So, I had to do this (or could have used the NFData instance for Digest for same effect):
Which I think will avoid all problems, but was not easy to arrive at.
From my brief review of the API, I don't see any benefits to letting Digest be partially evaluated. So why not make the data type internally strict, or have hashLazy fully force it.