rudymatela / leancheck

enumerative property-based testing for Haskell
https://hackage.haskell.org/package/leancheck

Document limitation: intense memory use #15

Closed JonasDuregard closed 4 years ago

JonasDuregard commented 5 years ago

When running a test like this I would expect constant memory usage:

import Test.LeanCheck

main = do
  checkFor 1000000 evenorodd

evenorodd :: [Int] -> Bool
evenorodd xs = all (\x -> odd x || even x) xs

I would expect it to test a value, release it for garbage collection and move on to the next, using about the same amount of memory regardless of how many tests are done.

Instead, it seems to retain all tested values in memory throughout the test (and in subsequent tests, if those are added, until the program is terminated).

This is due to the types of list and tiers (both are lists). Basically, GHC will never garbage collect any part of those lists, resulting in severe memory leaks and possibly slower tests due to all the memory allocation (possibly offset by the advantage of not reconstructing shared values, depending on the types involved).

I suspect this is quite problematic for programs involving a large number of types, since typically it would use a linear amount of memory for each type.

The only way I know of resolving this is changing the type of the class members.
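The retention described above can be illustrated with a minimal, hypothetical class (this is a sketch of the mechanism, not LeanCheck's actual code): a class member of type [a] is a constant applicative form (CAF) held by the instance dictionary, so any prefix forced during testing stays reachable and is never collected while the dictionary is.

```haskell
-- Hypothetical class, for illustration only (not LeanCheck's code).
class MyListable a where
  myList :: [a]      -- one retained top-level list per instance

instance MyListable Int where
  myList = [0 ..]    -- the forced prefix lives as long as the CAF does

main :: IO ()
main = print (take 5 (myList :: [Int]))
```

Once a test forces, say, a million elements of `myList`, that million-element spine remains in memory even after the test finishes.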

jwaldmann commented 5 years ago

Yes (keeping the list in memory - not GC-able). I always thought of this as a feature (memoization) of this library that's especially helpful when enumerating trees (or the like). I don't see how the choice (memoize or not) could be moved to the user. Perhaps

class Listable a where tier :: Int -> [a]
tree_tiers :: [[Tree]]
instance Listable Tree where tier s = tree_tiers !! s

but does this work with the combinators? (Can we still write cons2 where the result is memoized?)

rudymatela commented 5 years ago

TL;DR (1): When set to run a large number of tests, LeanCheck is more memory intensive than Feat. TL;DR (2): GC does run during usage of LeanCheck to clear up tiers/list, just not that often.

I would expect it to test a value, release it for garbage collection and move on to the next, using about the same amount of memory regardless of how many tests are done.

The way LeanCheck is implemented causes it to be "chunky" with regards to memory usage. As it goes on to test a new tier, a huge chunk of memory is allocated for it. So memory use actually increases with the number of tests whenever a new tier is reached. I was aware of this, and it is briefly mentioned in the FAQ in the comparison to Feat, under memory consumption. Perhaps this should be more prominent in the docs? :thinking:
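The "chunky" growth can be sketched from scratch (this is an illustration of tier-based enumeration in general, not LeanCheck's internals): with both Bools available at size 0 and each (:) costing 1, tier n of [Bool] holds 2^n lists, so each newly reached tier roughly doubles the allocation.

```haskell
-- A from-scratch sketch of tier-based enumeration, not LeanCheck's code.
boolTiers :: [[Bool]]
boolTiers = [False, True] : repeat []   -- both Bools at size 0

-- Convolution product of two infinite lists of tiers:
-- tier n of the result combines element tiers i and n - i.
productWith :: (a -> b -> c) -> [[a]] -> [[b]] -> [[c]]
productWith f xss yss =
  [ concat [ f <$> (xss !! i) <*> (yss !! (n - i)) | i <- [0 .. n] ]
  | n <- [0 ..] ]

-- Tiers of lists: the empty list at size 0, each (:) costs 1.
listTiers :: [[a]] -> [[[a]]]
listTiers ts = [[]] : productWith (:) ts (listTiers ts)

main :: IO ()
main = print (map length (take 7 (listTiers boolTiers)))
```

Running this prints the tier sizes [1,2,4,8,16,32,64]: the last tier reached always accounts for about half of all values enumerated so far, which is why memory use jumps whenever a new tier starts.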

Instead, it seems to retain all tested values in memory throughout the test

Sort of: I think this happens only for the values of the last tier.

(and in subsequent tests if those are added, until the program is terminated).

This is due to the types of list and tiers (both are lists). Basically, GHC will never garbage collect any part of those lists, resulting in severe memory leaks and possibly slower tests due to all the memory allocation.

I suspect this is quite problematic for programs involving a large number of types, since typically it would use a linear amount of memory for each type.

I don't think that is actually true. Consider the following program based on your example:

import Test.LeanCheck
import Data.Word
import Data.Int

main = do
  checkFor 1000000 (evenorodd :: [Int] -> Bool)

evenorodd :: Integral a => [a] -> Bool
evenorodd xs = all (\x -> odd x || even x) xs

It uses about 600 MB of memory at its peak "resident set size":

$ /usr/bin/time -f"%es  %M Kb" ./example
+++ OK, passed 1000000 tests.
3.76s  615948 Kb

Now let's include a bunch of other types in the program:

import Test.LeanCheck
import Data.Word
import Data.Int

main = do
  checkFor 1000000 (evenorodd :: [Int] -> Bool)
  checkFor 1000000 (evenorodd :: [Word] -> Bool)
  checkFor 1000000 (evenorodd :: [Integer] -> Bool)
  checkFor 1000000 (evenorodd :: [Word8] -> Bool)
  checkFor 1000000 (evenorodd :: [Int8] -> Bool)
  checkFor 1000000 (evenorodd :: [Word16] -> Bool)
  checkFor 1000000 (evenorodd :: [Int16] -> Bool)
  checkFor 1000000 (evenorodd :: [Word32] -> Bool)
  checkFor 1000000 (evenorodd :: [Int32] -> Bool)
  checkFor 1000000 (evenorodd :: [Word64] -> Bool)
  checkFor 1000000 (evenorodd :: [Int64] -> Bool)

evenorodd :: Integral a => [a] -> Bool
evenorodd xs = all (\x -> odd x || even x) xs

If memory use increased linearly with the number of types, we would expect to see something around 6 GB for peak memory usage, but we get around the same amount, 600 MB:

$ /usr/bin/time -f"%es  %M Kb" ./example
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
+++ OK, passed 1000000 tests.
41.68s  615192 Kb

(Changing to -O0 does not seem to make a difference.)

This shows that GC does trigger to collect list and tiers during the course of the program. And I do remember having test programs with a non-trivial number of test types that were fine with regards to memory usage.

Does the above make sense?

@JonasDuregard Perhaps you have another example that shows the leak, where using new types causes a linear increase in memory consumption?

rudymatela commented 5 years ago

Incidentally, LeanCheck memory use does increase with the configured number of test values, to the point that for some types, when you get to millions of tests (give or take a couple of orders of magnitude), you run out of memory. For the program you provided above, just setting it to 10 000 000 (ten million) tests makes it run out of memory in my environment. (This is a known fundamental issue with LC.)

From the point of view of day-to-day software testing, setting the number of tests between 500 and 10 000 normally solves the problem for me. I have seen only a few cases where I had to go over 10 000 tests to actually catch a bug in a property (small scope hypothesis), and in those I didn't run out of memory. So in typical usage, LeanCheck memory usage shouldn't be an issue.

I know of one case where LeanCheck wasn't able to find a real counterexample. John Hughes brought it up right after my thesis defence/viva and challenged me to try to make LeanCheck find it. I then set up a dummy property for which the failing test case would yield False. LeanCheck ran out of memory just short of actually getting to the bug. I may add this example to LeanCheck's eg folder at some point if I ever get the time.

I'll try to make the memory usage of LeanCheck more prominent in the documentation.

JonasDuregard commented 5 years ago

Hmm. It seems in compiled code GHC will garbage collect dictionaries when it is sure they will not be used again. Repeating tests with the same type will make it retain them:

main = do
  checkFor 1000000 (evenorodd :: [Int] -> Bool)
  checkFor 1000000 (evenorodd :: [Word] -> Bool)
  checkFor 1000000 (evenorodd :: [Integer] -> Bool)
  checkFor 1000000 (evenorodd :: [Word8] -> Bool)
  checkFor 1000000 (evenorodd :: [Int8] -> Bool)
  checkFor 1000000 (evenorodd :: [Word16] -> Bool)
  checkFor 1000000 (evenorodd :: [Int16] -> Bool)
  checkFor 1000000 (evenorodd :: [Word32] -> Bool)
  checkFor 1000000 (evenorodd :: [Int32] -> Bool)
  checkFor 1000000 (evenorodd :: [Word64] -> Bool)
  checkFor 1000000 (evenorodd :: [Int64] -> Bool)
  getLine -- Here we are using a lot of memory
  checkFor 1000000 (evenorodd :: [Int] -> Bool)
  checkFor 1000000 (evenorodd :: [Word] -> Bool)
  checkFor 1000000 (evenorodd :: [Integer] -> Bool)
  checkFor 1000000 (evenorodd :: [Word8] -> Bool)
  checkFor 1000000 (evenorodd :: [Int8] -> Bool)
  checkFor 1000000 (evenorodd :: [Word16] -> Bool)
  checkFor 1000000 (evenorodd :: [Int16] -> Bool)
  checkFor 1000000 (evenorodd :: [Word32] -> Bool)
  checkFor 1000000 (evenorodd :: [Int32] -> Bool)
  checkFor 1000000 (evenorodd :: [Word64] -> Bool)
  checkFor 1000000 (evenorodd :: [Int64] -> Bool)

This is of course rather contrived, but I wonder what would happen on a large interconnected set of types like the AST of Template Haskell or haskell-src-exts: would a single test involving all the types make it retain a list for each of them?

Regarding the documentation: not sure it's a problem; it's more that I didn't actually read it :).

@jwaldmann

Yes (keeping the list in memory - not GC-able). I always thought of this as a feature (memoization) of this library that's especially helpful when enumerating trees (or the like). I don't see how the choice (memoize or not) could be moved to the user.

I don't have data to support it, but I think the memoization might actually slow testing down here: heap memory allocation is pretty expensive, and for simple properties it may be avoided altogether by GHC's fancy optimizations.

For more complicated properties the data generation is likely to be a very small percentage of the total runtime cost.

Perhaps

class Listable a where tier :: Int -> [a]
tree_tiers :: [[Tree]]
instance Listable Tree where tier s = tree_tiers !! s

That is basically what I did in Values: https://hackage.haskell.org/package/size-based-0.1.2.0/docs/Control-Enumerable-Values.html

It would be interesting to compare the speed of those approaches for some examples, maybe I'll do it if I have time.

jwaldmann commented 5 years ago

@rudymatela

one case LeanCheck wasn't able to find a real counterexample.

I'd like to see that! Perhaps it makes an instructive example that I can use in teaching. Of course, the function "s \mapsto maximum size of smallest counter-example for property that can be described by expression of size s" will grow fast (even uncomputably fast), so these examples must exist. Will they look natural? Cf. check $ \ (Natural n) -> n /= ackermann 4 4.

@JonasDuregard

I did not know about your library. Do you have a document (publication) that explains it? Does it have provisions for fair enumeration (see #14)?

JonasDuregard commented 5 years ago

I did not know about your library. Do you have a document (publication) that explains it? Does it have provisions for fair enumeration (see #14)?

This is a bit off-topic, but size-based is a library for generators based on size, supporting lots of different testing back-ends. leancheck could be an instance of Sized (basically the Value functor I linked to earlier). Count is another nice thing you can do, efficiently predicting the number of values in each tier. lazy-search is like a size-based variant of Lazy SmallCheck.

If you want fair enumeration within a tier, FEAT would work, depending on what level of fairness you want (is it enough to shuffle the order around a bit, or do you want guarantees on certain combinations being tested early, etc.). In principle you could implement a uniformly random walk over the whole tier, but that would involve memory usage comparable to what this issue describes.

The nice thing about size-based is that you can write an instance of the Enumerable class (or derive it with Template Haskell) for a type and then generate values using any of these approaches, including leancheck if you add tiers=allValues.

If you want the technical detail of how the libraries work they are described in my thesis: http://publications.lib.chalmers.se/records/fulltext/240807/240807.pdf

rudymatela commented 5 years ago

Hmm. It seems in compiled code GHC will garbage collect dictionaries when it is sure they will not be used again. Repeating tests with the same type will make it retain them:

main = do
  checkFor 1000000 (evenorodd :: [Int] -> Bool)
  ... ... ...
  checkFor 1000000 (evenorodd :: [Int64] -> Bool)
  getLine -- Here we are using a lot of memory
  checkFor 1000000 (evenorodd :: [Int] -> Bool)
  ... ... ...
  checkFor 1000000 (evenorodd :: [Int64] -> Bool)

:scream: Oh! That one does run out of memory in my environment. Cool! :sunglasses: I'll add this example to the docs and record it as a limitation of LeanCheck.

This is of course rather contrived, but I wonder what would happen on a large interconnected set of types like the AST of Template Haskell or haskell-src-exts, would a single test involving all the types make it retain a list for each of them?

Actually, LeanCheck has no problem testing haskell-src-exts. I've used haskell-src-exts as an example of use of FitSpec (which, as you know, uses LeanCheck). It did not run out of memory, at least for the parameters I used: if I recall correctly, it was fine in the ballpark of 1 000 tests (and 1 000 mutants). I tried running it now, but that specific example does not compile anymore as haskell-src-exts changed :grimacing: breaking my automatic derivation using TH.

I have tried other stuff with non-trivial datatypes and LC works, e.g.: XMonad properties.

rudymatela commented 5 years ago

In summary, I think I won't change LC test data generation; I'll simply record the limitations in the docs. A couple of reasons:

I do really appreciate the feedback and the suggestions :smile:. Thanks @JonasDuregard and @jwaldmann!

rudymatela commented 5 years ago

one case LeanCheck wasn't able to find a real counterexample.

I'd like to see that!

@jwaldmann I'll add it as an example at some point then. But I can't promise that it'll be soon.

rudymatela commented 5 years ago

I've marked this as partially documentation improvement and partially wontfix (for the enum algorithm change).

A TODO from the above:

(to be done in the next couple months or so)

I'll keep the issue open until I solve the above.

rudymatela commented 5 years ago

Feel free to carry on the discussion if you have more comments.

jwaldmann commented 5 years ago

@rudymatela nice TODO, looking forward to its implementation.

Hughes' example (you can't find all bugs by checking small inputs): there's a similar statement in mathematics (number theory), the Law of Small Numbers, attributed to Richard Guy, cited here: https://primes.utm.edu/glossary/page.php?sort=LawOfSmall , with examples at https://web.archive.org/web/20140928192643/http://www.math.niu.edu/~rusin/known-math/96/smallnums

bfrk commented 4 years ago

I like LeanCheck very much, mainly because of its simplicity, and I would really like to use it seriously. Why does fixing the memory leak mean it has to become more complicated? What does that have to do with the type of 'list', as suggested by @JonasDuregard?

Take this stupid example, where I create an infinite list of lists and print each one.

good: mapM_ print $ take n $ [ replicate (n `mod` 10) n | n <- [0..] ]

This creates similarly sized lists to this one:

bad: mapM_ print $ take n $ list @[Int]

With no optimisations, good runs in constant space (5 MB), whereas bad uses more memory for larger n.

What am I missing here?

bfrk commented 4 years ago

Ah, never mind. I think I got it. We need to "reset" the streams to start at the beginning to get all combinations, and doing that with lazy lists means we have to keep them in memory. This is bad. I guess enumerating is really something suited to imperative programming, rather than (purely) functional programming.
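The "resetting" above can be made concrete with a small sketch (hypothetical code, not from any of the libraries discussed): enumerating pairs size by size keeps indexing back into the start of each list, so the shared spine of everything produced so far stays reachable.

```haskell
-- Hypothetical illustration: pairs enumerated by total size up to d.
pairsUpTo :: Int -> [a] -> [b] -> [(a, b)]
pairsUpTo d xs ys =
  [ (xs !! i, ys !! (n - i))   -- (!!) re-walks the retained prefixes
  | n <- [0 .. d], i <- [0 .. n] ]

main :: IO ()
main = print (pairsUpTo 2 [0 :: Int ..] "abc")
```

This prints [(0,'a'),(0,'b'),(1,'a'),(0,'c'),(1,'b'),(2,'a')]: because size n+1 revisits `xs !! 0` after size n has forced `xs !! n`, GHC cannot drop the front of either list until the whole enumeration is done.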

rudymatela commented 4 years ago

These items of the TODO are complete:

These two remain:

rudymatela commented 4 years ago

This is bad. I guess enumerating is really something suited for imperative programming, rather than (purely) functional.

@bfrk Actually, you can be purely functional and still enumerate "very late values" efficiently in terms of memory. Feat (mentioned above in this thread) does this. E.g.:

$ cabal install testing-feat
$ ghci
> import Test.Feat
> :set +s
> index 1000000000 :: [Bool]
[True,True,False,True,True,True,False,False,True,True,False,True,False,True,True,False,False,True,False,True,False,False,False,False,False,False,False,False,True]
(0.02 secs, 255,288 bytes)

So, to get the billionth value (1 000 000 000th) in the enumeration, Feat only takes about 255 KB.

LeanCheck is not really suitable for this kind of thing; it works best when dealing with just thousands of values.

Incidentally, Feat (Functional Enumeration of Algebraic Types) was created by Jonas Duregard, the creator of this very ticket.
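The counting idea that makes this possible can be sketched with a toy example (this illustrates the general principle, not Feat's actual algorithm): if we can count the values of each size, we can locate index i by subtracting tier counts and then decode the value directly, without materialising any earlier value.

```haskell
import Data.Bits (testBit)

-- Toy enumeration of [Bool] indexed by counting, for illustration only.
countAt :: Int -> Integer
countAt n = 2 ^ n            -- number of [Bool] values of length n

indexBoolList :: Integer -> [Bool]
indexBoolList = go 0
  where
    go len i
      | i < countAt len = [testBit i b | b <- [0 .. len - 1]]
      | otherwise       = go (len + 1) (i - countAt len)

main :: IO ()
main = print (length (indexBoolList 1000000000))
```

This prints 29 (the billionth list of Bools has 29 elements) and does so by skipping over whole tiers arithmetically, which is why memory use stays tiny no matter how late the value is.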

rudymatela commented 4 years ago

The following is done:

Just the following remains:

Now, where is that email... :thinking: :mag:

bfrk commented 4 years ago

Actually you can be purely functional and still do enumeration of very late values efficiently in terms of memory.

Yes. My judgement about functional vs. imperative was premature.

Unfortunately FEAT is not suitable for me because it requires Typeable.

Instead, I took a closer look at smallcheck. It also works in constant memory, even if I let it run for minutes. It is internally based on LogicT, which does most of the heavy lifting, so the generators in smallcheck are relatively straightforward. LogicT in turn uses a very clever continuation-based approach. The only problem with smallcheck is the state space explosion.

I am currently working on a fork of smallcheck where I keep the idea of implementing the generators in terms of a ReaderT over LogicT, but make it size-based rather than depth-based, and also adopt many of the ideas of leancheck to limit the growth of the "tiers". The code is not yet ready for publication, but I hope I'll get there.

rudymatela commented 4 years ago

The last item in the TODO list is complete:

one case LeanCheck wasn't able to find a real counterexample.

I'd like to see that!

@jwaldmann I'll add it as an example at some point then. But I can't promise that it'll be soon.

@jwaldmann It's six months later, but here's the example where LeanCheck goes out of memory before reaching a bug:

https://github.com/rudymatela/leancheck/blob/master/bench/dets.hs

The source is commented with the background story.

The bug is sufficiently big that even if LeanCheck didn't go out of memory, it would be impractical to reach the bug by testing. (Each test takes ~3s to run in this example in real life IIRC.)

rudymatela commented 4 years ago

Closing now as this is prominently documented in the README and in Haddock.