Does there exist an analogue of QuickCheck's `choose` function?

langfield commented 3 months ago

I've been using LeanCheck for a few months now and it's wonderful! :) However I often find myself wanting to generate an example of some kind of data structure and then sample an element from it. With LeanCheck this is pretty cumbersome, I often just generate an extra element and then add it to the data structure, but that isn't very nice because it ends up "near the boundary".

Is there something like QuickCheck's choose that will let me do this?

langfield commented 2 months ago

To motivate this a little, very often we have data invariants associated with some type, and we only want to generate valid inhabitants. With QuickCheck, we can do e.g.

data BST a = Empty | Node (BST a) a (BST a) deriving (Eq, Show)

sizedBST :: (Arbitrary a, Ord a, Random a, Enum a) => 
            a -> a -> Int -> Gen (BST a)
sizedBST lo hi 0 = return Empty
sizedBST lo hi n = 
  oneof [do
           v <- choose (lo,hi)
           l <- sizedBST lo (pred v) (n `div` 2)
           r <- sizedBST (succ v) hi (n `div` 2)
           return $ Node l v r]

which generates a valid binary search tree with values within certain bounds. (See here for the notes from which this was taken.)

I cannot figure out a good way to do this with LeanCheck. 😞

rudymatela commented 2 months ago

I've been using LeanCheck for a few months now and it's wonderful! :)

@langfield, I am glad you appreciate LeanCheck. :grin:

To motivate this a little, very often we have data invariants associated with some type, and we only want to generate valid inhabitants.

When you have data invariants, I find it best to delegate dealing with those to the smart constructors for the data type. Expanding your BST example a bit:

data BST a  =  Empty
            |  Node (BST a) a (BST a)
  deriving (Eq, Show)

-- inserts an element into a BST
insert :: Ord a => a -> BST a -> BST a
insert x Empty  =  Node Empty x Empty
insert x (Node l y r)
  | x < y  =  Node (insert x l) y r
  | x > y  =  Node l y (insert x r)
  | otherwise  =  Node l y r -- no-op for repeated elements

-- converts a list of elements into a BST
fromList :: Ord a => [a] -> BST a
fromList  =  foldr insert Empty

Naïve approach: respect data invariants with some repetitions

With insert, we can write the following Listable instance for our BST:

-- variation 1
instance (Listable a, Ord a) => Listable (BST a) where
  tiers  =  cons0 Empty \/ cons2 insert

This one respects the data invariant, all generated BSTs are valid.

With fromList, we can have an even shorter instance:

-- variation 2
instance (Listable a, Ord a) => Listable (BST a) where
  tiers  =  cons1 fromList

These two instances are equivalent and will yield the same list (and tiers) of BSTs. One disadvantage is that there are repetitions:

> take 12 list :: [BST Int]
[ Empty
, Node Empty 0 Empty
, Node Empty 0 Empty
, Node Empty 1 Empty
, Node Empty 0 Empty
, Node (Node Empty 0 Empty) 1 Empty
, Node Empty 0 (Node Empty 1 Empty)
, Node Empty (-1) Empty
, Node Empty 0 Empty
, Node (Node Empty 0 Empty) 1 Empty
, Node Empty 0 (Node Empty 1 Empty)
, Node Empty (-1) (Node Empty 0 Empty)
]

These repetitions yield from the fact that some different inputs to fromList yield the same BST: e.g. [0,2,1] and [2,0,1]:

> fromList [0,2,1]
Node (Node Empty 0 Empty) 1 (Node Empty 2 Empty)

> fromList [2,0,1]
Node (Node Empty 0 Empty) 1 (Node Empty 2 Empty)

Sets of values: respect data-invariants with no repetitions

If you want to avoid these repetitions, you can use the setCons constructor to guarantee that fromList will be applied to set-unique lists.

-- varation 3
instance (Listable a, Ord a) => Listable (BST a) where
  tiers  =  setCons fromList

This instance respect the data invariants and has no repetitions:

> take 12 list :: [BST Int]
[ Empty
, Node Empty 0 Empty
, Node Empty 1 Empty
, Node (Node Empty 0 Empty) 1 Empty
, Node Empty (-1) Empty
, Node Empty (-1) (Node Empty 0 Empty)
, Node Empty 2 Empty
, Node (Node Empty 0 Empty) 2 Empty
, Node Empty (-1) (Node Empty 1 Empty)
, Node Empty (-2) Empty
, Node Empty (-1) (Node (Node Empty 0 Empty) 1 Empty)
, Node Empty (-2) (Node Empty 0 Empty)
]

This enumeration is quite restrictive, you only get a single possible arrangement for a BST given a set of values. Depending on your application, this can be good or bad. When in doubt, I would say this is the way to go. This is what I use in leancheck-instances package to generate values of the Set datatype.

Sets of values: respect data-invariants with no repetitions and all variations of set-equivalent BSTs

If you are interested in generating all possible ways that a BST can be arranged, you can do the following:

import Test.LeanCheck
import Data.List (inits, tails, sort)

-- generates all possibilities of arranging a BST
-- from a given set of elements
possibilities :: Ord a => [a] -> [BST a]
possibilities  =  ps . sort
  where
  ps []  =  [ Empty ]
  ps xs  =  [ Node l x r
            | (ys,x:zs) <- zip (inits xs) (tails xs)
            , l  <-  ps ys
            , r  <-  ps zs
            ]

-- variation 4
instance (Listable a, Ord a) => Listable (BST a) where
  tiers  =  map concat
         .  mapT possibilities
         $  setsOf tiers

This generates unique trees in respect to structural equality:

> take 12 list :: [BST Word]
[ Empty
, Node Empty 0 Empty
, Node Empty 1 Empty
, Node Empty 0 (Node Empty 1 Empty)
, Node (Node Empty 0 Empty) 1 Empty
, Node Empty (-1) Empty
, Node Empty (-1) (Node Empty 0 Empty)
, Node (Node Empty (-1) Empty) 0 Empty
, Node Empty 2 Empty
, Node Empty 0 (Node Empty 2 Empty)
, Node (Node Empty 0 Empty) 2 Empty
, Node Empty (-1) (Node Empty 1 Empty)
]

... there are repetitions wrt. the set of elements.

Variation 4 uses setsOf and mapT which are also built-in'to LeanCheck. The weird map concat at the end is so the nested tiers of lists of BSTs ([[ [BST a] ]]) is flattened back into "regular" tiers of BSTs ([[BST a]]).

Depending on what you want, you should use something like variation 3 or variation 4. In doubt, I would go with 3 as it is simpler.

Hopefully this helps. :smile:

rudymatela commented 2 months ago

At some point in the next weeks, I think I should perhaps:

[ ] add an example illustrating this in the eg/ folder
[ ] improve the discussion on the document where I talk about data invariants
[ ] improve discussion of types with data invariants on Haddock documentation:
- [ ] on the documentation of Listable
- [ ] ... deriveListable
- [ ] ... and a minor fix on bagCons doc to make it consistent with its sister functions.

langfield commented 2 months ago

This enumeration is quite restrictive, you only get a single possible arrangement for a BST given a set of values.

Yes I came up with variation 3 myself, but opted to use QuickCheck instead because of this limitation. However, your variation 4 looks like exactly what I'm looking for!

langfield commented 2 months ago

Thanks so much for writing out such a detailed response. Is the stuff in doc/ and eg/ accessible from the hackage docs? I'll have to read them over.

Feel free to close this issue as you see fit. 👍🏻

rudymatela commented 2 months ago

Is the stuff in doc/ and eg/ accessible from the Hackage docs?

Sort of:

... but the rendering is not nice. You get the raw markdown and sources without syntax highlighting. These are better read here on GitHub:

rudymatela commented 2 months ago

... I kind of forgot to answer your original question though...

Does there exist an analogue of QuickCheck's choose function?

Not really. For now I don't think there's anything analogous to choose.

The (my :sweat_smile:) intended way to generate specialized enumerations for LeanCheck is to:

use smart constructors paired with setCons, bagCons, mapCons, setsOf, etc...; or
when 1 does not work, manipulate the enumeration directly: in LeanCheck enumeration is constructed as a simple list of lists of values: more specifically a possibly infinite list of finite lists.

You can get good enumerations with 1 and 2 in most cases. Sometimes requiring less thinking than in QC, sometimes require more thinking than in QC.

I don't think creating something analogous to choose is impossible in LeanCheck, but for it I would have to implement a monadic interface for LC first (#16). This one is quite far down in my TODO list though... :sweat_smile:

langfield commented 2 months ago

... I kind of forgot to answer your original question though...

Does there exist an analogue of QuickCheck's choose function?

Not really. For now I don't think there's anything analogous to choose.

The (my 😅) intended way to generate specialized enumerations for LeanCheck is to:
1. use smart constructors paired with `setCons`, `bagCons`, `mapCons`, `setsOf`, etc...; or

2. when 1 does not work, manipulate the enumeration directly: in LeanCheck enumeration is constructed as a simple list of lists of values: more specifically a possibly infinite list of finite lists.
You can get good enumerations with 1 and 2 in most cases. Sometimes requiring less thinking than in QC, sometimes require more thinking than in QC.

I don't think creating something analogous to choose is impossible in LeanCheck, but for it I would have to implement a monadic interface for LC first (#16). This one is quite far down in my TODO list though... 😅

Yes it seems superfluous, I hadn't realized it was quite so easy to manipulate the enumeration directly. I haven't written any Listable instances that weren't straightforward boilerplate uses of consN.

By the way, it has occurred to me that LeanCheck must "miss" values because of Cantor's theorem. For example if you were enumerating Set Int, you are essentially attempting to generate $\mathcal{P}(\mathbb{N})$, no? I searched for "diagonalization", "Cantor", and "power set" in your thesis and didn't see anything addressing this issue. Is it not a problem in practice?

(Please correct if I have made an incorrect assumption.)

langfield commented 2 months ago

Is the stuff in doc/ and eg/ accessible from the Hackage docs?

Sort of:
* https://hackage.haskell.org/package/leancheck/src/doc/

* https://hackage.haskell.org/package/leancheck/src/eg/
... but the rendering is not nice. You get the raw markdown and sources without syntax highlighting. These are better read here on GitHub:
* https://github.com/rudymatela/leancheck/tree/master/eg

* https://github.com/rudymatela/leancheck/tree/master/doc

Ah yeah I should have said "easily discoverable", haha. Although a more prudent person than myself would have read through directories with names like those!

rudymatela commented 2 months ago

By the way, it has occurred to me that LeanCheck must "miss" values because of Cantor's theorem.

Yes and no. It misses values fome some types, but not for Set Int actually...

For example if you were enumerating Set Int, you are essentially attempting to generate Powerset(N) , no?

Int is a finite set. So the powerset of Ints is a finite set too! This makes it countable/enumerable.

Of course, you probably mean Set Integer. However the Haskell Set type can ony hold finite sets. We can represent arbitrary finite sets such as Set.fromList [2,4,6], or Set.fromList [2,3,5,7] but we can never represent infinite sets such as "the set of even numbers"; "the set of primes"; or even "the full set of Integers". This is enough to make Cantor's argument not to apply in this case. (Set is even capped at maxBound::Int elements, however, just restricting instances to be finite is enough for Cantor's argument not to apply.)

The enumerations of Set Int and Set Integer are complete: provided you have infinite time (and memory) you will reach any finite set you can think of. I think even Set (Set Integer) is complete as well.

The Haskell Set is to the maths set what Double is to the real number set.

So what values does LeanCheck misses, and for which types?

LeanCheck is able to completely enumerate finite lists of integers: [Integer]. There are no holes in this enumeration when you consider only finite lists. However, with the list type, we can indeed represent infinite lists, such as the list of even positives [2,4..] or the list of odd positives [1,3..]. Infinite lists are not included in the enumeration. These are exactly the values that LeanCheck misses wrt. the diagonal argument.

LeanCheck is primarily a testing tool, so not having infinite lists being generated in the standard/default enumeration is a good thing. We want our strict properties to terminate:

prop_sort_idempotent :: [Int] -> Bool
porp_sort_idempotent xs  =  sort (sort xs) == sort xs

There are (lazy) properties that can/should require infinite lists, in these cases one could generate a class of infinite lists (such as periodic lists) using a wrapper newtype:

prop_requiring_infty :: Periodic [Int] -> Bool
prop_requiring_infty xs  =  ...

The Periodic newtype constructor is not available by default in LeanCheck, my point here is one could make one.

Another LeanCheck enumeration that misses a lot of values is the one for functions (not enabled by default but available in Test.LeanCheck.Function).

Summary of enumeration completeness:

Int: complete enumeration -- finite thus countable set
Integer: complete enumeration -- infinite but countable set
Double: "complete" enumeration -- finite thus countable set -- "complete" is in quotes here because we intentionally omit NaN and -0.
Rational: complete enumeration -- infinite but countable set
[Integer]: complete enumeration of finite list values (countable set) --- but no enumeration of infinite list values (impossible/uncountable/not too useful)
etc.

For recursive data types, such as lists and BSTs, enumerations are generally complete wrt. finite values. Infinite values are completely omitted.

Another example I thought of: if you are generating AVL trees, it is possible to generate all of them as the data invariant requires them to have a finite height.

... good question btw. I probably should include a discussion of this in LeanCheck's documentation at some point. For now at least it is here on the issue tracker. :smile:

langfield commented 2 months ago

However the Haskell Set type can ony hold finite sets.

Ahhhh that's quite an oversight on my part! Thanks for this explanation, this totally clears everything up. The system is much more complete than I'd hoped :)

rudymatela / leancheck