Small counterexample found by Hedgehog but not by LeanCheck

You may remember me saying a while ago that I added some LeanCheck tests to the dejafu testsuite, and they actually found bugs, whereas QuickCheck did not. I took that as fairly strong evidence that random value generation did not work well in that particular case.

I've just been experimenting with converting those tests to use Hedgehog, as I added some other Hedgehog tests and it seemed silly to depend on three different property testing libraries. I expected Hedgehog to do as poorly as QuickCheck, so imagine my surprise when Hedgehog found a bug triggered by a small counterexample, which LeanCheck didn't!

Here's the property and the counterexample found by Hedgehog: screenshot

And indeed it is a bug. In this function, I am missing a case (BlockedSTM _, WillSTM) -> True.

The generated values all seem pretty small to me: DepState [] [] [], 0, 0, BlockedSTM [TWrite 0], STM [TCatch [TRead 0] Nothing] []

Here's the LeanCheck property. Are the Listable instances formulated poorly? Or could it be because the LeanCheck version of the property uses (==>), which doesn't retry with new input if the predicate fails; whereas Hedgehog does?

In this case, the counter-example is just too big for LeanCheck.

Here is a mock property that simulates your bug:

prop :: DepState -> TVarId -> Id -> ThreadAction -> ThreadAction -> Bool
prop ds
     (TVarId (Id Nothing 0))
     (Id Nothing 0)
     (BlockedSTM [TWrite (TVarId (Id Nothing 0))])
     (STM [TCatch [TRead (TVarId (Id Nothing 0))] Nothing] [])
     | ds == DepState (M.fromList []) (Set.fromList []) (M.fromList [])
     = False
prop _ _ _ _ _ = True

Even with 5 000 000 tests, LeanCheck does not find the bug:

> checkFor 5000000 prop
+++ OK, passed 5000000 tests.

If we simplify the property to take only the ThreadAction arguments

prop45 :: ThreadAction -> ThreadAction -> Bool
prop45 (BlockedSTM [TWrite (TVarId (Id Nothing 0))])
       (STM [TCatch [TRead (TVarId (Id Nothing 0))] Nothing] []) = False
prop45 _ _ = True

LeanCheck reports:

> checkFor 5000000 prop45
*** Failed! Falsifiable (after 3877262 tests):
(BlockedSTM [TWrite 0]) (STM [TCatch [TRead 0] Nothing] [])

It needed 3 877 262 tests. The tuple (BlockedSTM [TWrite 0], STM [TCatch [TRead 0] Nothing] []) appears as the 3877262th element in the (ThreadAction, ThreadAction) Listable enumeration. It has size 3+5=8. I conjecture that for the 5-argument property, LeanCheck will take between 10 to 100 million tests to find the bug (you'll probably run out of memory before then).

> take 9 $ map length (tiers :: [[(ThreadAction, ThreadAction)]])
[121,748,2168,4998,13022,44300,193051,976794,5354127]

The enumeration of ThreadActions grows too fast because the datatype is too wide. The blow-up is kind of inevitable and not the fault of your Listable instances. For the type of 5-tuples of the full property arguments, the tier of size 8 has 60 644 874 elements!

> take 9 $ map length $ (tiers :: [[(DepState, TVarId, Id, ThreadAction, ThreadAction)]])
[121,1474,9923,51149,229135,949572,3786689,14990753,60644874]

Can the Listable instances be improved?

Not to the point of finding the bug. (Unless you cheat by omitting less relevant constructors on the definition of Listable ThreadAction.)

You can rewrite Listable DepState without using >< and nested tuples:

instance Listable SCT.DepState where
  tiers = mapT (\(a,b,c) -> SCT.DepState a b c) tiers

This is just aesthetic, the result is the same.

Instances for ThreadId and others could have been written as:

instance Listable D.ThreadId where
  tiers = reset $ cons1 D.ThreadId

Again, just aesthetic, the result is the same. cons1 adds 1 to size and reset removes it.

Your definition of Listable Set has repetitions, as you are constructing sets from lists. [0,1] and [1,0] will yield the same set. LeanCheck provides setCons and setsOf which are useful when building sets. An improved listable instance without repetitions is given below:

instance (Ord v, Listable v) => Listable (Set v) where
  tiers = setCons Set.fromList
  -- or alternatively: tiers = mapT Set.fromList $ setsOf tiers

It does make quite a significant difference to the number of DepStates in each tier:

> take 12 $ map length $ (tiers :: [[DepState]]) -- with repetitions
[1,4,16,59,206,695,2287,7387,23524,74090,231323,717212]

> take 12 $ map length $ (tiers :: [[DepState]]) -- without repetitions
[1,4,15,54,183,604,1952,6212,19543,60940,188705,581103]

A similar thing could be done to improve the Map instance. However, LeanCheck does not provide something readily available to build Maps. This is something I may add in the near future.

I was thinking about this. Here is some extra information.

--- Quick-and-dirty function to check the size of a Listable value
size :: (Listable a, Eq a) => a -> Int
size x = fromJust $ L.findIndex (x `elem`) tiers

> size $ DepState (M.fromList []) (Set.fromList []) (M.fromList [])
0
> size $ TVarId (Id Nothing 0)
0
> size $ Id Nothing 0
0
> size $ BlockedSTM [TWrite (TVarId (Id Nothing 0))]
3
> size $ STM [TCatch [TRead (TVarId (Id Nothing 0))] Nothing] []
5

So, the total size of the counterexample is 8.

Given that

> take 9 $ map length $ (tiers :: [[(DepState,TVarId,Id,ThreadAction,ThreadAction)]])
[121,1474,9923,51149,229135,949572,3786689,14990753,60644874]

The counterexample will appear between 20 018 817 and 80 663 690 tests.

If we use the Set enumeration without repetitions:

> take 9 $ map length $ (tiers :: [[(DepState,TVarId,Id,ThreadAction,ThreadAction)]])
[121,1474,9802,49554,217375,883594,3470014,13608974,54942235]

The counterexample will appear between 18 240 909 and 73 183 143. Not a big difference. By defining Listable Map without repetitions, this number should go down, but I don't think it will be enough to find the counterexample without running out of memory.

A slightly smaller counterexample is if the last parameter is STM [TRead (TVarId (Id Nothing 0))] [], the TCatch isn't necessary. What's important is that it's two transactions which overlap in TVars, where the first blocks.

I'm a little surprised Hedgehog didn't find this smaller one, but maybe the default number of shrinking steps allowed is too low.

Hey @barrucadu, LeanCheck actually finds a few counterexamples to DejaFu's property when configured to test up to 2 500 000 test cases:

$ git clone git@github.com:barrucadu/dejafu.git
$ cd dejafu
$ git checkout e8634810765c182af96cf85dd2f92c6c34787994
$ cd dejafu-tests
$ cabal sandbox add-source ../dejafu
$ cabal sandbox add-source ../hunit-dejafu
$ cabal sandbox add-source ../tasty-dejafu
$ cabal install --only-dependencies
$ vi Common.hs # then change 2500 to 2500000
$ cabal build
$ ./dist/build/dejafu-tests/dejafu-tests --select-tests="dependent ==> dependent"
Test Cases:
  Properties:
    SCT:
      :dependent ==> dependent': [Failed]
Failed for (DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TRead 0]) (STM [TWrite 0] [])
Failed for (DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TRead 0]) (BlockedSTM [TWrite 0])
Failed for (DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TWrite 0]) (STM [TRead 0] [])
Failed for (DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TWrite 0]) (STM [TWrite 0] [])
Failed for (DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TWrite 0]) (BlockedSTM [TRead 0])
Failed for (DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TWrite 0]) (BlockedSTM [TWrite 0])

         Test Cases  Total      
 Passed  0           0          
 Failed  1           1          
 Total   1           1

(I assumed the bug is fixed on master/HEAD, so I checked out the changeset you quoted earlier.)

I also ran all other properties with 1 500 000 tests, and I only get errors on "dependent ==> dependent'".

As you can see above, I had to actually change the source code to configure the number of tests. However there are a few command line switches on dejafu-tests that could potentially be used by the leancheck test-framework/hunit test driver to make it easier to activate an "extensive test" mode.

$ ./dist/build/dejafu-tests/dejafu-tests --help
Usage: dejafu-tests [OPTIONS]
                   --help                                       show this help message
  -j NUMBER        --threads=NUMBER                             number of threads to use to run tests
                   --test-seed=NUMBER|random                    default seed for test random number generator
  -a NUMBER        --maximum-generated-tests=NUMBER             how many automated tests something like QuickCheck should try, by default
                   --maximum-unsuitable-generated-tests=NUMBER  how many unsuitable candidate tests something like QuickCheck should endure before giving up, by default
  -s NUMBER        --maximum-test-size=NUMBER                   to what size something like QuickCheck should test the properties, by default
  -d NUMBER        --maximum-test-depth=NUMBER                  to what depth something like SmallCheck should test the properties, by default
  -o NUMBER        --timeout=NUMBER                             how many seconds a test should be run for before giving up, by default
                   --no-timeout                                 specifies that tests should be run without a timeout, by default
  -l               --list-tests                                 list available tests but don't run any; useful to guide subsequent --select-tests
  -t TEST-PATTERN  --select-tests=TEST-PATTERN                  only tests that match at least one glob pattern given by an instance of this argument will be run
                   --jxml=FILE                                  write a JUnit XML summary of the output to FILE
                   --jxml-nested                                use nested testsuites to represent groups in JUnit XML (not standards compliant)
                   --plain                                      do not use any ANSI terminal features to display the test run
                   --color                                      use ANSI terminal features to display the test run
                   --hide-successes                             hide sucessful tests, and only show failures

The switches are -a, -s or -d. To make it easy to try more tests with LeanCheck, I could try to implement the use of these switches (by reading the global hunit/test-framework config), but I am not familiar with hunit/test-framework to do it on the spot. Maybe I'll do that one day if I get the time.

I have added a mapCons function to LeanCheck. You can use it to build a Listable instance for maps like so:

instance (Ord k, Listable k, Listable v) => Listable (Map k v) where
  tiers = mapCons M.fromList

It avoid repetitions. So counterexamples should appear earlier (an in fact they do).

Using your original Listable definitions for Set and Map, your quoted counterexample appears as the 1 340 809 th test:

> check prop
*** Failed! Falsifiable (after 1340809 tests):
(DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TWrite 0]) (STM [TRead 0] [])

After changing the definitions to use setCons and mapCons, it appears as the 842 017 th test:

> check prop
*** Failed! Falsifiable (after 842017 tests):
(DepState {depCRState = fromList [], depMVState = fromList [], depMaskState = fromList []}) 0 0 (BlockedSTM [TWrite 0]) (STM [TRead 0] [])

Using these improved instances and setting LeanCheck to run (¿just?) a million tests is sufficient to find the bug.

I plan to release a new version of LeanCheck shipped with mapCons soon. For now, it is found on the master branch.

rudymatela / leancheck

Small counterexample found by Hedgehog but not by LeanCheck #4

Can the Listable instances be improved?