nick8325 / quickcheck

Automatic testing of Haskell programs.

Add maxFailPercent argument. #239

Closed stevana closed 5 months ago

stevana commented 5 years ago

I've been reading about statistical testing in the Cleanroom software engineering literature recently. Overall their approach is close to what QuickCheck does (especially when paired with state machine testing), but it differs in that they see the results of testing as a measure of quality (e.g. the program is 90% reliable with 95% confidence), not as a proof of correctness (i.e. success or failure).

Inspired by this, I've added a maxFailPercent argument to Args which works like this:

> quickCheckWith stdArgs { maxFailPercent = 10 } $ forAll (choose (0, 10)) $ \i -> i `elem` [0..9]
*** Failed! Falsifiable (after 16 tests):  
10
*** Failed! Falsifiable (after 36 tests):  
10
*** Failed! Falsifiable (after 48 tests):  
10
*** Failed! Falsifiable (after 52 tests):  
10
*** Failed! Falsifiable (after 69 tests):  
10
*** Failed! Falsifiable (after 74 tests):  
10
*** Failed! Falsifiable (after 80 tests):  
10
*** Failed! Falsifiable (after 82 tests):  
10
+++ OK, passed 92 tests; 8 failed (8%).

--

> quickCheckWith stdArgs { maxFailPercent = 10 } $ forAll (choose (0, 10)) $ \i -> i `elem` [0..9]
*** Failed! Falsifiable (after 5 tests):  
10
*** Failed! Falsifiable (after 7 tests):  
10
*** Failed! Falsifiable (after 9 tests):  
10
*** Failed! Falsifiable (after 9 tests):  
10
*** Failed! Falsifiable (after 46 tests):  
10
*** Failed! Falsifiable (after 47 tests):  
10
*** Failed! Falsifiable (after 49 tests):  
10
*** Failed! Falsifiable (after 49 tests):  
10
*** Failed! Falsifiable (after 58 tests):  
10
*** Failed! Falsifiable (after 58 tests):  
10
*** Failed! Passed only 47 tests; 10 failed (10%) tests.

The default maxFailPercent is 0, which retains the old behaviour.

Thoughts?

nick8325 commented 5 years ago

This looks very nice! I'm afraid I haven't had time yet to check the code in detail, but I will do so, and I think this feature would be great to have.

Have you seen the recent support for statistically sound checking of coverage criteria in QuickCheck, namely the checkCoverage function? It would be cool if that could also be used here! For example, if we want to check that the program is 90% reliable, then a test run in which 89 out of 100 tests pass is not convincing evidence that the program is wrong.
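
As a rough, stand-alone illustration of that last point (this is not QuickCheck code, just a back-of-the-envelope calculation with made-up helper names): if each test really does pass independently with probability 0.9, a run of 100 tests still shows 11 or more failures a large fraction of the time.

-- Exact binomial coefficient, computed with Integers to avoid overflow.
binomCoeff :: Int -> Int -> Integer
binomCoeff n k =
  product [toInteger (n - k + 1) .. toInteger n] `div` product [1 .. toInteger k]

-- Probability of exactly k failures in n tests, with failure probability p.
binomPmf :: Int -> Int -> Double -> Double
binomPmf n k p = fromInteger (binomCoeff n k) * p ^ k * (1 - p) ^ (n - k)

-- ghci> sum [binomPmf 100 k 0.1 | k <- [11 .. 100]]
-- prints roughly 0.4

So a single fixed-size run cannot soundly accept or reject a 90% reliability claim on its own; that is the kind of sequential, statistically sound decision checkCoverage is designed to make.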

stevana commented 5 years ago

Glad you like it!

> Have you seen the recent support for statistically sound checking of coverage criteria in QuickCheck, namely the checkCoverage function? It would be cool if that could also be used here! For example, if we want to check that the program is 90% reliable, then a test run in which 89 out of 100 tests pass is not convincing evidence that the program is wrong.

I've seen checkCoverage in the changelog, but I only recently read the documentation properly, and I haven't looked at the implementation in detail yet.

I agree that it would be cool to have something along the same lines! One thing we should be careful about is not to confuse maxFailPercent = 10 with the program having a reliability of 90%.

There are many different reliability models; the paper I linked to above mentions four of them, and I believe each one has a different notion of how many tests are enough to assert (or refute) a given reliability percentage.

The way I see it, maxFailPercent makes it easier to collect the statistics you need to calculate the reliability (given some model). I think it would be nice for QuickCheck to be able to talk about reliability, but that would require some more work.
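
To make "given some model" concrete, here is an illustration only (it is not part of the PR, and the helper name is made up): one of the simplest models is the zero-failure "success run". If the true reliability were below r, then n consecutive passing tests would occur with probability below r^n, so requiring r^n <= 1 - c turns n failure-free tests into a demonstration of reliability r at confidence c.

-- Hypothetical helper, not part of QuickCheck or this PR: the number of
-- failure-free tests needed to demonstrate reliability r at confidence c.
testsForReliability :: Double -> Double -> Int
testsForReliability r c = ceiling (log (1 - c) / log r)

-- ghci> testsForReliability 0.9 0.95
-- 29

Under this particular model, the Cleanroom-style claim "90% reliable with 95% confidence" from the top of the thread corresponds to 29 failure-free tests; other models give different numbers.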

(By the way, I've collected some links to other papers and resources on statistical testing here, if you find that kind of stuff interesting.)

thalerjonathan commented 5 years ago

Thank you so much for this feature @stevana, I was looking for exactly this!

I am doing research on using QuickCheck to test pure functional Agent-Based Simulations (ABS), where we sometimes need exactly this feature due to the stochastic nature of ABS. So again, thank you; now I don't have to implement it myself.

stevana commented 5 years ago

@thalerjonathan: Cool, I'm happy you found another use case for this feature. If possible please do share some of your work when you're done.

@nick8325: I've now seen John's "Building on developers' intuition" talk, and I have a better understanding of how checkCoverage works. I'm still not sure how to achieve something similar for reliability, though.

thalerjonathan commented 5 years ago

@stevana After a bit of experimenting and playing around, I have moved in my use cases more towards cover, with or without checkCoverage:

cover
  :: Testable prop
  => Double -- ^ The required percentage (0-100) of test cases.
  -> Bool   -- ^ True if the test case belongs to the class.
  -> String -- ^ Label for the test case class.
  -> prop
  -> Property

As the first argument I pass the required percentage; the second argument is then the actual property, e.g. does the test fail, yes/no; the third argument is a label for the test-case class; and the fourth argument is simply True. This is not the use QuickCheck intended, but it somewhat emulates your behaviour: the tests ALL go through, while the percentage that fail/succeed is captured by cover and reported in case it is below the required percentage.

One difference to your approach is that a fixed number of tests (100 by default, or whatever you set it to) is always run, whereas with your implementation one ends up with e.g. 100 plus the number of failed cases in total. Another difference is that it does NOT lead to an overall failed test in case the required percentage is not met. That can be somewhat achieved by using checkCoverage, but it might result in many more (or fewer) runs than the default 100 (or whatever you configure). As far as I understood, the benefit of using checkCoverage is that you get a statistically robust guarantee (using sequential statistical tests to check hypotheses) when your test case won't reach the required percentage. I think having this guarantee is extremely valuable, which is the reason I rather prefer the built-in functionality.
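
For concreteness, here is a minimal sketch of that pattern, reusing the choose (0, 10) example from the top of the thread (the property name and the 85% figure are picked arbitrarily; the true pass rate is 10/11, roughly 91%):

import Test.QuickCheck

-- The property proper always passes (the final True); cover merely records
-- how often i lands in [0..9], and checkCoverage keeps generating tests
-- until it can statistically accept or reject the 85% requirement.
prop_mostlyInRange :: Property
prop_mostlyInRange =
  checkCoverage $
    forAll (choose (0, 10 :: Int)) $ \i ->
      cover 85 (i `elem` [0 .. 9]) "i in [0..9]" True

Running quickCheck prop_mostlyInRange then either passes and reports the observed percentage, or fails the whole property once the coverage requirement is statistically shown not to hold, which gives the overall-failure behaviour discussed above.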

MaximilianAlgehed commented 5 months ago

It appears this is subsumed by checkCoverage, and the PR is stale. If this is still a feature people are interested in, it would be great to open an issue where we can discuss the design and how it relates to checkCoverage in detail.

I'm going to close this as a PR for now, but don't get the impression that the discussion isn't welcome: it would be great to continue it in an issue or in a new PR that is up to date with the current version of the code base!