sageserpent-open / americium

Generation of test case data for Scala and Java, in the spirit of QuickCheck. When your test fails, it gives you a minimised failing test case and a way of reproducing the failure immediately.
MIT License

Mismatch between required cases limit and actual number of cases examined to yield best shrinkage. #46

Closed sageserpent-open closed 2 years ago

sageserpent-open commented 2 years ago

This test here: DifferenceTest.mustNotBeZero requires a huge cases limit of 3500 to successfully minimise the test failure.

However, the report here: Report for Americium Shrinking on "difference must not be zero" indicates that only 1615 cases were run overall, and that the failures were discovered on the 993rd and 1292nd attempts - so clearly such a high limit wasn't warranted.

Yet even a reduction of the cases limit to 3400 fails to yield the best minimisation - the first of the two failing cases is still detected, as expected, but the second one fails to show up.

I suspect this is a recurrence of the old starvation problem...
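To pin down the terms used in this ticket, here is a much-simplified model of a QuickCheck-style driver loop. The name `runTrials` and its signature are invented for illustration and are NOT the Americium API; the only point being made is that the cases limit caps the total number of test invocations, so a failure first hit on the 993rd attempt needs a limit of at least that many attempts to be seen at all.

```scala
// Hypothetical, much-simplified driver; NOT the Americium API.
def runTrials[A](
    generate: Int => A, // deterministic generation by attempt index
    test: A => Boolean, // true means the test passes
    casesLimit: Int
): (Int, Option[A]) = { // (cases actually run, first failing case)
  var attempts           = 0
  var failure: Option[A] = None
  while (attempts < casesLimit && failure.isEmpty) {
    val candidate = generate(attempts)
    attempts += 1
    if (!test(candidate)) failure = Some(candidate)
  }
  (attempts, failure)
}
```

For example, if the failing case first appears on the 993rd attempt, `runTrials(identity, _ != 993, 900)` runs 900 cases and finds nothing, whereas a limit of 1000 finds it - which is what makes the observed need for a 3500-case limit look so anomalous.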

sageserpent-open commented 2 years ago

Seen in Americium 1.4.6 .

sageserpent-open commented 2 years ago

Bisection shows the bug was introduced in 1.4.3; it is also observed in the companion test DifferenceTest.mustNotBeOne .

sageserpent-open commented 2 years ago

Now added a test to TrialsSpec to reproduce this - more detailed bisection shows this bug was introduced in commit 75c649c7877be1be27a84dbae08003a47c9e9a21 .

At first glance, this is probably due to the filtration of failing test cases during shrinkage when no progress is made on complexity or on factory inputs cost. Because that filtration sits downstream of case generation (which is where the cases limit is enforced), it could be responsible: the filtration throws away the additional test cases allowed when the cases limit is increased, so the number of cases supplied doesn't appear to change, but there are more opportunities for better shrinkage - thus the result changes.

This isn't wrong in terms of shrinkage, but is definitely not what the user expects.
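The suspected mechanism can be sketched as follows (invented names, not the real TrialsImplementation): the cases limit is enforced at generation, but a downstream filter discards any shrink candidate that fails to improve on the best cost seen so far. The number of cases actually supplied to the test barely changes as the limit grows, yet a larger limit gives the filter more raw candidates to sift, hence better shrinkage.

```scala
// Sketch only - the real filtering in TrialsImplementation is more involved.
def shrinkWithDownstreamFilter(
    candidateCosts: Seq[Int], // candidate "costs"; lower means better shrinkage
    casesLimit: Int
): (Int, Int) = { // (cases supplied to the test, best cost reached)
  var best     = Int.MaxValue
  var supplied = 0
  for (candidate <- candidateCosts.take(casesLimit)) { // limit enforced upstream...
    if (candidate < best) { // ...filtering happens downstream
      best = candidate
      supplied += 1
    }
  }
  (supplied, best)
}
```

With the stream `Seq(50, 60, 40, 70, 30, 90, 20, 99, 10)`, a limit of 5 supplies 3 cases and bottoms out at cost 30, while a limit of 9 still supplies only 5 cases but reaches cost 10 - the supplied count moves little, the shrinkage result a lot.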

sageserpent-open commented 2 years ago

This is caused by TrialsImplementation filtering potential shrunk cases downstream of the logic where a) the cases limit is applied and b) test case starvation is detected. So allowing more cases affords a bigger budget for finding a shrunk case that beats the best one seen so far - hence better shrinkage.

Some experiments were performed where the shrunk-case filtering was moved upstream, prior to actually supplying the test case to the test lambda - this is feasible simply because the filtering condition is completely oblivious to both the content of the test case and the outcome of the test. This allows the cases limit to take the shrinkage filtering into account, which is good, but it doesn't play well with the starvation detection logic. Excluding the shrinkage filtering from starvation detection fixes that issue, but reintroduces the pathologically long shrinkage situation (possibly even an infinite loop - not sure, as I haven't had the patience to let it run for up to an hour).
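The dilemma in that experiment can be modelled with invented names: the filter sits upstream of the test, and a 'starvation' count tracks consecutive candidates rejected without reaching the test. Whether a filtered candidate spends the starvation budget is precisely the choice between stopping too early and potentially never stopping on an unproductive candidate stream.

```scala
// Sketch of the upstream-filtering experiment; not the real implementation.
def shrinkWithUpstreamFilter(
    candidateCosts: Seq[Int],
    starvationBudget: Int,
    filteredSpendsBudget: Boolean
): Int = { // best cost reached before starving or exhausting the stream
  var best     = Int.MaxValue
  var starved  = 0
  val iterator = candidateCosts.iterator
  while (starved < starvationBudget && iterator.hasNext) {
    val candidate = iterator.next()
    if (candidate < best) {
      best = candidate
      starved = 0 // progress resets the starvation count
    } else if (filteredSpendsBudget) {
      starved += 1 // rejected upstream, but still spends the budget
    }
    // else: rejections are free, so an endless unproductive stream would
    // never trip starvation - the pathologically long (or infinite) case.
  }
  best
}
```

On the stream `Seq(50, 60, 60, 60, 40)` with a starvation budget of 2, counting rejections stops at cost 50 and never reaches the better cost 40; not counting them reaches 40, but only because this toy stream is finite.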

Need to think about this....

sageserpent-open commented 2 years ago

There is a subtlety to this. One would expect that providing a higher budget for the number of cases should still hit the same failing test cases, and thus provoke the same shrinkage sequence to start with. However, where the shrinkage would have hit a dead end - and thus either changed tack or settled for the best shrunk case so far - a higher budget might allow further progress, and thus a divergence in the shrinkage sequence.

So yes, we expect to see the shrinkage sequence starting off with the same failing cases presented in the same order, but as the cases limit increases, some variation can occur later on. It is reasonable to expect that the common prefix of the shrinkage sequence should get longer as the cases limit gets higher and higher.
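That prefix claim can be demonstrated with a toy deterministic shrinker (a sketch, not the real shrinkage logic): runs with different cases limits replay the same candidate stream, so the sequence of improvements found under a smaller limit is a prefix of the sequence found under a larger one.

```scala
// Toy deterministic shrinker illustrating the common-prefix property.
def improvementSequence(candidateCosts: Seq[Int], casesLimit: Int): List[Int] = {
  var best         = Int.MaxValue
  val improvements = List.newBuilder[Int]
  for (candidate <- candidateCosts.take(casesLimit)) {
    if (candidate < best) {
      best = candidate
      improvements += candidate
    }
  }
  improvements.result()
}
```

For `Seq(182, 155, 200, 10, 170, 5)`, a limit of 2 yields `List(182, 155)` and a limit of 6 yields `List(182, 155, 10, 5)` - the smaller run's sequence is a prefix of the larger run's.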

sageserpent-open commented 2 years ago

After several false starts, a lot of head-scratching and the benefit of hindsight, added a decent bug reproduction test in commit b15205517949f7021a56c216f8e4024f25cf6d1f that passes against the code as of Americium 1.4.2, fails in commit cc402a936576be44944919799f32847c8248d857 as of Americium 1.4.3, then passes again in commit c927cb47f0270b8d7a8bf08a73ec2d1ce83c7680 with the fix made in c886ee427c7184f974fdae13f075ff3ce6ae132c .

sageserpent-open commented 2 years ago

This has gone out in Americium 1.5.0; release 1.5.1 was built with the back-patched test mentioned above, for the sake of a release with associated testing - commit e743947f430bc0ad7064ebd63e6efae8920f457a .

sageserpent-open commented 2 years ago

As evidence, we have an updated form of the test above here: DifferenceTest.mustNotBeZero (note, this has been merged into the upstream repository).

The report here: Report for Americium Shrinking on "difference must not be zero" shows that with a cases limit of 300, the first failing case is found at the 251st attempt.

Overall, around 1516 cases were run with 3 shrinkages; as (1 + 3 + 2) * 300 = 1800 cases would be the expected worst case, this seems reasonable. NOTE: the 1 + 3 + 2 is the count of the initial failing case, the following 3 shrinkages and the final 2 abortive attempts to shrink using the 'normal' and 'panic' shrinkage strategies.
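The worst-case arithmetic above is just phases times budget - one full budget of up to the cases limit for finding the initial failing case, one per observed shrinkage, and one per abortive final shrinkage strategy ('normal' and 'panic'). A one-liner makes the bound explicit (the helper name is mine, purely for illustration):

```scala
// (initial failure + observed shrinkages + abortive strategies) * cases limit
def worstCaseAttempts(casesLimit: Int, shrinkages: Int, abortiveStrategies: Int): Int =
  (1 + shrinkages + abortiveStrategies) * casesLimit
```

So `worstCaseAttempts(300, 3, 2)` gives 1800, comfortably above the roughly 1516 cases actually observed.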

Dropping the cases limit in that test to 240 does not find a failing test case, as expected.

Increasing the cases limit yields the following shrinkage sequences:

Cases limit of 300:

```
[251] [182, 182]
Shrinking ... [436] [155, 155]
Shrinking ... [720] [10, 10]
Shrinking ... [1009] [10, 10]
```

Cases limit of 1000:

```
[251] [182, 182]
Shrinking ... [436] [155, 155]
Shrinking ... [720] [10, 10]
```

Cases limit of 2000:

```
[251] [182, 182]
Shrinking ... [436] [155, 155]
Shrinking ... [720] [10, 10]
Shrinking ... [3121] [10, 10]
Shrinking ... [5140] [10, 10]
```

As a side note, it is still possible to see situations where increasing the cases limit reveals a failing case that occurs well before the cases limit, yet cutting the cases limit back to just above that failing case then fails to find it. This is due to 'case starvation', where cases are not submitted because they are duplicates of previously submitted cases, or are not considered shrunk compared with the previous best shrunk case.

Starvation has a budget that is based on the cases limit, so there is a secondary effect: for some formulations of Trials, increasing the cases limit can suddenly jump past starvation to yield a failing test case, but as starvation is not counted as an attempt, the failing test case comes earlier than expected relative to the cases limit. For now, this is just a subtlety that has to be tolerated; outside of the rather extreme test suite that this ticket deals with, it has never been an issue.
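A hedged sketch of that secondary effect (the `2 * casesLimit` starvation budget is an assumed scaling, purely for illustration - the real budget formula may differ): duplicate cases are rejected without counting as attempts, so a stream front-loaded with duplicates can starve out under a small cases limit, while a larger limit (hence a larger budget) pushes past the duplicates to a failing case whose attempt index is well below either limit.

```scala
// Toy model only - not the real starvation accounting in Americium.
def firstFailure(
    stream: Seq[Int],
    casesLimit: Int,
    test: Int => Boolean
): Option[Int] = { // attempt index of the first failure, if one is found
  val starvationBudget = 2 * casesLimit // assumed scaling, for illustration
  val seen             = scala.collection.mutable.Set.empty[Int]
  var attempts         = 0
  var rejections       = 0
  val iterator         = stream.iterator
  while (attempts < casesLimit && rejections < starvationBudget && iterator.hasNext) {
    val candidate = iterator.next()
    if (seen.add(candidate)) { // fresh case: submit it as an attempt
      attempts += 1
      if (!test(candidate)) return Some(attempts)
    } else {
      rejections += 1 // duplicate: starves, but does not count as an attempt
    }
  }
  None
}
```

With a stream of 25 copies of the same case followed by fresh ones, a limit of 10 starves out before ever reaching the failing case, while a limit of 15 finds it on only the 6th attempt - well below either limit, exactly the counter-intuitive reading described above.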