sixty-north / cosmic-ray

Mutation testing for Python
MIT License

Estimating mutation score by sampling #490

Closed tomato42 closed 4 years ago

tomato42 commented 4 years ago

I was thinking about issue #484, and I think there is a bit of a problem with it: what if the test coverage changes but the code does not? How can we tell whether the changes to test coverage are OK when no application code was changed?

Maybe cosmic-ray could execute a random selection of the mutants and calculate a confidence interval for the result?

The formula I found uses:

- p: percentage of mutants that survived
- q: percentage of mutants that were killed (i.e. 1 - p)
- n: number of mutants tested
- N: total number of mutants
- z: z-score (scaling factor for the given confidence level: 1.65 for 90%, 1.96 for 95%, 2.58 for 99%)

So if I have 8000 mutants, tested 40 of them at random, 20% of them survived, and I want a 95% confidence interval for that 20%, I calculate:

sqrt((p * q) / n) * z * (1 - sqrt(n / N)) = sqrt(0.2 * 0.8 / 40) * 1.96 * (1 - sqrt(40 / 8000)) = 0.115

so by testing 40 mutants, I know that the real survival rate for this test suite is 20% ± 11.5% (95% confidence)
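The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula as quoted in this comment; the function name `margin_of_error` and its parameter names are made up for illustration and are not part of cosmic-ray.

```python
import math

# z-scores as quoted in the comment (90%, 95%, 99% confidence)
Z_SCORES = {90: 1.65, 95: 1.96, 99: 2.58}


def margin_of_error(p, n, total, confidence=95):
    """Margin of error for a sampled survival rate (hypothetical helper).

    p: fraction of tested mutants that survived (0..1)
    n: number of mutants tested
    total: total number of mutants (N in the comment)
    """
    q = 1.0 - p
    z = Z_SCORES[confidence]
    # Formula as written in the comment: sqrt(p*q/n) * z * (1 - sqrt(n/N))
    return math.sqrt(p * q / n) * z * (1.0 - math.sqrt(n / total))


margin = margin_of_error(p=0.2, n=40, total=8000)
print(f"survival rate: 20% ± {margin * 100:.1f}%")  # prints: survival rate: 20% ± 11.5%
```

Note that textbooks more commonly write the finite-population correction as sqrt((N - n) / (N - 1)) rather than (1 - sqrt(n / N)); the sketch deliberately keeps the form quoted here, so the result matches the 0.115 above.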

The nice thing is that if execution selected the tests at random, the estimate could simply be a switch to cr-report that derives it from total jobs vs. completed jobs and the already calculated survival rate.

abingham commented 4 years ago

You're absolutely right that using coverage to drive CR is a heuristic; it generally won't tell us exactly what tests need to be run since logical paths through the code might alter the relationships between tests and coverage.

I'm not confident enough in my statistics to assess if what you're proposing is mathematically sound, but I certainly like the approach in principle. Assuming the mathematics is (or could be made) correct, I think it's worth pursuing. I think you could very easily write an interceptor that a) identifies the random mutations to run and b) marks all others as skipped. You'd then need a specialized (but probably quite simple) reporting tool to interpret the results.
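As a rough illustration of steps a) and b), the selection logic might look like the sketch below. It works on plain job ids and does not use cosmic-ray's actual interceptor API; `select_sample`, the `"run"`/`"skipped"` labels, and the `job-N` ids are all hypothetical.

```python
import random


def select_sample(job_ids, sample_size, seed=None):
    """Pick a random sample of jobs to run and mark every other job as skipped."""
    job_ids = list(job_ids)
    rng = random.Random(seed)  # seeded so a selection can be reproduced
    chosen = set(rng.sample(job_ids, min(sample_size, len(job_ids))))
    return {job_id: ("run" if job_id in chosen else "skipped") for job_id in job_ids}


# Hypothetical job ids standing in for the mutations in a session
plan = select_sample([f"job-{i}" for i in range(8000)], sample_size=40, seed=1)
to_run = [job for job, state in plan.items() if state == "run"]
print(len(to_run), "jobs selected;", len(plan) - len(to_run), "skipped")
```

A real interceptor would apply the same split to the session's work items instead of returning a dictionary.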

On Sat, Nov 2, 2019 at 1:45 AM Hubert Kario notifications@github.com wrote:


tomato42 commented 4 years ago

I updated the question later: if cosmic-ray exec executed the test cases in random order, then cr-report could simply take the results from the DB and calculate the confidence interval (probably with a switch)

so it wouldn't be a new interceptor, but rather the ability to estimate results from a partial run (like in CI, where you can run the tests for 20-30 minutes and make do with what you've got)
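A minimal sketch of that partial-run idea, under stated assumptions: jobs are shuffled, run until a time budget expires, and the interval is computed from whatever finished. `run_partial`, `estimate`, and `fake_run` are hypothetical names; nothing here calls cosmic-ray itself, and the formula is the one quoted earlier in this thread.

```python
import math
import random
import time


def run_partial(jobs, run_job, budget_seconds, seed=None):
    """Run jobs in random order until the time budget is spent.

    run_job is a hypothetical callable returning True if the mutant survived.
    Returns (completed, survived).
    """
    order = list(jobs)
    random.Random(seed).shuffle(order)
    deadline = time.monotonic() + budget_seconds
    completed = survived = 0
    for job in order:
        if time.monotonic() >= deadline:
            break
        survived += bool(run_job(job))
        completed += 1
    return completed, survived


def estimate(completed, survived, total, z=1.96):
    """Survival rate and margin of error from a partial run."""
    if completed == 0:
        raise ValueError("no completed jobs to estimate from")
    p = survived / completed
    margin = math.sqrt(p * (1 - p) / completed) * z * (1 - math.sqrt(completed / total))
    return p, margin


def fake_run(job):
    """Stand-in for a real mutation test run: every fifth mutant 'survives'."""
    time.sleep(0.001)
    return job % 5 == 0


jobs = range(8000)
completed, survived = run_partial(jobs, fake_run, budget_seconds=0.05, seed=0)
rate, margin = estimate(completed, survived, total=len(jobs))
print(f"{completed}/{len(jobs)} jobs: survival rate {rate:.1%} ± {margin:.1%}")
```

Because the order is random, the completed jobs form a random sample of all jobs, which is what the interval calculation relies on. The margin shrinks as more jobs complete and reaches zero once every job has run.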

tomato42 commented 4 years ago

I've proposed a PR to implement it

Here's one lecture that goes into error estimation: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals_print.html