Add config option to count timeouts as mutant escapes

CyberiaResurrection commented 4 years ago

Is your feature request related to a problem? Please describe. I'm concerned that mutation runs on CI environments may inadvertently be overstating MSI scores.

I keep seeing a wide disparity in number of timeouts between running Infection locally (2 timeouts out of ~1200 mutants) vs running on CI (98 timeouts out of ~1000 mutants - that job only mutates files that were changed in the PR) - a timeout rate nearly 60x greater on CI. This is most likely due to the differing processing power available to each case.

Describe the solution you'd like As I understand it, Infection currently counts a mutant against the run's MSI score if it didn't produce some visible misbehaviour (test failure, fatal error, timeout).

I'm asking for an option, settable in either infection-json.dist or supplied via command line, that alters the sense of MSI calculation. This would simply not count timeouts as "visible misbehaviour".

Referring back to the sample Travis job I attached, only the 871 killed mutants (and the zero fatal errors) would count towards the overall MSI (which would be 87.6% under my proposal) and covered-code MSI (which would be 89.1%), not the reported values of 97% and 98%, respectively.

Although my initial use case arises out or running under a CI environment, I would be reluctant to default it to being set for CI runs. That feels too much like setting policy, not merely providing mechanism.

Describe alternatives you've considered

Dug through Infection docs, couldn't find any way to kitbash this together.
Bump timeout on CI runs. Didn't do this because mutation runs on Travis take long enough already.

Additional context Sample CI timeout-laden run: https://travis-ci.org/Algo-Web/POData-Laravel/jobs/659174053?utm_medium=notification&utm_source=github_status

theofidry commented 4 years ago

I don't think a timeout should count as an escape: if you have a mutant that makes your for condition an endless loop, you don't have any test that can effectively kill it (and neither should you).

But having a big timeout disparity in the CI definitely indicates an issue. Have you tried with increasing the timeout from 10 to a bigger number to see if it reduces them?

sanmai commented 4 years ago

Let's consider why we have timeouts. Change < to > in a cycle, and there you have it, an infinite cycle. Add some expensive and memory intensive operations inside this cycle, and there is a bomb, waiting to blow. Now you have a mutated process, running in an infinite loop, trying to consume every resource your machine has. And one can reasonably expect CI machines to not have much in this regard.

Now, consider you decided to "fix" this problem by raising a timeout. What's going to happen? Now you will have this vicious process running not for 25 seconds, but for a full minute, all the time putting your system under even greater stress. Even more stress will make every other process slower, hence causing even more timeouts. So, raising a timeout is almost never a solution.

But if you lower the timeout, or introduce a stricter memory limit, you will limit damage caused by runaway mutations, and thus the number of other timing out mutations.

Bottom line

Try to eradicate all timeout mutations, either by inserting controls, or, better, by careful use of mocks. If you know a method can only be called X times, for an input with N items, you know what to do.
Try running CI builds with --no-progress.
Try to lower timeout to the smallest reasonable value, and/or introduce a stricter memory limit.
Also try reducing --threads=4 to, say $(nproc). It might be too large.
Consider using PCOV. It is as easy to set up as Xdebug, but it is so much better.
As a last resort you can try ignore mutations causing timeouts, for certain lines and files.

What else can we do?

Seeing this issue, and experiencing this problem myself, I can tell this is actual problem, which is not well understood by the users.

It may improve situation if we would treat this "timeout" mutations not as some vague timeout, but on more concrete terms, say, as runaways, or fugitives. This way we will suggest that this mutations are not to be treated lightly.
There might be the case where Infection can't react to these runaway processes in a timely fashion because of the stress this processes cause on the system. It may worth to lower process priority for mutation processes.
It may worth adding a control for the number of these mutations specifically. For example, with --max-runaway=N you can cause a build to fail if there's more than N of them.

CyberiaResurrection commented 4 years ago

Bumping the timeout to 30 seconds does roughly halve the timeouts - 46 timeouts out of 1000 mutants on the run below, compared to 98 out of 994 mutants on the original run I linked.

https://travis-ci.org/Algo-Web/POData-Laravel/jobs/659388718?utm_medium=notification&utm_source=github_status

CyberiaResurrection commented 4 years ago

@sanmai , I originally didn't want to "fix" the problem by boosting the timeout - that's a diagnostic step that @theofidry asked me to do. My original question/proposal was a config option to count timeouts, at whatever threshold you may have set, as failed-to-detect mutants.

sanmai commented 4 years ago

If you'd want to add something like --max-runaway=N, I'd say it's a good idea. It will let you weed out botched builds for sure, where including these processes into MSI scores may or may not offer SNR improvement. Remember, there are probabilities involved: depending on the state of your project Infection may run tests and/or mutations in a different order.

But don't take only mine word for it. Let's hear what others want to say.

CyberiaResurrection commented 4 years ago

Oh yeah, definitely.

I have a co-maintainer who does tend to get a little over-excited at sharp increases in coverage, whether statement or mutant.

theofidry commented 4 years ago

I took a bit of time to think about the issues and have a bit of chat with others on the topic. There is far from being a consensus on the subject, but there is one thing that is clear to me: I'm less and less convinced that relying on the MSI, min MSI or another additional option is a good idea.

Ultimately a MSI is a very loose metric. It's a nice indicator, but people tend to take too much as "it's over X, so my code is good I can just merge it". Maybe the work here should be done not on trying to add another option to attempt to control that metric, but on the reporting itself. For example:

when displaying the run summary, reporting the % of timeouts not just their numbers
working on a way to annotate the escaped mutations directly in the PR or link to the stryker dashboard

danepowell commented 7 months ago

I would love a --max-timeouts option. On my project, I'd have zero tolerance for timeouts or runaways since they are insidious: over time, they gradually make your tests slower and slower, especially if you already have a high timeout.

infection / infection

Add config option to count timeouts as mutant escapes #1133

Bottom line

What else can we do?