False positives replacing equivalent expression, like `Duration::ZERO` and `Default::default()`

sourcefrog / cargo-mutants

:zombie: Inject bugs and see if your tests catch them!

https://mutants.rs/

MIT License

597 stars 27 forks source link

False positives replacing equivalent expression, like `Duration::ZERO` and `Default::default()` #435

Open kevincox opened 6 days ago

kevincox commented 6 days ago

For example cargo-mutants will complain that it can replace std::time::Duration::ZERO with Default::default().

Some cases in the standard library can maybe be special cased but there are often functions that are equivalent to Default::default() but are provided because they provide a more meaningful name. In fact Default::default() is often implemented just by calling a different method like Self::empty().

I don't know if there is a general solution to this problem. Maybe there can be a basic complexity metric such that this replacement is only done if the thing being replaced is more complex than a simple field access, function call or similar. For example Duration::from_secs(0) or accessing some constant DELAY likely should also not be replaced.

kevincox commented 6 days ago

A much more complex option would be to inject some sort of check that runs the old code and checks that at least once it is different than the default value.

Although this has a number of problems:

Requires that the value implements PartialEq.
Requires injecting code into the process and a side-channel to get the result of the check.
Still not perfect as it just shows that the test don't trigger a non-default value, not that the replaced expression is equivalent to Default::default().

sourcefrog commented 6 days ago

Adding a table of heuristics that are known to be equivalent seems like the most immediately tractable thing.

I don't know about running the code as a way to detect this: in addition to the factors you name there it would also rely on observing all the side effects of the code.

I wondered if one option would be to look at the generated code (at some stage of the pipeline) and see if it's equivalent. If so we can be confident the mutation has no effect. However, that seems potentially pretty complicated as the code may not get inlined, or the name of the full symbol might still be present. Symbolic interpretation would be interesting but seems like a large expansion in complexity.

kevincox commented 6 days ago

What are your thoughts on a generic heuristic? For example FooBaz::bar(18) is "simple enough" that we just skip over it. This will definitely leave some false-negatives (ex: Duration::from_secs(1) wouldn't be replaced) but will likely avoid a lot of false-positives in unknown code.

I'm thinking that the following things would probably be considered "simple enough".

Calls to static functions where all arguments are simple expressions.
References to constants.
Literals. (Maybe with some sort of length limit for strings?)
Some whitelisted method calls like .clone() called on an otherwise simple expression.

This would notably exclude:

Anything that accesses function arguments.
Any conditionals, loops or other control flow.

sourcefrog commented 6 days ago

That would be a step beyond adding more heuristic name-based assumptions that, say, new typically means the same as default(), etc.

I think it would be important to be clear what properties we are attributing to "simple enough" expressions.

One option is "any simple expression probably has the same value as one of the generated default values", so mutating it is likely to be a false positives, i.e. FooBaz::bar(18) is probably the same as Default::default(). That seems unlikely to be true in general, and I'd hope we could minimize this by recognizing cases. The point of function calls is generally to return a non-default value or to have a side effect.

Or, you might be saying that "sufficiently simple expressions are self evident and unlikely to have any bugs, so it's fine if they're not tested." It's probably true that complex code is more likely to have bugs than simple code, but I don't think I'd go that far. For example if you have a function that returns a static string, and no test depends on the value of the string, I think that's worth at least thinking about. Conceivably we could have a default-off option to skip simple functions.

I think the way forward is to accumulate some examples from real code that show a significant number of uninteresting false positives and then think about how to reduce them.

kevincox commented 6 days ago

That's an important distinction. I think the result ends up having some benefits from both categories.

There are often values that aren't critical. Maybe I have a timeout somewhere and it isn't important if it is 10s or 15s. I don't necessarily need a test. That being said in most cases 0s is bad, but putting a test >5s seems silly. This is sort of your "unlikely to have bugs" point. In general I am not very interested in writing tests that just assert values of my "constants" (which may be calculated via functions or similar). But of course there are always cases where it does make sense (maybe the calculation of that constant is in fact complex, even if the local code is just gen_constant().

Another part of it is that in many cases I would prefer to avoid false-positives. Especially if I am new to mutation testing I might want to get more significant results. Although it may also be worth having these often false-positive available in some modes or similar for those who really want to audit and consider a long list of results. This leans more towards the first category of there is a decent chance that this is a default value or in some ways just as valid as the default value, so maybe it makes sense to suppress the possible false-positive.

Overall I think the main reason I think this may be a viable approach is the latter. Simple constant-looking expressions are less likely to be bugs and less likely to be worth testing so are false-positive prone. Whether or not they are equal to the default value is somewhat beside the point.

Seeing more examples is definitely a good idea. Then patterns can be drawn.

kpreid commented 6 days ago

Another part of it is that in many cases I would prefer to avoid false-positives. Especially if I am new to mutation testing I might want to get more significant results.

I think this is the more important consideration, and I’ll further propose that cargo-mutants should by default not mutate simple functions (if there is a workable automated definition of “simple” to apply here). My experience is that there‘s a lot of small functions that don’t feel worth writing tests for, because there is a sense in which the most reasonable test would be just reiterating the function’s code, and cargo-mutants would be a more useful tool — providing more value for the effort of running it — if it produced “more significant results” by default (and those more significant results might suggest better tests to write which end up incidentally testing the simple functions; e.g. getters might be tested as a consequence of testing the output of an algorithm that produces the struct having the getters).

Perhaps in this mode, if all mutants are caught (or even unconditionally), it could then print something like “123 mutations were not performed because they were presumed false-positive-prone. Run with --include-simple-functions to try these too.” — so that users can learn how to increase the coverage if they want to and when it’s a good time to do so.

sourcefrog commented 6 days ago

If you did not see it you might read https://mutants.rs/goals.html about my overall thinking on this balance.

sourcefrog commented 6 days ago

My experience is that there‘s a lot of small functions that don’t feel worth writing tests for, because there is a sense in which the most reasonable test would be just reiterating the function’s code... and those more significant results might suggest better tests to write which end up incidentally testing the simple functions; e.g. getters might be tested as a consequence of testing the output of an algorithm that produces the struct having the getters

I'd agree I wouldn't want to steer people towards writing unit tests for trivial functions that just recreate the code. (I wouldn't want that even regardless of mutants; those tests can if taken to extremes just make the program needlessly hard to understand and change.)

But, I'd be wondering why mutating those functions does not cause a larger test to fail. As you say: writing a test for a getter might be boring; but if the getter always returns None and no test fails that's interesting. Writing a test that checks fn timeout() { 5 } function might be boring, but if no test fails when the timeout is literally zero that seems surprising.

(Also: why is the hypothetical timeout a function not a const/static if it's always the same? If it actually reads from config or an environment variable, maybe that should be tested?)

kevincox commented 6 days ago

why is the hypothetical timeout a function not a const/static if it's always the same?

In my case this was a trait method. The trait has it as a method in case it isn't possible to be const/static (Rust is pretty particular about what can be) but the implementation isn't.