squizlabs / PHP_CodeSniffer

PHP_CodeSniffer tokenizes PHP files and detects violations of a defined set of coding standards.
BSD 3-Clause "New" or "Revised" License
10.67k stars 1.48k forks source link

cyclomatic complexity of ternary/null coalescing in control structures #3501

Open mmarton opened 2 years ago

mmarton commented 2 years ago

Describe the bug

This might not be a bug, I'm just curious of your opinion

After https://github.com/squizlabs/PHP_CodeSniffer/issues/3469 multiple part of my code is reporting too much cyclomatic complexity. i have something similar:

if (($variable1ThatCanBeNullOrFloat ?? 0.0) !== 0.0) {
    $context->addViolation('description');
}

if (($variable2ThatCanBeNullOrFloat ?? 0.0) !== 0.0) {
    $context->addViolation('description2');
}
// 6 more time

Until now one of this statements was +1 to complexity, not it's +2 because the null coalescing. But it's still 2 options here. stepping into the if or stepping it over.

Yes I can rewrite it to if ($variable1ThatCanBeNullOrFloat !== null && $variable1ThatCanBeNullOrFloat !== 0.0) but it's a lot more character and I like to spare my keyboard.

what do you think?

gsherwood commented 2 years ago

I think the new change is correct as it's more accurately counting the conditions in the IF that would evaluate it to TRUE or FALSE. Keen to hear what @MarkBaker thinks.

MarkBaker commented 2 years ago

There are a number of ways of writing this code:

$variable1ToTest = $variable1ThatCanBeNullOrFloat ?? 0.0;
if (($variable1ToTest !== 0.0) {
   $context->addViolation('description');
}

would be another way of expressing the same code logic using an intermediate assignment; and which clearly should increase the complexity by two - once for the null coalescence expression, and once for the if statement. Internally, PHP is doing just this, though the intermediate variable is kept hidden. Combining the two into a single line like if (($variable1ThatCanBeNullOrFloat ?? 0.0) !== 0.0) {...} doesn't reduce that complexity, it just "hides" it: the null coalescence is still evaluated with two possible paths, and the if condition is still tested with its two paths.

I believe that the complexity is still reflected accurately here - as least as accurately as Cyclomatic Complexity can be measured - in reflecting both the null coalescence expression and the if condition.

The CYC count isn't 100% accurate: coding structures have evolved since Henry McCabe's original 1976 paper on the subject. There is some argument on whether "Decisions involving compound predicates like those found in high-level languages like IF cond1 AND cond2 THEN ... should be counted in terms of predicate variables involved, i.e. in this example one should count two decision points, because at machine level it is equivalent to IF cond1 THEN IF cond2 THEN ...." (quoting from Wikipedia). This would be the case for your if ($variable1ThatCanBeNullOrFloat !== null && $variable1ThatCanBeNullOrFloat !== 0.0) alternative... however, this takes no account of lazy evaluation for || within an if evaluation... it's probably more pertinent to measuring NPath Complexity than the more simplistic Cyclomatic Complexity. Nor does the CYC calculation make any allowance for early returns (McCabe's definition of complexity is based one one entry point, one exit point within the code), or jumps (goto). And if I had my way nested ternaries should increment the complexity logarithmically, rather than simply by one for each level of nesting.

At the end of the day, Cyclomatic Complexity is a guideline that we can use to assess the maintenance risks of our code, not a hard and fast rule. Even the warning and error levels that we apply are just guidelines based on general principles of what is considered complex.

If it's any consolation; the most complex method in my most popular OSS library has a CYC of over 21 million. I've tried refactoring it, but any refactoring has a significant adverse affect on performance. That CYC makes it a pain to maintain; but a 3x performance overhead would make it impossible to run. I choose to live with that maintenance risk for the benefit of performance.

mmarton commented 2 years ago

Okay, i can accept that argument, but still weird that some operators count and some doesn't. Anyway, since you changed the calculation making it harder to stay inside the recommendations, wouldn't it make sense to also increase the warning/error limit?

mmarton commented 2 years ago

And another theoretical question:

Shouldn't optional chaining counted too? return $object->getX()?->getY() is technically return $object->getX() !== null ? $object->getX()->getY() : null

MarkBaker commented 2 years ago

Anyway, since you changed the calculation making it harder to stay inside the recommendations, wouldn't it make sense to also increase the warning/error limit?

This is probably a call for Greg to make.

In McCabe's original paper, he refers to as an upper threshold of 10 as a “reasonable, but not magical, upper limit”. But a subsequent presentation that he gave on the topic ('Software Quality Metrics to Identify Risk') and published as a NIST advisory used the following table to assess risk based on complexity:

CYC Risk
1-10 A simple program without much risk
11-20 More complex, moderate risk
21-50 Complex, high risk program
51+ Untestable program, very high risk

Steve McConnell in his book "Code Complete" was stricter:

CYC Risk
1-5 The routine is probably fine
6-10 Start to think about was to simplify the routine
11+ Break part of the routine into a second routine, and call it from the first routine

More recently, Phil Koopman seems to suggest thresholds of 10-15, and 30-50 (so potentially 15 and 30) although he's thinking in terms of embedded systems

Personally, I'd have no problem with setting the thresholds at 15 and 30 rather than 10 and 20; to make allowance for more modern coding structures and expressions.

MarkBaker commented 2 years ago

Shouldn't optional chaining counted too? return $object->getX()?->getY() is technically return $object->getX() !== null ? $object->getX()->getY() : null

It probably should, because it's an early break from the full expression, one that I'd forgotten while I was looking at what was missing from the sniff.

There's other omissions too:

An expression like;

$file = fopen($filename, 'r')
    or exit("unable to open file ($filename)");

or the use of break or continue while in a loop.

The CYC check is fairly simplistic in that it simply looks for certain tokens inside a function/method, and increments the CYC value by 1 each time it finds one of those tokens. The two cases that I mentioned above couldn't be counted in the same way because they're dependent on the context, not simply the presence of the tokens in the code. break can also occur in the cases of a switch statement, and shouldn't be counted when it's in that context, but only when the context is exiting from a loop. The or token can also occur when combining conditions in an if statement, so it shouldn't increment the CYC value in that context. Rewriting the sniff itself to take that context check into account would make it a lot more complicated.

Other possible omissions are subject to debate. In a switch or match statement, some people would argue that each case increments CYC by one, regardless of the number of conditions being tested in that case; others argue that a case with multiple conditions should increment for each condition. e.g.

    return match(strtolower(substr($monthName, 0, 3))){
        'apr', 'jun', 'sep',  'nov'  => 30,
        'jan', 'mar', 'may', 'jul', 'aug',  'oct', 'dec'  => 31,
        'feb' => is_leap_year($year) ? 29 : 28,
        default => throw new InvalidArgumentException("Invalid month"),
    }

should increment CYC by 13 rather than by 4 (+1 for the leap year ternary)

A throw is another case that sees some argument about whether it should increment CYC as well.


Getting back to your actual question though, I'd say that yes, the nullsafe operator (?->) should increase CYC by 1. I'll look to submit a PR for that.

mmarton commented 2 years ago

brake and continue in a loop are (almost) always inside of an if statement, so that would be double counting imo

MarkBaker commented 2 years ago

Normally; but you could get very dirty code like

$data = [
    'Hello',
    'World',
    'Terminate',
    'Goodbye',
];

foreach ($data as $entry) {
    switch ($entry) {
        case 'Terminate': break 2;
        default: echo $entry, PHP_EOL;
    }

}
mmarton commented 2 years ago

Yes, and I can use goto and eval() too :) But I don't think there is an common part of the ones who are doing that and who are caring about code quality :D

MarkBaker commented 2 years ago

@gsherwood Any opinion on changing the Warning and Error thresholds for CYC?

jrfnl commented 2 years ago

@MarkBaker The thresholds can be customized per project via the project custom ruleset: https://github.com/squizlabs/PHP_CodeSniffer/wiki/Customisable-Sniff-Properties#genericmetricscyclomaticcomplexity

mmarton commented 2 years ago

@MarkBaker The thresholds can be customized per project via the project custom ruleset: https://github.com/squizlabs/PHP_CodeSniffer/wiki/Customisable-Sniff-Properties#genericmetricscyclomaticcomplexity

yes, I can, and I'm doing it right now in my projects. But the question here is should the recommended default values go up with the recent changes in the calculation. And I think they should.