Common scoring system for vulnerability test coverage?

tomato42 commented 3 years ago

When working the security issues, we have the CVSS to gauge how severe a given issue is.

The problem is, that when a fix for an issue is released, it's not obvious what kind of test coverage was employed to ensure that the fix actually fixes the issue, that it fixes the general case, not only the specific case, or how extensive the test coverage for that issue is.

Secondly, when we consider issues in common protocols or data exchange formats, it's not uncommon that multiple implementations have the same or similar issues. So having documentation that a CVE-XXXX-YYYYY-like issue from library Z isn't also present in libraries other than Z because they test it in "such, such, and such way", would also be really useful.

(technically, this idea has overlap between the work groups, especially the Best Practices, but I'm filing it here as I'd rather keep the scope focused on security at the beginning rather than correctness in general)

So, do you think this is the best workgroup to start work on this? If yes, what would you suggest as next steps?

mprpic commented 3 years ago

@tomato42 Can you provide an example of what this would look like for some specific CVE and library? If I understand your description, you're proposing some formalized way to recognize a particular vulnerability that may affect multiple implementations (and thus have several CVEs assigned) that could then be used to show (and test) if library X is vulnerable or not. Perhaps something akin to this table of XML vulns and how they affected different Python XML parsers?

tomato42 commented 3 years ago

@tomato42 Can you provide an example of what this would look like for some specific CVE and library?

tbh, I don't have a complete set of metrics in mind, but few of the ones I think should be considered are:

presence of a unit test of the affected method
negative unit test of the affected method
test coverage of the affected method (different kinds, like modified condition/decision coverage, table coverage, not just line coverage)
parameter value coverage of the affected method (how many different classes of inputs there are vs how many are tested), i.e. property based testing
mutation score for the test cases that cover the affected method
presence of fuzz tests for the method
- how extensive
presence of performance tests of the affected method or ones that use the affected method
tests for timing side-channel of the affected method and code that uses it
tests for memory access invariance of the affected method
static analysis tools error count for the affected method
memory management checks of the affected method (no memory leaks, no unbounded memory growth, no uninitialised memory use)
interoperability testing with other implementations
integration tests that exercise the affected method
presence of a formal machine-validated proof of correctness for the affected code

so for many fixes/bugs the score would be rather low; for many issues some of those things are completely irrelevant, I was thinking of an open-ended scale, starting at 0, for no tests, and then growing up for better and better test coverage

the problem is that some of them (like parameter coverage, or mutation score) are more subjective than others

If I understand your description, you're proposing some formalized way to recognize a particular vulnerability that may affect multiple implementations (and thus have several CVEs assigned) that could then be used to show (and test) if library X is vulnerable or not.

well, I'd argue that if you have at least two implementations of the same format you can have the same bug in both of them

Perhaps something akin to this table of XML vulns and how they affected different Python XML parsers?

I may do, but I'm not sure if it would be illustrative... also, I'm not familiar with them, so it would be hard for me to say how I should score them

ossf / wg-vulnerability-disclosures

Common scoring system for vulnerability test coverage? #74