SARIF 2.2 proposal: `security-severity` field for reportingDescriptors and results

adityasharad commented 11 months ago

Originally filed as https://github.com/oasis-tcs/sarif-spec/issues/598, split into more focused component issues.

GitHub Advanced Security's code scanning feature recognises security-severity levels, currently read from the properties bag of a SARIF reportingDescriptor or result object, and renders these in the UI. GitHub CodeQL populates this property in its SARIF output, and the property is recognised for other code scanning tools.

Docs: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning#reportingdescriptor-object and https://github.blog/changelog/2021-07-19-codeql-code-scanning-new-severity-levels-for-security-alerts/

Using the properties bag was a pragmatic measure to provide this functionality without requiring a SARIF spec change. For SARIF 2.2 I propose we make security-severity an accepted property directly on a reportingDescriptor or a result.

Although GitHub currently accepts numeric values in this property, and translates them on the backend into enum values, we recommend moving away from this for the spec, and instead directly accept an enum of possible values: critical, high, medium, low, or omitted entirely (e.g. for non-security rules/results). SARIF producer tools can supply this property to indicate the tool's own confidence in the security severity of the result, or for a rule, the security severity of the results indicated by this rule. This avoids constraining tool developers into using the same translation methodology that we follow for CodeQL (which is based on CVSS scores, which all producers may not wish to follow). We can update CodeQL to begin populating these enum values using the new property instead of the numeric values it does today.

cc @michaelcfanning

michaelcfanning commented 10 months ago

The four levels of severity here correspond to an ask to add fatal/noncontiuable/critical to the current (business process level) level property. So that consistency is good.

I have two questions: 1) could we render this property simply as severity? does potential severity in a reliability setting, for example, not warrant its own setting? 2) alternately, should we consider simply burning cvss into SARIF as a standard property (the application of which strictly to security is obviously implied). I believe your numeric security-severity is a CVSS value?

ShiningMassXAcc commented 10 months ago

Overall, I think we can have more discussion here.

We've talked primarily about level but there is also kind 'above' leveland rank 'below' level.
+1 to Michael's comments about other domains, like reliability and privacy, that also use static analysis. Having a security specific item to me doesn't help align the larger ecosystem when thinking about this. The conversations about cvss being a bit polarizing, I think highlight this.
For me, this determination really lives outside of the result schema today. Here are a few examples that come to mind.

Ruleset files for VS that are used for both security, quality, and even just style. There are default values for levelthat match to SARIF for what the tool produces, but the local ruleset file is used by clients to determine how this particular codebase or organization interprets that individual rule. https://learn.microsoft.com/en-us/visualstudio/code-quality/using-rule-sets-to-group-code-analysis-rules?view=vs-2022

Thinking even about codeql within Microsoft, we have the idea of sdl-required and sdl-recommended groups of issues. Even with these, it is often up to the individual organizations and business owners to determine which set will result in folks showing up as red in S360.

Overall, my initial inkling is to perhaps add Criticalto Level, although this doesn't lend itself to some of the default rendering, we have in IDEs today. I'd love @michaelcfanning to talk to precedent on using other artifacts outside of the sarif results to determine this interpretation (eg. like https://learn.microsoft.com/en-us/visualstudio/code-quality/using-rule-sets-to-group-code-analysis-rules?view=vs-2022 above)

ShiningMassXAcc commented 10 months ago

Giving a detailed reliability example - if this helps:

My team surfaces crashes in SARIF. There are a few ways that we see dev teams and analytics determine severity.

How many hits are in production?
How do the hits impact customers - eg. is this an enterprise issue, is this a crash that has bad side effects or is relatively recoverable?
Are hits unique to a branch? Can we isolate the crash to code before it makes it to customers - if this is a pre-prod crash?
How does an individual team interpret this is highly subject to teams' general policy for bug bar / ship readiness / etc

What do we want SARIF and SARIF clients to be able to do to determine and communicate severity? Having an error level crash versus a warning level crash - seems a bit off compared to medium severity versus high severity crash. Not sure how we imagine these two working in tandem?

michaelcfanning commented 10 months ago

Good example. Our first principle for level is that it is communicates a sort of business process severity. Error breaks the build. Warning does not, etc. The separate proposed severity is an attempt to measure a level that's intrinsic to the problem (assuming it isn't a true positive!). Coupled with something like precision, you might end up with the following sort of thing: We treat this class of problem as a Note because the precision is low and the severity is medium. We treat this other class of finding as Error because the severity is high and the precision is medium. Etc.

@adityasharad, my concrete proposal is that we approve precision and severity (or security-severity if you prefer, and follow the model of rank, i.e., a value of 0 to 100.0 inclusive. Now all these things are consistent. We additional update the boilerplate language for all three of these properties to note that consumers can sort across these knobs as they are expressed. But consumers can also bucket any/all of these properties into low (0-25), medium (26 - 50), high (51 - 75) or critical.

'Very high' may be preferable to 'critical' as a suggested label.

These specific details, i.e., precise information on how to bucketize findings, will address concerns you raised on the call, I hope. @stacywray

oasis-tcs / sarif-spec

SARIF 2.2 proposal: `security-severity` field for reportingDescriptors and results #612