oasis-tcs / sarif-spec

OASIS SARIF TC: Repository for development of the draft standard, where requests for modification should be made via Github Issues
https://github.com/oasis-tcs/sarif-spec

SARIF 2.2 Proposal: Add clearer and explicit relationships between `guid`, `correlationGuid`, `fingerprints`, `partialFingerprints`, and `workItemUris` `result` properties. #615

Open ShiningMassXAcc opened 1 year ago

ShiningMassXAcc commented 1 year ago

All of these items are noted as potentially used by result management systems, but they are organized flatly in the result object. For end-customer consumption, we expect workItemUris to be the most used, but there is no clear indication of how a workItemUri maps to a result management system. The fact that workItemUris is a list while guid and correlationGuid are not suggests that perhaps we didn't have clarity on how these would be used.

In particular, what do we imagine multiple work items represent? Are these from different result systems, different sub-results within a single result, or different hashing systems? My team has complex results with several of these facets, but it's unclear how best to use these properties today given how clients will consume them.

Some possible thoughts:

Direct SARIF producers and SARIF converters MAY but do not need to set this property. A result management system SHOULD set this property when it ingests a SARIF log file.

While I'm opening this for general discussion, at minimum, I'd like to see workItemUris have a more direct mapping to the unique identifier that the workItemUri is being tracked against. In particular, the appendix on fingerprints perhaps muddies this further.

I don't have a great implementation I like here, but I'd perhaps break all these items into a subsection that is more clearly delineated. This is perhaps way too much change - but I wanted to get a sense of how other folks consume these properties.

{
    "results": [
        {
            "ruleId": "CA2101",
            "toolIdentifiers": {
                "guid": "DEADBEEF"
            },
            "resultSystems": [
                {
                    "Name": "System1",
                    "workItemUri": "foobar.com",
                    "partialfingerprint": "asdfasdf"
                },
                {
                    "Name": "System2",
                    "workItemUri": "foobar2.com",
                    "fingerprint": "bcdebcde"
                }
            ]
        }
    ]
}
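To make the intent of the shape above concrete, here is a minimal Python sketch of how a consumer might read it. The `resultSystems` layout and its key names (`Name`, `workItemUri`, `partialfingerprint`, `fingerprint`) are taken from the proposal above and are not part of any published SARIF schema; the helper name `identifiers_by_work_item` is hypothetical.

```python
def identifiers_by_work_item(result: dict) -> dict:
    """Map each work item URI to the (system name, identifier) it is tracked against.

    Assumes the proposed, non-standard "resultSystems" layout from this issue.
    """
    mapping = {}
    for system in result.get("resultSystems", []):
        # Each proposed entry carries either a full or a partial fingerprint.
        identifier = system.get("fingerprint") or system.get("partialfingerprint")
        mapping[system["workItemUri"]] = (system["Name"], identifier)
    return mapping


result = {
    "ruleId": "CA2101",
    "toolIdentifiers": {"guid": "DEADBEEF"},
    "resultSystems": [
        {"Name": "System1", "workItemUri": "foobar.com", "partialfingerprint": "asdfasdf"},
        {"Name": "System2", "workItemUri": "foobar2.com", "fingerprint": "bcdebcde"},
    ],
}

print(identifiers_by_work_item(result))
```

With the flat layout in the current spec, the same lookup is not possible, because nothing ties an entry in workItemUris to a particular fingerprint or system.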

Note - I'm not beholden to this being included in 2.2, but using that for consistent titling for now.

edkazcarlson-ms commented 1 year ago

As part of the discussion, I wanted to check whether my current understanding of guid vs. correlation guids vs. fingerprints vs. partial fingerprints vs. work items is correct. Does the SARIF spec assume the following is how SARIF data is used and transformed?

  1. A tool such as PolyCheck scans a codebase on a particular cadence.
  2. When the tool finds an issue such as an offensive word, it creates a result in a SARIF log that gets transmitted to whatever result management system the dev team is using. In the result, the tool sets partial fingerprints with hashes based on properties like the offending word or the file path it was found in, with the intended (but not required) purpose of being used to determine what is logically unique.
  3. When the result management system receives the file, it assigns the result an arbitrary, unique GUID if one has not been set already. If it is a result management system that uses fingerprints, it also calculates the fingerprints from the partial fingerprints provided, to make a logically distinct hash for the result.
  4. The result management system has an internal, hidden mapping of fingerprint -> (work item set). When it processes the result and gets the fingerprint, it generates a list of candidate work items that are logically distinct (as determined by the result management system) and compares the work items it needs to the work items it has. If any work items are missing or already resolved, it creates or reopens them and presents them to the user.
  5. When a consumer such as the SARIF viewer downloads a SARIF file through the result management system, the file has the fingerprints and work item(s) that were created by the result management system as well as the partial fingerprints from the initial tool.
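The steps above can be sketched in Python as follows. This is only an illustration of the flow as described in this comment, not behavior mandated by the spec or implemented by any real system: the `ResultStore` class, the `toyHash/v1` fingerprint key, the partial fingerprint keys, and the `tracker.example` URIs are all hypothetical, and joining sorted partial fingerprints into a SHA-256 hash is just one assumed way to derive a fingerprint.

```python
import hashlib
import uuid


def fingerprint_from_partials(partials: dict) -> str:
    """Combine tool-supplied partial fingerprints into one stable hash (assumed scheme)."""
    canonical = "|".join(f"{key}={value}" for key, value in sorted(partials.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResultStore:
    """Toy result management system keeping a hidden fingerprint -> work items map."""

    def __init__(self) -> None:
        self._work_items = {}  # fingerprint -> list of work item URIs

    def ingest(self, result: dict) -> dict:
        # Step 3: assign an arbitrary GUID only if the producer did not set one.
        result.setdefault("guid", str(uuid.uuid4()))
        # Step 3: derive the fingerprint from the partial fingerprints.
        fp = fingerprint_from_partials(result.get("partialFingerprints", {}))
        result.setdefault("fingerprints", {})["toyHash/v1"] = fp
        # Step 4: look up the work items for this fingerprint, creating one if missing.
        uris = self._work_items.setdefault(fp, [])
        if not uris:
            uris.append(f"https://tracker.example/items/{len(self._work_items)}")
        # Step 5: the enriched result is what a downstream viewer would see.
        result["workItemUris"] = list(uris)
        return result


store = ResultStore()
first = store.ingest({"ruleId": "R1", "partialFingerprints": {"word/v1": "abc", "path/v1": "src/a.c"}})
second = store.ingest({"ruleId": "R1", "partialFingerprints": {"path/v1": "src/a.c", "word/v1": "abc"}})
print(first["guid"] != second["guid"], first["workItemUris"] == second["workItemUris"])
```

In this sketch, two runs that report the same logical issue get different `guid` values but the same fingerprint, and therefore the same work item.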

This is my current understanding of the "flow" of data that SARIF assumes takes place, but is this correct? Does this mean that the fingerprint logically identifies the issue at hand? My understanding of the different fields is that:

What happens if the result management system does not work with fingerprints but instead works with correlation guids? The spec says:

Other result management systems group results into equivalence classes without associating a computed fingerprint with each result, and they denote each equivalence class with an arbitrary unique identifier. This identifier is opaque: it is not calculated from information stored in the result, and hence contains no readable information about the result.

So does this mean that correlation guids are used by systems that bucket items based on factors outside of the SARIF result? I was wondering if there is a concrete example of this, since I'm not sure what an "opaque" identifier would be based on.
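One possible reading of "opaque", sketched in Python under stated assumptions: the system decides by some external means (a triager's judgment, or matching against history) that a new result belongs with an earlier one, and the class identifier is simply a random UUID that carries no information about the result. The `CorrelationRegistry` class and its `same_as` parameter are hypothetical and only illustrate that reading; the spec does not describe how the grouping decision is made.

```python
import uuid


class CorrelationRegistry:
    """Toy grouping of results into equivalence classes with opaque identifiers."""

    def __init__(self) -> None:
        self._class_of = {}  # result guid -> correlationGuid

    def correlate(self, result_guid: str, same_as: str = None) -> str:
        """Place a result in the class of `same_as`, or start a new class.

        The grouping decision (`same_as`) comes from outside the result;
        the identifier itself is random, not computed from result content.
        """
        if same_as is not None and same_as in self._class_of:
            correlation_guid = self._class_of[same_as]
        else:
            correlation_guid = str(uuid.uuid4())
        self._class_of[result_guid] = correlation_guid
        return correlation_guid


registry = CorrelationRegistry()
c1 = registry.correlate("result-a")
c2 = registry.correlate("result-b", same_as="result-a")  # judged to be the same issue
c3 = registry.correlate("result-c")                      # unrelated issue
print(c1 == c2, c1 != c3)
```

Unlike a fingerprint, nothing about the result can be recovered from the identifier, and recomputing it from the same result is not possible.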