SARIF 2.2 Proposal: Add clearer and explicit relationships between `guid`, `correlationGuid`, `fingerprints`, `partialFingerprints`, and `workItemUris` `result` properties.

All of these properties are described as potentially useful to result management systems, but they sit flat on the result object. For end-customer consumption we expect workItemUris to be the most used, yet there is no clear indication of how a workItemUri maps to a result management system. The fact that workItemUris is a list while guid and correlationGuid are single values suggests that we may not have been clear on how these would be used.

In particular, what do we imagine multiple work items represent? Are they from different result systems, different sub-results within a single result, or different hashing systems? My team has complex results with several of these facets, and it's unclear how best to use these properties today given how clients will consume them.
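To make the ambiguity concrete, here is a small sketch of a result as it can be written today. The five property names are the ones under discussion; every value, and the keys inside the two fingerprint objects, are invented for illustration.

```python
# Illustrative result only: the top-level property names are the ones this
# issue discusses; all values and the fingerprint keys are made up.
result = {
    "ruleId": "CA2101",
    "guid": "11111111-1111-4111-8111-111111111111",
    "correlationGuid": "22222222-2222-4222-8222-222222222222",
    "partialFingerprints": {"contextRegionHash/v1": "asdfasdf"},
    "fingerprints": {"stableResultHash/v1": "bcdebcde"},
    "workItemUris": [
        "https://example.com/system1/items/1",
        "https://example.com/system2/items/9",
    ],
}

# The identifiers and the work items are siblings, so a consumer can only
# enumerate them; nothing says which identifier a given URI is tracked against.
identifier_props = ["guid", "correlationGuid", "fingerprints", "partialFingerprints"]
present = [name for name in identifier_props if name in result]
candidate_pairs = [(uri, name) for uri in result["workItemUris"] for name in present]

for uri, name in candidate_pairs:
    print(f"{uri} might be tracked against {name}")
```

With two work items and four identifier properties, a consumer is left with every pairing as an equally plausible guess.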

Some possible thoughts:

More clearly differentiate tool-assigned unique identifiers from result-system identifiers. The snippet below, from the description of guid, says the property can be set by SARIF producers (the tool itself) or by a result management system, but by the latter only when the tool has not set it. If a result management system SHOULD set this property, what does it do when the tool has already produced one? Does it fall back to fingerprints? That muddies the consistency of what result management systems rely on when ingesting different kinds of results.

Direct SARIF producers and SARIF converters MAY but do not need to set this property. A result management system SHOULD set this property when it ingests a SARIF log file.

While I'm opening this for general discussion, at minimum, I'd like to see workItemUris have a more direct mapping to the unique identifier that the workItemUri is being tracked against. In particular, the appendix on fingerprints perhaps muddies this further.

I don't have a great implementation I like here, but I'd perhaps break all these items into a subsection that is more clearly delineated. This is perhaps way too much change - but I wanted to get a sense of how other folks consume these properties.

{
    "results": [
        {
            "ruleId": "CA2101",
            "toolIdentifiers": {
                "guid": "DEADBEEF"
            },
            "resultSystems": [
                {
                    "Name": "System1",
                    "workItemUri": "foobar.com",
                    "partialfingerprint": "asdfasdf"
                },
                {
                    "Name": "System2",
                    "workItemUri": "foobar2.com",
                    "fingerprint": "bcdebcde"
                }
            ]
        }
    ]
}
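For what it's worth, this is roughly how I'd expect a consumer to read the shape above. It is only a sketch: toolIdentifiers, resultSystems, and the properties inside them are from my proposal and do not exist in the current spec, and tracked_identifier is a hypothetical helper name.

```python
from typing import Optional, Tuple

# Same shape as the proposal above; none of these grouped properties
# exist in the published format.
result = {
    "ruleId": "CA2101",
    "toolIdentifiers": {"guid": "DEADBEEF"},
    "resultSystems": [
        {"Name": "System1", "workItemUri": "foobar.com", "partialfingerprint": "asdfasdf"},
        {"Name": "System2", "workItemUri": "foobar2.com", "fingerprint": "bcdebcde"},
    ],
}


def tracked_identifier(result: dict, system_name: str) -> Optional[Tuple[str, str]]:
    """Return (workItemUri, identifier) for the named result system, or None."""
    for system in result.get("resultSystems", []):
        if system.get("Name") == system_name:
            # Prefer the full fingerprint; fall back to the partial one.
            identifier = system.get("fingerprint") or system.get("partialfingerprint")
            return system.get("workItemUri"), identifier
    return None


print(tracked_identifier(result, "System2"))
```

The point of the grouping is that the work item and the identifier it is tracked against are read from the same object, so the consumer no longer has to guess the pairing.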

Note - I'm not beholden to this being included in 2.2, but using that for consistent titling for now.

As part of the discussion, I wanted to check that my current understanding of guid vs. correlationGuid vs. fingerprints vs. partial fingerprints vs. work items is correct. Does the SARIF spec assume the following is how SARIF data is used and transformed?

A tool such as PolyCheck scans a codebase on a particular cadence.
When the tool finds an issue, such as an offensive word, it creates a result in a SARIF log that is transmitted to whatever result management system the dev team is using. In the result, the tool sets partial fingerprints with hashes based on properties such as the offending word or the file path it was found in, with the intended (but not required) purpose of helping determine what is logically unique.
When the result management system receives the file, it assigns each result an arbitrary, unique GUID if one has not been set already. If it is a result management system that uses fingerprints, it also calculates the fingerprints from the partial fingerprints provided, producing a logically distinct hash for the result.
The result management system has an internal, hidden mapping of fingerprint -> (work item set). When it processes the result and obtains the fingerprint, it generates a list of candidate work items that are logically distinct (as determined by the result management system) and compares the work items it needs with the work items it has. If any work items are missing or already resolved, it creates or reopens them and presents them to the user.
When a consumer such as the SARIF viewer downloads a SARIF file through the result management system, the file contains the fingerprints and work item(s) created by the result management system as well as the partial fingerprints from the original tool.
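The flow above, as I understand it, can be sketched in a few lines. This is a toy model under my own assumptions, not anything the spec prescribes: ResultManagementSystem, make_tool_result, the "term/v1", "path/v1" and "rms/v1" keys, the rule id, and the example.com URIs are all hypothetical, and SHA-256 is just a convenient stand-in for whatever hashing a real tool or system uses.

```python
import hashlib
import uuid


def short_hash(text: str) -> str:
    """Stand-in hash; a real tool or system may use something else entirely."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def make_tool_result(word: str, path: str) -> dict:
    """The tool emits a result carrying only partial fingerprints."""
    return {
        "ruleId": "OffensiveTerm",  # hypothetical rule id
        "partialFingerprints": {"term/v1": short_hash(word), "path/v1": short_hash(path)},
    }


class ResultManagementSystem:
    def __init__(self) -> None:
        # Internal, hidden mapping: fingerprint -> list of work items.
        self.work_items = {}

    def ingest(self, result: dict) -> dict:
        # Assign an arbitrary unique GUID only if the tool did not set one.
        result.setdefault("guid", str(uuid.uuid4()))

        # Derive one fingerprint from the partial fingerprints (sorted for stability).
        parts = result["partialFingerprints"]
        joined = "|".join(f"{key}={parts[key]}" for key in sorted(parts))
        fingerprint = short_hash(joined)
        result["fingerprints"] = {"rms/v1": fingerprint}

        # Create a work item if none exists, reopen any that were resolved.
        items = self.work_items.setdefault(fingerprint, [])
        if not items:
            uri = f"https://example.com/workitems/{len(self.work_items)}"
            items.append({"uri": uri, "state": "open"})
        for item in items:
            if item["state"] == "resolved":
                item["state"] = "open"

        result["workItemUris"] = [item["uri"] for item in items]
        return result


rms = ResultManagementSystem()
first = rms.ingest(make_tool_result("badword", "src/a.txt"))
second = rms.ingest(make_tool_result("badword", "src/a.txt"))
print(first["guid"] != second["guid"])                   # distinct per-result GUIDs
print(first["workItemUris"] == second["workItemUris"])   # same logical issue
```

In this model two scans that find the same word in the same file get different GUIDs but land on the same fingerprint and therefore the same work item, which is the behavior I am asking the spec to confirm or correct.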

This is my current understanding of the "flow" of data that SARIF assumes, but is it correct? Does this mean that the fingerprint logically identifies the issue at hand? My understanding of the individual fields is:

Guid identifies a result uniquely, even if it reports the same error as another result. It is not meant for bucketing or work item creation but rather for debugging and searching. The tool that created the result can set it, but since the guid isn't derived from any data in the result, the result management system can generate it if it is missing.
Partial fingerprints are made by the tool to capture information that the tool developer thinks is useful for uniquely identifying a result. However, the result management system ultimately has the final say in how these are used.
Fingerprints can be calculated from the partial fingerprints. However, the owners of a result management system may have a different philosophy about what is unique, so they can choose which fields to use when calculating the fingerprints.
Work items are created by the result management system however it wants, likely based on the fingerprint but possibly based on anything. Because of this, apart from the work item URI identifying the work item, work items are not tied to any particular field in the SARIF spec, with the intention of keeping this more free-form(?)

What happens when the result management system does not work with fingerprints but with correlation GUIDs instead? The spec says:

Other result management systems group results into equivalence classes without associating a computed fingerprint with each result, and they denote each equivalence class with an arbitrary unique identifier. This identifier is opaque: it is not calculated from information stored in the result, and hence contains no readable information about the result.

So does this mean that correlation GUIDs are used by systems that bucket results based on factors outside the SARIF result? I was wondering whether there is a concrete example of this, since I'm not sure what an "opaque" identifier would be based on.
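My best guess at what that could look like is below. This is only one possible reading and is not confirmed by the spec: CorrelationRegistry and its methods are hypothetical, and the decision that two results are "the same" is assumed to come from outside the result, for example from a person triaging them.

```python
import uuid


class CorrelationRegistry:
    """Hypothetical store mapping each result guid to an equivalence-class id."""

    def __init__(self) -> None:
        self.class_of = {}  # result guid -> correlationGuid

    def new_class(self, result: dict) -> str:
        # A random UUID: it is not computed from any field of the result,
        # so it carries no readable information about it.
        class_id = str(uuid.uuid4())
        self.class_of[result["guid"]] = class_id
        result["correlationGuid"] = class_id
        return class_id

    def same_as(self, result: dict, earlier: dict) -> str:
        # Called when something outside the result (a triager, a matcher)
        # decides that `result` is the same logical issue as `earlier`.
        class_id = self.class_of[earlier["guid"]]
        self.class_of[result["guid"]] = class_id
        result["correlationGuid"] = class_id
        return class_id


registry = CorrelationRegistry()
before_rename = {"guid": "a", "uri": "src/old_name.cs"}
after_rename = {"guid": "b", "uri": "src/new_name.cs"}

registry.new_class(before_rename)
registry.same_as(after_rename, before_rename)
print(before_rename["correlationGuid"] == after_rename["correlationGuid"])
```

Here the two results share nothing a hash could latch onto after the file rename, yet they end up in the same equivalence class because the grouping decision lives in the system rather than in the result.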
