oasis-tcs / sarif-spec

OASIS SARIF TC: Repository for development of the draft standard, where requests for modification should be made via Github Issues
https://github.com/oasis-tcs/sarif-spec
Other
166 stars 47 forks source link

Fuzzy partial fingerprint matching and fingerprint versioning #164

Closed Evmaus-MS closed 6 years ago

Evmaus-MS commented 6 years ago

In the SARIF Spec, we cover some of the usage of fingerprints by result management systems. We also have a namespacing concept.

There are two scenarios I think we may want to provide guidance on or a convention for when providing fingerprints for a tool:

First, we should provide some guidance on versioning fingerprints, and how a result management system should read a versioned fingerprint--e.x. if you have an older fingerprinting algorithm and make some improvements to it in a tool update, you will want to be able to say "If 'Fingerprint Version 2' is present, prefer matching on it. Otherwise, match on 'Fingerprint Version 1'". My suggestion would be to use a particular namespacing convention, e.x. the tool should output "FingerprintName/v1", "FingerprintName/v2" if the results management system should interpret the fingerprints in that way. We may also want to suggest that a results management system should treat "FingerprintName" as "FingerprintName/v1" so that version information can be added after the original tool iteration.

Second, it seems possible that a tool may want to provide multiple values, of which "at least one" should match. An example of this would be providing context before and after a result, where only one of the two is necessary for the partial fingerprint to be considered a match. (Consider Semmle's hash based fingerprinting algorithm described in section 3.C here: https://semmle.com/wp-content/uploads/2015/01/team-insight.pdf ). I'd suggest we again use namespacing conventions for this, perhaps allowing "FingerprintName/0", "FingerprintName/1", "FingerprintName/2"--where any one of the three can match for "FingerprintName" to be considered a match.

We'd also probably want to cover the appropriate way to combine them (do you version and then group, or group and then version, and what's the meaning of either).

This may fall into the "guidance on usage of SARIF" rather than the specification itself, but this seems like something it'd be good to at least give some advice on.

ghost commented 6 years ago

@Evmaus-MS @michaelcfanning

Hi Everett, that's good input.

First off, it seems that your versioning suggestion is about fingerprints (§3.19.10), while your "at least one of" suggestion is about partial fingerprints (§3.19.11). If that's true, we wouldn't have to worry about how to combine the two proposed naming conventions; they apply to different properties.

I think the versioning convention for fingerprints is a good candidate for inclusion in the spec proper.

I'm less sure about the "at least one of" convention for partial fingerprints. The spec says (§3.19.11):

To make use of the information, if any, embodied in the property names, a result management system requires knowledge of the naming convention used by the SARIF producer. A result management system with that knowledge MAY use the property names to decide which partial fingerprints to include in its fingerprint computation. A result management system lacking that knowledge SHALL include all the partial fingerprints in its fingerprint computation.

The point is, to use the property names at all, the result management system needs to know the tool's partial fingerprint naming convention. In that case, the tool could just as well name the partial fingerprints "PrecedingRegionHash" and "FollowingRegionHash", and the result management system could accept a match on either one of them. So a naming convention such as "RegionHash/1", "RegionHash/2" isn't strictly necessary.

Even if such a naming convention for partial fingerprints is not strictly necessary, it might still be useful if the tool produced more than one such set of related partial fingerprints ("Foo/1", "Foo/2", "Bar/1", "Bar/2", "Bar/3"). In that case, the result management system could say "This is a match if at least one of each set of related partial fingerprints are equal" -- for example, if both "Foo/2" and "Bar/3" match.

michaelcfanning commented 6 years ago

@lgolding I don't understand the net outcome here that you are proposing. Are you agreeing that we will provide versioning for partial fingerprints. I am also not sure myself that we need to do anything with fingerprints. The fingerprints data are final, computed unique (or aspirationally unique) identifiers. If you have a match for any of them, you presumably have matched the result.

ghost commented 6 years ago

TL;DR:

I would agree that both partial fingerprint names and (full) fingerprint names could have a "version" component. I don't agree with the notion of "grouping" partial fingerprints.

Details:

It's possible that Everett's original issue description wasn't precise in its use of the terms "fingerprint" and "partial fingerprint". I was responding to what I took to be his proposal, based on the terminology he used:

With that in mind, here is my (hopefully clarified) response:

  1. Fingerprints are indeed final, computed values that reliably distinguish logically distinct results. But there might have more than one fingerprinting algorithm, and each one might go through versions. So I was agreeing to what I thought Everett was proposing, namely, to allow the property names in the result.fingerprints object to have a "version" component (e.g, "myAlgorithm/v1", "yourAlgorithm/v2").

  2. At the same time, I was pushing back on the notion of naming related partial fingerprints with a common base name. I thought Everett was proposing to structure the property names in the result.partialFingerprints object like this: "regionHash/1", "regionHash/2". I argued that I didn't see why that was better than "precedingRegionHash", "followingRegionHash".

  3. As to whether to version partial fingerprints: I think it less likely that a partial fingerprint would need to be versioned than that a (full) fingerprint would need to be versioned. Our canonical example for a partial fingerprint -- something that a result management system couldn't infer from the contents of a log file -- is a "forbidden word" detected by a geopolitical/cultural sensitivity checker. The result might be

    {
    "ruleId": "POL0001",
    "message": {
    "text": "Don't use the word 'foo'."
    },
    "partialFingerprints": {
    "forbiddenWord": "foo"
    }
    }

    But I suppose even partial fingerprint algorithms could be versioned. For example, you might decide that your "precedingRegionHash" partial fingerprint should include the preceding 5 lines instead of the preceding 10 lines. So sure, I'd allow a version component in partial fingerprint names.

ghost commented 6 years ago

@Evmaus-MS @michaelcfanning

To my great surprise, the SARIF v2 spec already supports your versioned fingerprints! See §3.19.10, "fingerprints property":

Each property name in this object SHALL be a hierarchical string (§3.4.4). A result management system MAY use the property names to identify the method used to calculate the fingerprint.

EXAMPLE: In this example, the producer has calculated a fingerprint using version 2 of a fingerprinting method it refers to as "contextRegionHash":

{
   "fingerprints": {
     "contextRegionHash/v2": "097886bc876fe"
   }
}
ghost commented 6 years ago

@Evmaus-MS @michaelcfanning

And §3.19.11 says the same thing for result.partialFingerprints.

michaelcfanning commented 6 years ago

Yay! we're done. btw, Larry, if you're interested, I can provide a clear example of the value of grouped partial fingerprints offline (based on some practical experimentation in our emerging SARIF pipeline).

ghost commented 6 years ago

@michaelcfanning Let me clearly understand what you mean by "done". My response to Everett's proposal was:

  1. We should allow versioned fingerprints --DONE
  2. We should allow versioned partial fingerprints -- DONE
  3. We should not bother with "grouped" partial fingerprints.

So if you agree with all three of my points (in particular #3), we are indeed done.

But you just said that you do see value in "grouped" partial fingerprints. If so, we are not done, right?

ghost commented 6 years ago

@michaelcfanning ... well, I suppose that once you allow hierarchical names in result.fingerprints and result.fingerprints, you can use them for anything, including grouping, for example:

regionHash/preceding/v1
regionHash/following/v2

So in that sense, we are indeed done. But @Evmaus-MS was asking for guidance on the naming convention, and on how to combine the components of the hierarchical name:

We'd also probably want to cover the appropriate way to combine them (do you version and then group, or group and then version, and what's the meaning of either).

So in that sense, we're not done.

ghost commented 6 years ago

@michaelcfanning @Evmaus-MS

My proposal:

This allows you to group both partial fingerprints and fingerprints (for what that's worth). It's easier and more consistent than restricting fingerprint property names to two components (which is what you'd have to do if you for some reason wanted to prohibit grouping of fingerprints).

michaelcfanning commented 6 years ago

Yes

Get Outlook for iOShttps://aka.ms/o0ukef


From: Larry Golding notifications@github.com Sent: Sunday, June 3, 2018 8:55:53 AM To: oasis-tcs/sarif-spec Cc: Michael Fanning; Mention Subject: Re: [oasis-tcs/sarif-spec] Fuzzy Partial Fingerprint Matching and Fingerprint Versioning (#164)

@michaelcfanninghttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmichaelcfanning&data=02%7C01%7CMichael.Fanning%40microsoft.com%7C08a12868d84c4912e1ab08d5c96a7d06%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636636381562288498&sdata=hWPy151jbcczW5WhKyPt6nRRkfB4tIF4IgxQqPK3BRY%3D&reserved=0 @Evmaus-MShttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FEvmaus-MS&data=02%7C01%7CMichael.Fanning%40microsoft.com%7C08a12868d84c4912e1ab08d5c96a7d06%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636636381562288498&sdata=SbFKtx14n2TH3Pt4EvDaWWpOYJq9oMJVLeMBttSxPFc%3D&reserved=0

My proposal:

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Foasis-tcs%2Fsarif-spec%2Fissues%2F164%23issuecomment-394171726&data=02%7C01%7CMichael.Fanning%40microsoft.com%7C08a12868d84c4912e1ab08d5c96a7d06%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636636381562288498&sdata=mVZvvoxJ8lToUKDA5VooG%2FxoBn3UfUku5%2BFd5ZYBFkM%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAM5lt2cKdbCyGiKjH1THl5XZoiD7khndks5t5AcJgaJpZM4T1qX7&data=02%7C01%7CMichael.Fanning%40microsoft.com%7C08a12868d84c4912e1ab08d5c96a7d06%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636636381562288498&sdata=XOAOkKdbqrlmRBadtY3GT7k%2FqTsHy2s68gF5sylVzH4%3D&reserved=0.