Clarify use (or extend design) of SARIF to express hierarchical diagnostics.

oasis-tcs / sarif-spec

OASIS SARIF TC: Repository for development of the draft standard, where requests for modification should be made via Github Issues

https://github.com/oasis-tcs/sarif-spec

Other

168 stars 47 forks source link

Clarify use (or extend design) of SARIF to express hierarchical diagnostics. #572

Open michaelcfanning opened 1 year ago

TartanLlama commented 1 year ago

Context: in Visual Studio we're working on hierarchical diagnostics to present compiler errors in a more structured way than plain text. Clang is also working on a similar project.

A C++ example:

struct dog{};
struct cat{};

void pet(dog);
void pet(cat);

template <class T>
concept pettable = requires(T t) { t.pet(); };

template <pettable T>
void pet(T);

struct donkey {};

int main() {
    pet(donkey{});
}

MSVC outputs an error message like this in plain text:

error C2665: 'pet': no overloaded function could convert all the argument types
message : could be 'void pet(cat)'
message : 'void pet(cat)': cannot convert argument 1 from 'donkey' to 'cat'
message : No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
message : or       'void pet(dog)'
message : 'void pet(dog)': cannot convert argument 1 from 'donkey' to 'dog'
message : No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
message : or       'void pet(T)'
message : the associated constraints are not satisfied
message : the concept 'pettable<donkey>' evaluated to false
message : 'pet': is not a member of 'donkey'
message : see declaration of 'donkey'
message : while trying to match the argument list '(donkey)'

However, there's a logical structure which isn't represented here, which looks something like:

error C2665: 'pet': no overloaded function could convert all the argument types
- could be 'void pet(cat)'
- 'void pet(cat)': cannot convert argument 1 from 'donkey' to 'cat'
  - No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
- or 'void pet(dog)'
- 'void pet(dog)': cannot convert argument 1 from 'donkey' to 'dog'
  - No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
- or 'void pet(T)'
- the associated constraints are not satisfied
  - the concept 'pettable' evaluated to false
  - 'pet': is not a member of 'donkey'
  - see declaration of 'donkey'
  - while trying to match the argument list '(donkey)'

In order to represent this hierarchy, we want to be able to encode it in SARIF. However, there's currently no facility for this.

The way we do this today is by placing the nested diagnostics in a relatedLocations object. To each related location, we add a properties property which contains a nestingLevel property. This is essentially the same as the nestingLevel property of threadFlowLocation objects ([3.38.10]).

For example:

{
   "ruleId":"C2665",
   "level":"error",
   "message":{
      "text":"'pet': no overloaded function could convert all the argument types"
   },
   "locations":[
      {
         "physicalLocation":{
            "artifactLocation":{
               "uri":"diag.cpp"
            },
            "region":{
               "startLine":35,
               "startColumn":4
            }
         }
      }
   ],
   "relatedLocations":[
      {
         "id":0,
         "physicalLocation":{
            "artifactLocation":{
               "uri":"diag.cpp"
            },
            "region":{
               "startLine":11,
               "startColumn":5
            }
         },
         "message":{
            "text":"could be 'void __cdecl pet(struct dog)'"
         }
      },
      {
         "id":1,
         "physicalLocation":{
            "artifactLocation":{
               "uri":"diag.cpp"
            },
            "region":{
               "startLine":35,
               "startColumn":4
            }
         },
         "message":{
            "text":"'void __cdecl pet(struct dog)': cannot convert argument 1 from 'int' to 'struct dog'"
         },
         "properties":{
            "nestingLevel":1
         }
      },
      {
         "id":2,
         "physicalLocation":{
            "artifactLocation":{
               "uri":"diag.cpp"
            },
            "region":{
               "startLine":35,
               "startColumn":8
            }
         },
         "message":{
            "text":"No constructor could take the source type, or constructor overload resolution was ambiguous"
         },
         "properties":{
            "nestingLevel":1
         }
      }
   ]
}

I'd like for all C++ compilers to agree on how to encode hierarchical diagnostics, and having it standardised in SARIF seems to be the best way to achieve this.

cjdb commented 1 year ago

Clang is in the middle of working out how to restructure its diagnostics internally, which is driven by the desire for this.

michaelcfanning commented 1 year ago

Sorry for the delay on this topic. We are discussing at tomorrow's TC meeting.

TartanLlama commented 6 months ago

There's some appetite in the ISO C++ standards group for converging on SARIF with an extension like this. Has this item been discussed by the SARIF group yet?

davidmalcolm commented 4 months ago

Sy posted this updated proposal to SG15: P3358R0 SARIF for Structured Diagnostics; I'm taking a look from the GCC side.

davidmalcolm commented 3 months ago

@TartanLlama Hi, thanks for your various proposals.

FWIW I now have a working (but messy) prototype of your latest proposal in GCC: Screenshot from 2024-07-17 14-41-28 ...but I have some concerns about the way the representation you've picked might interact with other relationships between the locations in a sarifResult - in particular, the way the nestingLevel attrs in the property bags seem to require an ordering of the relatedLocations for them to have meaning.

As I understand it, every sarifResult has its own "namespace" of numeric IDs for locations within it, and the relatedLocations within a sarifResult effectively form a directed graph. Currently the only relationship expressable between locations is a chain of includes; see 3.34 locationRelationship object.

I'm about to try adding support for capturing "include chains" to GCC's SARIF output, which is going to mean adding lots of relatedLocations, and it's probably going to break my nested locations prototype.

So I think this might be better expressed as a new "kind" within locationRelationship's 3.34.3 kinds property. That said, I haven't yet worked with the locationRelationship part of the spec; maybe others have opinions here?

FWIW I'm also interested in capturing information about macro expansions, which might be another kind of location relationship; see #650.

Your proposal also mentions using JSON-RPC for sending sarifResult objects one at a time. FWIW I've filed this RFE to implement something like this in GCC (no promises though!)

davidmalcolm commented 3 months ago

@TartanLlama Hi, thanks for your various proposals. [...snip...] ...but I have some concerns about the way the representation you've picked might interact with other relationships between the locations in a sarifResult - in particular, the way the nestingLevel attrs in the property bags seem to require an ordering of the relatedLocations for them to have meaning.

To clarify: clearly you want to preserve the ordering of the messages, but if, say, there are include chains in some of the messages, we need to refer to the locations of the #include directives. Would these be at the same nestingLevel as the message they relate to?

If we omit nestingLevel for such locations, then as per the proposal as written it's implicitly 0, and then the next location after it with a nestingLevel > 0 would be treated as a child of that location, not the one it's logically part of.

But ideally if a bunch of locations appear in one sarifResult referencing the same include chain (say foo.h includes bar.h and there are various issues reported in bar.h as part of a hierarchical diagnostic), presumably we want to do this "lazily" and only list each location once in relatedLocations.

Sorry if the above is unclear

TartanLlama commented 3 months ago

Thank for these thoughts, @davidmalcolm (sorry for the delay in responding, I've been on vacation).

Yeah, everything you wrote makes sense to me, I can see how the current design can have issues in these cases. Let's say we did as you suggest and made a new kind of locationRelationship. It seems sufficient to make relationships run in one direction, from parent to child. We could call it contains or something like that. I think in that case, the hierarchical example from the paper would look something like:

        "result": {
            "message": {
                "text": "'pet': no overloaded function could convert all the argument types"
            },
            "locations": [
                {
                    "id": 0,
                    "relationships": [
                        {
                            "target": 1,
                            "kinds": [
                                "contains"
                            ]
                        }
                    ]
                }
            ],
            "relatedLocations": [
                {
                    "id": 1,
                    "message": {
                        "text": "or       'void pet(_T0)'"
                    }
                   "relationships": [
                        {
                            "target": 2,
                            "kinds": [
                                "contains"
                            ]
                        }
                    ]
                },
                /*...*/
            ]
        }

Is that what you had in mind? Do you see any potential issues with this design? I'll ask the other folks at Microsoft who have been working on this.

davidmalcolm commented 3 months ago

@TartanLlama Thanks for the feedback.

FWIW I'm now tracking this idea in GCC's bugtracker as an RFE as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116253

I've attached my patch for G++ and its .sarif output there. This implements the original nestingLevel property bag property from your P3358R0 version of the proposal.

Is that what you had in mind?

Yes, very much so - thanks!

Do you see any potential issues with this design?

I'll take a look at implementing the relatedLocations version of the proposal to see how well it holds up from the GCC producer side of things.

I'll try to put together a version of the testcase that uses #include (GCC's SARIF output now captures this; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107941)

I'll ask the other folks at Microsoft who have been working on this.

Thanks again

davidmalcolm commented 3 months ago

I'll try to put together a version of the testcase that uses #include (GCC's SARIF output now captures this; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107941)

@TartanLlama I had a go at doing this, chopping up your test case into https://gist.github.com/davidmalcolm/bc84ca043be0c0ef3a3a4d1e9d748c2b

It turned out my GCC implementation of your nestingLevel property bag property happens to work due to my implementation deferring all related locations for #include to after everything else. The output can be seen here: https://gist.github.com/davidmalcolm/38cff5c1995d21fc7d7abbecb3e25a85; if I'm reading things right, there's a location and a bunch of relatedLocations for the hierarchical notes, then a trailing bunch of relatedLocations for the #include lines.

So I think it happens to work for my implementation, but wouldn't if a producer happened to walk the graph of relationships in a different way when writing out the SARIF.

davidmalcolm commented 3 months ago

@TartanLlama FWIW we're having a meeting of the SARIF technical committee on Thursday; I'll try to raise this issue there.

davidmalcolm commented 3 months ago

@TartanLlama There's a whole other aspect to your proposal (beyond nested diagnostics) that I found interesting, which is that in 3.1.1. MSVC you document that MSVC can look for SARIF_OUTPUT_PIPE and send its diagnostics as JSON-RPC notifications there.

This seems really useful for IDE integration, and maybe there's an appetite for standardizing it? If so, should this be a separate issue in this tracker, and perhaps a separate proposal to SG15?

What you document seems to be very Windows-specific (using a UTF-16-encoded integer representation of a HANDLE). I've had a go at implementing it for GCC on Linux. I'm tracking that in GCC's bug tracker as RFE: add support for sending SARIF output via IPC. My first attempt was using a named FIFO, but that had the limitation of only being for a single tool: once the FIFO is closed, we're done. The latest patch uses unix domain sockets: If the environment has SARIF_SOCKET set to a path to a socket, then GCC diagnostics are emitted as JSON-RPC notifications to the socket in a manner akin to that for "SARIF_OUTPUT_PIPE" in P3358R0, albeit with Unix domain sockets. The patch also implements a trivial "sarif-listener" program that creates the socket and listens for notifications. I can run that in one terminal, and run various builds using GCC in another, and see the diagnostics from those builds appear as SARIF notifications in the first terminal.

My patch currently adds (and requires) the command-line option -fdiagnostics-format=sarif-socket, but it might be good to simply have the environment variable. That way an IDE could make the socket and set the env var when invoking a build framework, and then all diagnostics would go to the IDE, rather than textually, and users wouldn't have to go messing about with command-line variables on all of the different build systems that are out there. Or maybe that's too much "magic"? We could potentially do both text and socket output, but that would require more surgery to GCC's diagnostics framework.

The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?

I also wonder if we might want separate notifications for "newResult" and "endOfRun", where the latter could contain information on the tool etc (and maybe also a startOfRun notification?), so that when a connection comes in to the socket it gets metadata as well as just diagnostics.

Hope this is constructive

TartanLlama commented 3 months ago

So I think it happens to work for my implementation, but wouldn't if a producer happened to walk the graph of relationships in a different way when writing out the SARIF.

Makes sense, I guess we could specify that a producer would have to output locations in this manner, but it seems pretty fragile, better to have a more robust encoding method.

FWIW we're having a meeting of the SARIF technical committee on Thursday; I'll try to raise this issue there.

Would it help if I were to join as well?

If so, should this be a separate issue in this tracker, and perhaps a separate proposal to SG15?

For the SARIF standard, yeah, I think it could make sense for the spec to outline a streaming format for SARIF results, potentially based on JSON RPC. The pipe communication seems perhaps outside the purview of the SARIF format though, what do you think? For SG15, the idea is for all of this to end up in the Ecosystem IS. I think after the initial paper has been discussed by SG15, we can decide the best course of action there.

The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?

There should not be any trailing newlines after the JSON itself. Here's an example of what's sent along the pipe: SarifOutput.txt.

I also wonder if we might want separate notifications for "newResult" and "endOfRun", where the latter could contain information on the tool

At least for our use cases we have other ways to extract the toolset information we need (e.g. from the CMake info), but this is certainly something to consider if use cases for it come up!

davidmalcolm commented 3 months ago

So I think it happens to work for my implementation, but wouldn't if a producer happened to walk the graph of relationships in a different way when writing out the SARIF.

Makes sense, I guess we could specify that a producer would have to output locations in this manner, but it seems pretty fragile, better to have a more robust encoding method.

I suppose the question here is about when the precise ordering within relatedLocations has significance. I think it makes sense for the relative ordering to matter for locations with message properties (since presumably they will be presented to the human reader in that order), but less so for those that are merely referenced by relationships.

Another approach might be to amend the original nestingLevel proposal so that nesting levels only apply to locations with message properties, other locations that are merely referenced don't appear in the ordering. But then you could have the case of a chain of location referenced by relationships where some have messages and some don't, so I don't think that approach works.

FWIW we're having a meeting of the SARIF technical committee on Thursday; I'll try to raise this issue there.

Would it help if I were to join as well?

I asked about this, but unfortunately, OASIS no longer allows guests in meetings, only members.

I'm happy to have an informal videochat with you if it would be helpful, or maybe it's better to keep things within the issue tracker? Not sure.

If so, should this be a separate issue in this tracker, and perhaps a separate proposal to SG15?

For the SARIF standard, yeah, I think it could make sense for the spec to outline a streaming format for SARIF results, potentially based on JSON RPC. The pipe communication seems perhaps outside the purview of the SARIF format though, what do you think? For SG15, the idea is for all of this to end up in the Ecosystem IS. I think after the initial paper has been discussed by SG15, we can decide the best course of action there.

I confess I'm unfamiliar with the workflows/relationships of SG15 and the Ecosystem IS, but I can at least talk about all this with the SARIF TC.

The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?

There should not be any trailing newlines after the JSON itself. Here's an example of what's sent along the pipe: SarifOutput.txt.

Am I right in thinking that MSVC is effectively reusing the Base Protocol part of the LSP for "framing" the JSON-RPC notifications?

I also wonder if we might want separate notifications for "newResult" and "endOfRun", where the latter could contain information on the tool

At least for our use cases we have other ways to extract the toolset information we need (e.g. from the CMake info), but this is certainly something to consider if use cases for it come up!

For your SARIF_OUTPUT_PIPE stuff, are you doing this per invocation of the compiler on a single source file, or per invocation of a build framework on a whole project? The SARIF_SOCKET idea from my patch is aiming at the latter, but I haven't thought through the metadata stuff for recording invocations and associating individual result objects with them.

davidmalcolm commented 3 months ago

The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?

There should not be any trailing newlines after the JSON itself. Here's an example of what's sent along the pipe: SarifOutput.txt.

Thanks.

For my reference, I had a look at the boundary between the two notifications in that output, here:

000012e0  65 73 73 61 67 65 22 3a  7b 22 74 65 78 74 22 3a  |essage":{"text":|
000012f0  22 77 68 69 6c 65 20 74  72 79 69 6e 67 20 74 6f  |"while trying to|
00001300  20 6d 61 74 63 68 20 74  68 65 20 61 72 67 75 6d  | match the argum|
00001310  65 6e 74 20 6c 69 73 74  20 27 28 6c 69 7a 61 72  |ent list '(lizar|
00001320  64 29 27 22 7d 7d 5d 7d  7d 7d 43 6f 6e 74 65 6e  |d)'"}}]}}}Conten|
00001330  74 2d 4c 65 6e 67 74 68  3a 20 39 37 39 30 0d 0d  |t-Length: 9790..|
00001340  0a 0d 0d 0a 7b 22 6a 73  6f 6e 72 70 63 22 3a 22  |....{"jsonrpc":"|
00001350  32 2e 30 22 2c 22 6d 65  74 68 6f 64 22 3a 22 4f  |2.0","method":"O|
00001360  6e 53 61 72 69 66 52 65  73 75 6c 74 22 2c 22 70  |nSarifResult","p|
00001370  61 72 61 6d 73 22 3a 7b  22 72 65 73 75 6c 74 22  |arams":{"result"|
00001380  3a 7b 22 72 75 6c 65 49  64 22 3a 22 43 32 36 36  |:{"ruleId":"C266|
00001390  35 22 2c 22 6c 65 76 65  6c 22 3a 22 65 72 72 6f  |5","level":"erro|
000013a0  72 22 2c 22 6d 65 73 73  61 67 65 22 3a 7b 22 74  |r","message":{"t|

where the 2nd Content-Length does indeed begin immediately after the final } from the 1st notification (at offset 0x132a if I'm counting bytes correctly).

davidmalcolm commented 3 months ago

Note to self: vs-streamjsonrpc documents various message delineation approaches here: https://github.com/microsoft/vs-streamjsonrpc/blob/main/doc/extensibility.md#message-handlers

It appears that MSVC is using the HeaderDelimitedMessageHandler, which the above link says "is the default handler. It is compatible with the vscode-jsonrpc NPM package."

TartanLlama commented 3 months ago

I'm happy to have an informal videochat with you if it would be helpful, or maybe it's better to keep things within the issue tracker? Not sure.

Maybe you can discuss with the SARIF TC and I can talk to SG15 and then we can schedule a videocall?

I confess I'm unfamiliar with the workflows/relationships of SG15 and the Ecosystem IS, but I can at least talk about all this with the SARIF TC.

SG15 have teleconferences between in-person WG21 meetings, the next should be in the next week or so. The Ecosystem IS is a planned international standard that SG15 is working on that standardises tooling elements of C++ that until now have lived outside the purview of the C++ standard. SARIF output and communication is part of that.

Am I right in thinking that MSVC is effectively reusing the Base Protocol part of the LSP for "framing" the JSON-RPC notifications?

Yeah, as you note in a later message, it's part of the existing vs-streamjsonrpc format.

For your SARIF_OUTPUT_PIPE stuff, are you doing this per invocation of the compiler on a single source file, or per invocation of a build framework on a whole project?

Currently MSBuild creates a pipe per compiler invocation and passes back information to the IDE via existing VS <-> MSBuild communication channels

ShiningMassXAcc commented 3 months ago

Adding my very quick comments here summarized from the TC discussion today (caveating that I'm not a heavy user of these features). Also citing to leverage Aditya for CodeQL in particular, as it has more hierarchical data that what I've leveraged historically.

Overall, it feels like the result object already has properties that support similar structure of data (codeflows, graphs, graphtraversals, stacks). The related locations list, to me, is best served being kept simple as a reference material for the select properties that could reference it.

This data looks like you want a graph traversal through a graph. A graph (3.39.1 in sarif 2.1.0) is general purpose described as "a network of nodes and directed edges that describes some aspect of the structure of the code". This seems to fit the purpose quite well to me. Locations within this scheme are defined on nodes in 3.40.4 in sarif 2.1.0, instead of the related locations list. There are various message properties within these structures that can be used to provide context to the consumer when rendering.

davidmalcolm commented 3 months ago

Adding my own summary of the 2024-08-08 TC discussion about this:

We brainstormed a bit on the various ways that SARIF could be extended to support nested messages:

Sy's proposal to add a nestingLevel property, and to require ordering of related locations to matter (or maybe just for those with a message and/or nestingLevel?)
express the nesting as nesting within the JSON itself. There was wariness about doing this (IIRC "difficulty of checking a recursive schema" was mentioned)
add an optional parent/child relationship between relatedLocations as a new kind of relationship
add an optional ordering relationship between relatedLocations as a new kind of relationship
put this in some other SARIF field, e.g. like a stack.
- How would a consumer present this to the user?

IIRC the consensus was that we hadn't yet got enough implementation experience with the idea to be able to recommend a particular approach (but I could be misremembering), and I said I'll continue to experiment with doing it from the GCC side.

After the meeting another idea occurred to me: to extend 3.11 message object so as well as "plain text" and "formatted" messages, support structured messages - somehow.

I hope I'm correctly remembering/summarizing the discusssion.

davidmalcolm commented 1 month ago

I'll try to summarize the various unknowns raised by P3358R0 SARIF for Structured Diagnostics here; the wording is my own, though:

How should nested/hierarchical diagnostics be expressed in SARIF?

See e.g. https://github.com/oasis-tcs/sarif-spec/issues/572#issuecomment-2287059926; there's also been discussion about using the graph parts of SARIF for handling this.

Do we need to generalize SARIF to cover "dynamic" or "progressive" notifications about results, rather than dealing with a log that is already complete?

I've filed this as https://github.com/oasis-tcs/sarif-spec/issues/661

How should an IDE interact with a compiler via SARIF?

This feels like a superset of #661, but also supposes some form of IPC.

ShiningMassXAcc commented 12 hours ago

After another look at this, my inclination would currently be to figure out how to t5ransform your relatedLocations reference within the graphtraversal objects. Nodes have locations today: https://docs.oasis-open.org/sarif/sarif/v2.1.0/csprd01/sarif-v2.1.0-csprd01.html#_Toc10541213