Open michaelcfanning opened 1 year ago
Clang is in the middle of working out how to restructure its diagnostics internally, which is driven by the desire for this.
Sorry for the delay on this topic. We are discussing at tomorrow's TC meeting.
There's some appetite in the ISO C++ standards group for converging on SARIF with an extension like this. Has this item been discussed by the SARIF group yet?
Sy posted this updated proposal to SG15: P3358R0 SARIF for Structured Diagnostics; I'm taking a look from the GCC side.
@TartanLlama Hi, thanks for your various proposals.
FWIW I now have a working (but messy) prototype of your latest proposal in GCC:
...but I have some concerns about the way the representation you've picked might interact with other relationships between the locations in a sarifResult
- in particular, the way the nestingLevel
attrs in the property bags seem to require an ordering of the relatedLocations
for them to have meaning.
As I understand it, every sarifResult
has its own "namespace" of numeric IDs for locations within it, and the relatedLocations within a sarifResult
effectively form a directed graph. Currently the only relationship expressable between locations is a chain of includes; see 3.34 locationRelationship object.
I'm about to try adding support for capturing "include chains" to GCC's SARIF output, which is going to mean adding lots of relatedLocations
, and it's probably going to break my nested locations prototype.
So I think this might be better expressed as a new "kind" within locationRelationship
's 3.34.3 kinds property. That said, I haven't yet worked with the locationRelationship
part of the spec; maybe others have opinions here?
FWIW I'm also interested in capturing information about macro expansions, which might be another kind of location relationship; see #650.
Your proposal also mentions using JSON-RPC for sending sarifResult objects one at a time. FWIW I've filed this RFE to implement something like this in GCC (no promises though!)
@TartanLlama Hi, thanks for your various proposals. [...snip...] ...but I have some concerns about the way the representation you've picked might interact with other relationships between the locations in a
sarifResult
- in particular, the way thenestingLevel
attrs in the property bags seem to require an ordering of therelatedLocations
for them to have meaning.
To clarify: clearly you want to preserve the ordering of the messages, but if, say, there are include chains in some of the messages, we need to refer to the locations of the #include
directives. Would these be at the same nestingLevel
as the message they relate to?
If we omit nestingLevel
for such locations, then as per the proposal as written it's implicitly 0, and then the next location after it with a nestingLevel
> 0 would be treated as a child of that location, not the one it's logically part of.
But ideally if a bunch of locations appear in one sarifResult
referencing the same include chain (say foo.h includes bar.h and there are various issues reported in bar.h as part of a hierarchical diagnostic), presumably we want to do this "lazily" and only list each location once in relatedLocations.
Sorry if the above is unclear
Thank for these thoughts, @davidmalcolm (sorry for the delay in responding, I've been on vacation).
Yeah, everything you wrote makes sense to me, I can see how the current design can have issues in these cases. Let's say we did as you suggest and made a new kind
of locationRelationship
. It seems sufficient to make relationships run in one direction, from parent to child. We could call it contains
or something like that. I think in that case, the hierarchical example from the paper would look something like:
"result": {
"message": {
"text": "'pet': no overloaded function could convert all the argument types"
},
"locations": [
{
"id": 0,
"relationships": [
{
"target": 1,
"kinds": [
"contains"
]
}
]
}
],
"relatedLocations": [
{
"id": 1,
"message": {
"text": "or 'void pet(_T0)'"
}
"relationships": [
{
"target": 2,
"kinds": [
"contains"
]
}
]
},
/*...*/
]
}
Is that what you had in mind? Do you see any potential issues with this design? I'll ask the other folks at Microsoft who have been working on this.
@TartanLlama Thanks for the feedback.
FWIW I'm now tracking this idea in GCC's bugtracker as an RFE as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116253
I've attached my patch for G++ and its .sarif output there. This implements the original nestingLevel
property bag property from your P3358R0 version of the proposal.
Is that what you had in mind?
Yes, very much so - thanks!
Do you see any potential issues with this design?
I'll take a look at implementing the relatedLocations version of the proposal to see how well it holds up from the GCC producer side of things.
I'll try to put together a version of the testcase that uses #include
(GCC's SARIF output now captures this; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107941)
I'll ask the other folks at Microsoft who have been working on this.
Thanks again
I'll try to put together a version of the testcase that uses
#include
(GCC's SARIF output now captures this; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107941)
@TartanLlama I had a go at doing this, chopping up your test case into https://gist.github.com/davidmalcolm/bc84ca043be0c0ef3a3a4d1e9d748c2b
It turned out my GCC implementation of your nestingLevel
property bag property happens to work due to my implementation deferring all related locations for #include
to after everything else. The output can be seen here: https://gist.github.com/davidmalcolm/38cff5c1995d21fc7d7abbecb3e25a85; if I'm reading things right, there's a location and a bunch of relatedLocations for the hierarchical notes, then a trailing bunch of relatedLocations for the #include
lines.
So I think it happens to work for my implementation, but wouldn't if a producer happened to walk the graph of relationships in a different way when writing out the SARIF.
@TartanLlama FWIW we're having a meeting of the SARIF technical committee on Thursday; I'll try to raise this issue there.
@TartanLlama There's a whole other aspect to your proposal (beyond nested diagnostics) that I found interesting, which is that in 3.1.1. MSVC you document that MSVC can look for SARIF_OUTPUT_PIPE and send its diagnostics as JSON-RPC notifications there.
This seems really useful for IDE integration, and maybe there's an appetite for standardizing it? If so, should this be a separate issue in this tracker, and perhaps a separate proposal to SG15?
What you document seems to be very Windows-specific (using a UTF-16-encoded integer representation of a HANDLE). I've had a go at implementing it for GCC on Linux. I'm tracking that in GCC's bug tracker as RFE: add support for sending SARIF output via IPC. My first attempt was using a named FIFO, but that had the limitation of only being for a single tool: once the FIFO is closed, we're done. The latest patch uses unix domain sockets: If the environment has SARIF_SOCKET set to a path to a socket, then GCC diagnostics are emitted as JSON-RPC notifications to the socket in a manner akin to that for "SARIF_OUTPUT_PIPE" in P3358R0, albeit with Unix domain sockets. The patch also implements a trivial "sarif-listener" program that creates the socket and listens for notifications. I can run that in one terminal, and run various builds using GCC in another, and see the diagnostics from those builds appear as SARIF notifications in the first terminal.
My patch currently adds (and requires) the command-line option -fdiagnostics-format=sarif-socket
, but it might be good to simply have the environment variable. That way an IDE could make the socket and set the env var when invoking a build framework, and then all diagnostics would go to the IDE, rather than textually, and users wouldn't have to go messing about with command-line variables on all of the different build systems that are out there. Or maybe that's too much "magic"? We could potentially do both text and socket output, but that would require more surgery to GCC's diagnostics framework.
The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?
I also wonder if we might want separate notifications for "newResult" and "endOfRun", where the latter could contain information on the tool etc (and maybe also a startOfRun notification?), so that when a connection comes in to the socket it gets metadata as well as just diagnostics.
Hope this is constructive
So I think it happens to work for my implementation, but wouldn't if a producer happened to walk the graph of relationships in a different way when writing out the SARIF.
Makes sense, I guess we could specify that a producer would have to output locations in this manner, but it seems pretty fragile, better to have a more robust encoding method.
FWIW we're having a meeting of the SARIF technical committee on Thursday; I'll try to raise this issue there.
Would it help if I were to join as well?
If so, should this be a separate issue in this tracker, and perhaps a separate proposal to SG15?
For the SARIF standard, yeah, I think it could make sense for the spec to outline a streaming format for SARIF results, potentially based on JSON RPC. The pipe communication seems perhaps outside the purview of the SARIF format though, what do you think? For SG15, the idea is for all of this to end up in the Ecosystem IS. I think after the initial paper has been discussed by SG15, we can decide the best course of action there.
The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?
There should not be any trailing newlines after the JSON itself. Here's an example of what's sent along the pipe: SarifOutput.txt.
I also wonder if we might want separate notifications for "newResult" and "endOfRun", where the latter could contain information on the tool
At least for our use cases we have other ways to extract the toolset information we need (e.g. from the CMake info), but this is certainly something to consider if use cases for it come up!
So I think it happens to work for my implementation, but wouldn't if a producer happened to walk the graph of relationships in a different way when writing out the SARIF.
Makes sense, I guess we could specify that a producer would have to output locations in this manner, but it seems pretty fragile, better to have a more robust encoding method.
I suppose the question here is about when the precise ordering within relatedLocations
has significance. I think it makes sense for the relative ordering to matter for locations with message
properties (since presumably they will be presented to the human reader in that order), but less so for those that are merely referenced by relationships.
Another approach might be to amend the original nestingLevel proposal so that nesting levels only apply to locations with message
properties, other locations that are merely referenced don't appear in the ordering. But then you could have the case of a chain of location referenced by relationships where some have messages and some don't, so I don't think that approach works.
FWIW we're having a meeting of the SARIF technical committee on Thursday; I'll try to raise this issue there.
Would it help if I were to join as well?
I asked about this, but unfortunately, OASIS no longer allows guests in meetings, only members.
I'm happy to have an informal videochat with you if it would be helpful, or maybe it's better to keep things within the issue tracker? Not sure.
If so, should this be a separate issue in this tracker, and perhaps a separate proposal to SG15?
For the SARIF standard, yeah, I think it could make sense for the spec to outline a streaming format for SARIF results, potentially based on JSON RPC. The pipe communication seems perhaps outside the purview of the SARIF format though, what do you think? For SG15, the idea is for all of this to end up in the Ecosystem IS. I think after the initial paper has been discussed by SG15, we can decide the best course of action there.
I confess I'm unfamiliar with the workflows/relationships of SG15 and the Ecosystem IS, but I can at least talk about all this with the SARIF TC.
The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?
There should not be any trailing newlines after the JSON itself. Here's an example of what's sent along the pipe: SarifOutput.txt.
Am I right in thinking that MSVC is effectively reusing the Base Protocol part of the LSP for "framing" the JSON-RPC notifications?
I also wonder if we might want separate notifications for "newResult" and "endOfRun", where the latter could contain information on the tool
At least for our use cases we have other ways to extract the toolset information we need (e.g. from the CMake info), but this is certainly something to consider if use cases for it come up!
For your SARIF_OUTPUT_PIPE
stuff, are you doing this per invocation of the compiler on a single source file, or per invocation of a build framework on a whole project? The SARIF_SOCKET
idea from my patch is aiming at the latter, but I haven't thought through the metadata stuff for recording invocations and associating individual result objects with them.
The precise form of the RPC messages isn't quite clear to me. For example, should there be any trailing newlines after the JSON itself?
There should not be any trailing newlines after the JSON itself. Here's an example of what's sent along the pipe: SarifOutput.txt.
Thanks.
For my reference, I had a look at the boundary between the two notifications in that output, here:
000012e0 65 73 73 61 67 65 22 3a 7b 22 74 65 78 74 22 3a |essage":{"text":|
000012f0 22 77 68 69 6c 65 20 74 72 79 69 6e 67 20 74 6f |"while trying to|
00001300 20 6d 61 74 63 68 20 74 68 65 20 61 72 67 75 6d | match the argum|
00001310 65 6e 74 20 6c 69 73 74 20 27 28 6c 69 7a 61 72 |ent list '(lizar|
00001320 64 29 27 22 7d 7d 5d 7d 7d 7d 43 6f 6e 74 65 6e |d)'"}}]}}}Conten|
00001330 74 2d 4c 65 6e 67 74 68 3a 20 39 37 39 30 0d 0d |t-Length: 9790..|
00001340 0a 0d 0d 0a 7b 22 6a 73 6f 6e 72 70 63 22 3a 22 |....{"jsonrpc":"|
00001350 32 2e 30 22 2c 22 6d 65 74 68 6f 64 22 3a 22 4f |2.0","method":"O|
00001360 6e 53 61 72 69 66 52 65 73 75 6c 74 22 2c 22 70 |nSarifResult","p|
00001370 61 72 61 6d 73 22 3a 7b 22 72 65 73 75 6c 74 22 |arams":{"result"|
00001380 3a 7b 22 72 75 6c 65 49 64 22 3a 22 43 32 36 36 |:{"ruleId":"C266|
00001390 35 22 2c 22 6c 65 76 65 6c 22 3a 22 65 72 72 6f |5","level":"erro|
000013a0 72 22 2c 22 6d 65 73 73 61 67 65 22 3a 7b 22 74 |r","message":{"t|
where the 2nd Content-Length
does indeed begin immediately after the final }
from the 1st notification (at offset 0x132a if I'm counting bytes correctly).
Note to self: vs-streamjsonrpc documents various message delineation approaches here: https://github.com/microsoft/vs-streamjsonrpc/blob/main/doc/extensibility.md#message-handlers
It appears that MSVC is using the HeaderDelimitedMessageHandler
, which the above link says "is the default handler. It is compatible with the vscode-jsonrpc NPM package."
I'm happy to have an informal videochat with you if it would be helpful, or maybe it's better to keep things within the issue tracker? Not sure.
Maybe you can discuss with the SARIF TC and I can talk to SG15 and then we can schedule a videocall?
I confess I'm unfamiliar with the workflows/relationships of SG15 and the Ecosystem IS, but I can at least talk about all this with the SARIF TC.
SG15 have teleconferences between in-person WG21 meetings, the next should be in the next week or so. The Ecosystem IS is a planned international standard that SG15 is working on that standardises tooling elements of C++ that until now have lived outside the purview of the C++ standard. SARIF output and communication is part of that.
Am I right in thinking that MSVC is effectively reusing the Base Protocol part of the LSP for "framing" the JSON-RPC notifications?
Yeah, as you note in a later message, it's part of the existing vs-streamjsonrpc format.
For your SARIF_OUTPUT_PIPE stuff, are you doing this per invocation of the compiler on a single source file, or per invocation of a build framework on a whole project?
Currently MSBuild creates a pipe per compiler invocation and passes back information to the IDE via existing VS <-> MSBuild communication channels
Adding my very quick comments here summarized from the TC discussion today (caveating that I'm not a heavy user of these features). Also citing to leverage Aditya for CodeQL in particular, as it has more hierarchical data that what I've leveraged historically.
Overall, it feels like the result object already has properties that support similar structure of data (codeflows, graphs, graphtraversals, stacks). The related locations list, to me, is best served being kept simple as a reference material for the select properties that could reference it.
This data looks like you want a graph traversal
through a graph
. A graph
(3.39.1
in sarif 2.1.0) is general purpose described as "a network of nodes and directed edges that describes some aspect of the structure of the code". This seems to fit the purpose quite well to me. Locations within this scheme are defined on nodes in 3.40.4
in sarif 2.1.0, instead of the related locations list. There are various message properties within these structures that can be used to provide context to the consumer when rendering.
Adding my own summary of the 2024-08-08 TC discussion about this:
We brainstormed a bit on the various ways that SARIF could be extended to support nested messages:
Sy's proposal to add a nestingLevel property, and to require ordering of related locations to matter (or maybe just for those with a message
and/or nestingLevel
?)
express the nesting as nesting within the JSON itself. There was wariness about doing this (IIRC "difficulty of checking a recursive schema" was mentioned)
add an optional parent/child relationship between relatedLocations
as a new kind of relationship
add an optional ordering relationship between relatedLocations
as a new kind of relationship
put this in some other SARIF field, e.g. like a stack
.
IIRC the consensus was that we hadn't yet got enough implementation experience with the idea to be able to recommend a particular approach (but I could be misremembering), and I said I'll continue to experiment with doing it from the GCC side.
After the meeting another idea occurred to me: to extend 3.11 message
object so as well as "plain text" and "formatted" messages, support structured messages - somehow.
I hope I'm correctly remembering/summarizing the discusssion.
I'll try to summarize the various unknowns raised by P3358R0 SARIF for Structured Diagnostics here; the wording is my own, though:
See e.g. https://github.com/oasis-tcs/sarif-spec/issues/572#issuecomment-2287059926; there's also been discussion about using the graph
parts of SARIF for handling this.
I've filed this as https://github.com/oasis-tcs/sarif-spec/issues/661
This feels like a superset of #661, but also supposes some form of IPC.
After another look at this, my inclination would currently be to figure out how to t5ransform your relatedLocations reference within the graphtraversal objects. Nodes have locations today: https://docs.oasis-open.org/sarif/sarif/v2.1.0/csprd01/sarif-v2.1.0-csprd01.html#_Toc10541213
Context: in Visual Studio we're working on hierarchical diagnostics to present compiler errors in a more structured way than plain text. Clang is also working on a similar project.
A C++ example:
MSVC outputs an error message like this in plain text:
However, there's a logical structure which isn't represented here, which looks something like:
In order to represent this hierarchy, we want to be able to encode it in SARIF. However, there's currently no facility for this.
The way we do this today is by placing the nested diagnostics in a
relatedLocations
object. To each related location, we add aproperties
property which contains anestingLevel
property. This is essentially the same as thenestingLevel
property ofthreadFlowLocation
objects ([3.38.10]).For example:
I'd like for all C++ compilers to agree on how to encode hierarchical diagnostics, and having it standardised in SARIF seems to be the best way to achieve this.