oasis-tcs / sarif-spec

OASIS SARIF TC: Repository for development of the draft standard, where requests for modification should be made via Github Issues
https://github.com/oasis-tcs/sarif-spec
Other
162 stars 46 forks source link

GCC now supports SARIF as an output format #531

Open davidmalcolm opened 2 years ago

davidmalcolm commented 2 years ago

I'm not sure how best to contact the SARIF community (the "Ask a Question" link takes me to this issue tracker), but here's a heads-up that I've implemented SARIF output support for GCC trunk, for GCC 13 onwards (with this 3000 line patch): https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596138.html

As noted above, my generated .sarif files seem to pass the online validator, are viewable via the online React-based viewer, and seem to work OK in the VS Code extension for SARIF. I hope I'm correctly implementing everything.

You can see it running live on Compiler Explorer; here's an example of it emitting SARIF reporting a path-sensitive double-free: https://godbolt.org/z/nYWMM8Wx7

FWIW I'm now experimenting with GCC accepting SARIF as input (i.e. as a consumer), so that it can "replay" diagnostics serialized in SARIF form.

dmk42 commented 2 years ago

Awesome, @davidmalcolm ! Thanks for doing this and for letting us know about it!

davidmalcolm commented 2 years ago

FWIW I've now posted an initial set of patches that let GCC consume SARIF, replaying a .sarif file: https://gcc.gnu.org/pipermail/gcc-patches/2022-June/597051.html

(very much just a prototype for now, though)

davidmalcolm commented 2 years ago

The GCC RFE for tracking being able to replay SARIF is: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96032

michaelcfanning commented 1 year ago

@davidmalcolm, this really great! if you'd like, we'd be very glad to take a detailed look at your SARIF and see if we have any suggestions to refine.

i'm very glad to hear that some of the assets we put out on the web were useful crafting this content.

I have looked at the Compiler Explorer example, is that definitive/comprehensive? Do you have other substantive examples available?

[Looks like you are utilizing some very interesting SARIF domains, taxa, logical locations, etc.]

davidmalcolm commented 1 year ago

Example of GCC SARIF output showing LTO report of a cross-TU type-mismatch in a use of a variadic API: https://gist.github.com/davidmalcolm/26a546a0cd3ed253648e24510bd17964 (without reformatting, so it's all on one line)

davidmalcolm commented 1 year ago

same example (of LTO), reformatted for ease of reading: https://gist.github.com/davidmalcolm/b0fa6b14909315f45f6f5fe16f623c0f

Note how this example spans two different source files; both have their full content quoted in the "artifacts" section (my producer code adds an artifact with content for any files referenced by any results mentioned in the file)

For reference, the regular (non-SARIF) GCC output for this is:

../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-a.c: In function ‘called_by_test_type_mismatch_1’:
../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-a.c:19:7: warning: ‘va_arg’ expected ‘const char *’ but received ‘int’ for variadic argument 1 of ‘ap’ [CWE-686] [-Wanalyzer-va-arg-type-mismatch]
   19 |   str = va_arg (ap, const char *); /* { dg-warning "'va_arg' expected '\[^\n\r\]*' but received 'int' for variadic argument 1 of 'ap'" } */
      |       ^
  ‘test_type_mismatch_1’: events 1-2
    |
    |../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-b.c:3:6:
    |    3 | void test_type_mismatch_1 (void)
    |      |      ^
    |      |      |
    |      |      (1) entry to ‘test_type_mismatch_1’
    |    4 | {
    |    5 |   called_by_test_type_mismatch_1 (42, 1066);
    |      |   ~   
    |      |   |
    |      |   (2) calling ‘called_by_test_type_mismatch_1’ from ‘test_type_mismatch_1’ with 1 variadic argument
    |
    +--> ‘called_by_test_type_mismatch_1’: events 3-4
           |
           |../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-a.c:12:1:
           |   12 | called_by_test_type_mismatch_1 (int placeholder, ...)
           |      | ^
           |      | |
           |      | (3) entry to ‘called_by_test_type_mismatch_1’
           |......
           |   19 |   str = va_arg (ap, const char *); /* { dg-warning "'va_arg' expected '\[^\n\r\]*' but received 'int' for variadic argument 1 of 'ap'" } */
           |      |       ~
           |      |       |
           |      |       (4) ‘va_arg’ expected ‘const char *’ but received ‘int’ for variadic argument 1 of ‘ap’
           |
davidmalcolm commented 1 year ago

Another example (reformatted so it's not all on one line, for ease of reading by humans): https://gist.github.com/davidmalcolm/5771af86039bf57f144e935179bfe20e This is a reproducer for detecting CVE-2011-2210, which was a flaw in the Linux kernel.

For reference, the non-SARIF GCC output for this one looks like:

../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c: In function ‘sys_osf_getsysinfo’:
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:58:9: warning: state: ‘tainted’
   58 |         __analyzer_dump_state ("taint", nbytes);  /* { dg-warning "tainted" } */
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:67:17: warning: state: ‘has_lb’
   67 |                 __analyzer_dump_state ("taint", nbytes);  /* { dg-warning "has_lb" } */
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:69:21: warning: use of attacker-controlled value ‘nbytes’ as size without upper-bounds checking [CWE-129] [-Wanalyzer-tainted-size]
   69 |                 if (copy_to_user(buffer, hwrpb, nbytes) != 0) /* { dg-warning "use of attacker-controlled value 'nbytes' as size without upper-bounds checking" } */
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ‘sys_osf_getsysinfo’: event 1
    |
    |   35 |         long sys##name(__SC_DECL##x(__VA_ARGS__))
    |      |              ^~~
    |      |              |
    |      |              (1) function ‘sys_osf_getsysinfo’ marked with ‘__attribute__((tainted_args))’
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:30:9: note: in expansion of macro ‘__SYSCALL_DEFINEx’
    |   30 |         __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
    |      |         ^~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:37:36: note: in expansion of macro ‘SYSCALL_DEFINEx’
    |   37 | #define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
    |      |                                    ^~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:53:1: note: in expansion of macro ‘SYSCALL_DEFINE5’
    |   53 | SYSCALL_DEFINE5(osf_getsysinfo, unsigned long, op, void __user *, buffer,
    |      | ^~~~~~~~~~~~~~~
    |
    +--> ‘sys_osf_getsysinfo’: event 2
           |
           |   35 |         long sys##name(__SC_DECL##x(__VA_ARGS__))
           |      |              ^~~
           |      |              |
           |      |              (2) entry to ‘sys_osf_getsysinfo’
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:30:9: note: in expansion of macro ‘__SYSCALL_DEFINEx’
           |   30 |         __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
           |      |         ^~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:37:36: note: in expansion of macro ‘SYSCALL_DEFINEx’
           |   37 | #define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
           |      |                                    ^~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:53:1: note: in expansion of macro ‘SYSCALL_DEFINE5’
           |   53 | SYSCALL_DEFINE5(osf_getsysinfo, unsigned long, op, void __user *, buffer,
           |      | ^~~~~~~~~~~~~~~
           |
         ‘sys_osf_getsysinfo’: events 3-6
           |
           |   64 |                 if (nbytes < sizeof(*hwrpb))
           |      |                    ^
           |      |                    |
           |      |                    (3) ‘nbytes’ has its lower bound checked here
           |      |                    (4) following ‘false’ branch (when ‘nbytes > 31’)...
           |......
           |   67 |                 __analyzer_dump_state ("taint", nbytes);  /* { dg-warning "has_lb" } */
           |      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           |      |                 |
           |      |                 (5) ...to here
           |   68 | 
           |   69 |                 if (copy_to_user(buffer, hwrpb, nbytes) != 0) /* { dg-warning "use of attacker-controlled value 'nbytes' as size without upper-bounds checking" } */
           |      |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           |      |                     |
           |      |                     (6) use of attacker-controlled value ‘nbytes’ as size without upper-bounds checking
           |
In file included from ../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:14:
../../src/gcc/testsuite/gcc.dg/analyzer/test-uaccess.h:13:13: note: parameter 3 of ‘copy_to_user’ marked as a size via attribute ‘access (write_only, 1, 3)’
   13 | extern long copy_to_user(void __user *to, const void *from, unsigned long n)
      |             ^~~~~~~~~~~~
davidmalcolm commented 1 year ago

Let me know if there are other examples you'd like to see.

davidmalcolm commented 1 year ago

Note that I'm not expressing the macro expansion information in the SARIF output (is there a way to do that?)

davidmalcolm commented 1 year ago

Also: I've tried to express metadata about events via threadFlowLocation "kinds" property values, but I have some kinds of events that don't fit well with the existing examples; see https://github.com/oasis-tcs/sarif-spec/issues/530 for more info.

michaelcfanning commented 1 year ago

Hello! Took a look and there's lots to talk about. This work is really great and we'd love to help keep building it out. Here some starter thoughts:

Right now, a couple of people are looking at your log files in various viewers and I can provide them back to you with some updates (though you may be able to just glean what you need from the above).

Looking forward to continuing the discussion!

davidmalcolm commented 1 year ago

Sorry for the delay in getting back to you, and thanks for taking the time to look at it.

Addressing one specific point:

* Can you explain your question around macro expansion? SARIF defines a `originalUriBaseId` property for defining absolute paths for non-deterministic source roots, which you're already populating. Would you like other macro/var definitions to be expressed in the format?

...when I spoke of macro expansion, I was referring to languages with a preprocessor, such as C/C++, where the question of "where in the source-code-under-analysis are we?" can involve a nested series of macro expansions, potentially involving multiple files (e.g. use of a macro declared in one header, which refers to a macro in another header, etc).

Consider e.g.:


#include <stdlib.h>

#define FREE(X) free(X)
#define REALLY_FREE(X) FREE(X)
#define MAYBE_FREE(X,F) do { if (F) REALLY_FREE(X); } while (0)

void test (void *p, int flag)
{
  MAYBE_FREE(p, flag);
  MAYBE_FREE(p, flag);
}

GCC output: https://godbolt.org/z/87vf1cGKK

where GCC's textual output can emit the chain of macro expansions:

<source>: In function 'test':
<source>:3:17: warning: double-'free' of 'p' [CWE-415] [-Wanalyzer-double-free]
    3 | #define FREE(X) free(X)
      |                 ^~~~~~~
<source>:4:24: note: in expansion of macro 'FREE'
    4 | #define REALLY_FREE(X) FREE(X)
      |                        ^~~~
<source>:5:37: note: in expansion of macro 'REALLY_FREE'
    5 | #define MAYBE_FREE(X,F) do { if (F) REALLY_FREE(X); } while (0)
      |                                     ^~~~~~~~~~~
<source>:10:3: note: in expansion of macro 'MAYBE_FREE'
   10 |   MAYBE_FREE(p, flag);
      |   ^~~~~~~~~~

There didn't seem to be a way to express this within SARIF. Is there one, or did I miss it? Thanks!

davidmalcolm commented 1 year ago

clang's textual output can express similar macro expansion information, such as: https://godbolt.org/z/4ab3EEs3T

<source>:9:3: error: call to undeclared function 'misspelled_free'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  MAYBE_FREE(p, flag);
  ^
<source>:5:37: note: expanded from macro 'MAYBE_FREE'
#define MAYBE_FREE(X,F) do { if (F) REALLY_FREE(X); } while (0)
                                    ^
<source>:4:24: note: expanded from macro 'REALLY_FREE'
#define REALLY_FREE(X) FREE(X)
                       ^
<source>:3:17: note: expanded from macro 'FREE'
#define FREE(X) misspelled_free(X)
                ^

...though it emits the expansion in the opposite order to GCC.

davidmalcolm commented 1 year ago

FWIW I've extended the GCC SARIF output support so that it now captures crashes of GCC: https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614032.html treating them as an "error"-level "notification" relating to the operation of the tool (SARIF v2.1.0 section 3.58), rather than a "result" about the code being analyzed by the tool.

davidmalcolm commented 1 year ago

Hello! Took a look and there's lots to talk about. This work is really great and we'd love to help keep building it out. Here some starter thoughts:

* If possible, it'd be best if you could emit your "version" property at the beginning of the file, not the end. The SARIF spec [recommends this](https://docs.oasis-open.org/sarif/sarif/v2.1.0/os/sarif-v2.1.0-os.html#_Toc34317480).

FWIW it turned out that my rather naive JSON code was traversing keys in objects in non-deterministic order when serializing, so that the ordering would vary from run to run.

I've fixed that for GCC 13 here: https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=7f1e15f743357e037d7c4f6f6000863c26f3dfc3

Should the "version" be the very first thing, or should the "$schema" come first?

davidmalcolm commented 1 year ago

FWIW I've fixed a bug in GCC's SARIF output where it was naively assuming that source code was UTF-8, leading to invalid UTF-8 in the .sarif files when quoting artifacts (either fully, or via snippets). I've fixed this for GCC 13 in https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614600.html which has some "fun" test cases e.g. for reporting invalidly encoded artifacts via a validly encoded .sarif log file.

sthagen commented 1 year ago

FWIW it turned out that my rather naive JSON code was traversing keys in objects in non-deterministic order when serializing, so that the ordering would vary from run to run.

I've fixed that for GCC 13 here: https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=7f1e15f743357e037d7c4f6f6000863c26f3dfc3

Should the "version" be the very first thing, or should the "$schema" come first?

Well, JSON is catering many host languages. Most of them do still not understand how to implement stable insert order and maintain that or are insisting on not doing it ...

Our toy-like languages Python (dicts) and JavaScript (for string keys in objects) offer insert ordering these days.

But, if there is a JSON library chosen at the consumer side, that libraries authors may have decided to garble the order of keys as they will still comply to the JSON spec.

Long story short, this may be an optimization which is not as important as enforcing correct character encoding.

Nice work, thanks a lot for providing your resources to create the tool. Much appreciated.

PS: In CSAF v2.0 we therefore provided the schema "properties" in sorted order as this should be trivial for insert order preserving producers and consumers and is trivial to re-establish per sorting in case one link in the processing chain messed with the order.

davidmalcolm commented 1 year ago

Should the "version" be the very first thing, or should the "$schema" come first? I've filed https://github.com/oasis-tcs/sarif-spec/issues/571 about this

davidmalcolm commented 11 months ago

I've now added support to gcc trunk (for gcc 14) for capturing timing/profile information about the compile/analysis in SARIF form (via a custom property in the property bag on the invocation object).

davidmalcolm commented 9 months ago

I've now add support to GCC trunk (for GCC 14) for capturing multithreaded code flows in GCC diagnostics, including in SARIF output (although nothing in GCC currently makes use of this).

davidmalcolm commented 3 months ago

Also changed in the upcoming GCC 14:

davidmalcolm commented 2 months ago

FWIW I mentioned SARIF in the GCC 14 release notes in a few places: https://gcc.gnu.org/gcc-14/changes.html#sarif

davidmalcolm commented 1 month ago

I've added the artifact.roles property to GCC trunk for GCC 15: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653536.html

davidmalcolm commented 2 weeks ago

I've created a page on the GCC wiki to track our SARIF support: https://gcc.gnu.org/wiki/SARIF

I've also updated my test suites to validate the generated .sarif files against the schema, and fixed some bugs where we could generate invalid .sarif output; see the above link for details.