Specify optional property file.sourceLanguage to guide syntax-driven colorization of snippets

michaelcfanning commented 5 years ago

Our snippets are file contents. Some tools produce formatted snippets. For example, they create a gutter in the left margin that shows line numbers. They provide syntax coloring for the snippet.

SARIF has no way for a tools producer to create these formatted snippets. Instead, the consumer is required to examine the textual snippet and do something sensible with it. We have done this internally in our pipeline with some success.

Currently, we are looking at converting a tool's output that produces a formatted snippet. It occurred to me that we could define another property on fileContent that would be designed to store a formatted version of its contents. I suppose this would be called fileContent.richText in order to follow other places in the format where we allow for markdown-rendered text.

michaelcfanning commented 5 years ago

Per offline discussion, there's a question around whether GFM allows for colorization of snippets. Note that the markdown embedded in this note uses HTML to provide colorization. This displays in VS code but not here.

Item	Status	Notes
Baseline existing source	Done

Here's a CSS snippet using a markdown language hint. Note that it is colorized.

#button {     border: none; }

ghost commented 5 years ago

Two thoughts:

With a plain-text snippet, the location of the snippet together with the location of the result allows a SARIF consumer (such as a Web UI) easily to display the snippet and indicate the position of the result, for example:
```
int x = 0;
int z = y / x;
            ^ divide by zero
```

But if you allow a "formatted snippet" containing Markdown, it will be difficult for the consumer to locate the actual character position within the marked up source code where the problem occurred:

123    <span style="color:blue">int</span> x = 0;<br/>
124    <span style="color:blue">int</span> z = y / x;
                ^ divide by zero

Relying on the GFM "triple-backtick + language" syntax is a nice idea. But then we don't need a formatted snippet at all -- just an indication of the source code's language. A new property file.sourceLanguage might serve (together with a run-level property defaultSourceLanguage).

But!

That doesn't give you the line numbers.
We'll have a huge discussion about what languages to include in this sourceLanguage property's enumeration, or whether to leave it open (and then leave it to consumers to decide if "C" is the same as "c", or "C++" the same as "cplusplus").

I suggest this feature might not be worth the trouble.

michaelcfanning commented 5 years ago

I independently reached some of your points but not your final conclusion. :) First, great point, plaintext snippets are inherently valuable. I'd independently reached the same conclusion as you: what we need here is simply a language designation.

I do think this feature is worth the trouble, mostly due to observing our web development team struggle with the problem of how to handle code snippets in our browser-based results explorer. For this reason, I'm willing to invest time in discussion. :)

We should favor the approach that @fishoak suggested in TC, an open-ended enum populated with some preliminary values. When defining new language values, we can suggest that producers stick with certain conventions:

1) use an entirely lower-case name 2) use a hierarchical string if it's helpful to specify a particular language variant or implementation (e.g., 'sql/tsql 3) prefer spelling out (or explicitly advise against it) things like plusplus/++ or sharp/# (we should make a calll on this). 4) allow for some level of duplication to emerge around use of standard abbreviations (or explicitly advice against). a consumer interested in vb, for example, might reasonably look for either 'vb' or 'visualbasic'

With the guidance above, if a tool producer happens to decide to author a static analysis capability for the 'Racket' or 'R++' languages, the conventions we provide allow for predictable enum values of 'racket' and 'rplusplus' respectively.

Here's a list of languages/formats that are extremely current in 'most utilized' data. Note that not all of these have significant static analysis tooling eco-systems. I've explicitly avoided adding languages that are not currently highly utilized for which a language value can easily be predicted ('fortran', 'basic', 'lisp', 'vbscript', etc.)

javascript java python ruby c++ or cplusplus csharp or c# c css go html objectivec scala swift typescript php sql, [sql/tsql, sql/psql] powershell shell/unix [shell/bash, shell/csh, shell/tcsh, shell/ksh, shell/sh] plaintext markdown [markdown/gfm]

ghost commented 5 years ago

I suggest "spelling out" the language names. That will allow SDKs that wish to do so to use them as identifiers (for example, enum values).

I'm surprised not to see "html" on the language list. There's lots of static analysis around it. We should add it. (html, html/5?)

With those nits out of the way, here what I think we agree on:

We do not add region.formattedSnippet.
We add file.sourceLanguage and run.defaultSourceLanguage with values as you suggest, with my modifications.
We do not make any explicit provision for line numbers. The consumer knows the line numbers from the region properties and can display them if it wants to.

That's enough for me to write the change draft.

michaelcfanning commented 5 years ago

Yes to all three of your points. I meant to note the same thing as your #3, all line data can be recovered from the region properties that are strongly associated with the snippet.
Yes, 'html' was an oversight and 'xml' probably should be on the list as well.
I don't think we need to provide for language version. The reason is that display for these snippets will mostly entail syntax coloring (not actual interpretation, compilation, etc.). i think most languages evolve in a way where the superset of keywords for most current language will allow a decent job. might not be perfect but sufficient for this scenario,

michaelcfanning commented 5 years ago

btw - can't resist, isn't it true that the plain text snippet example you provided could be constructed entirely from other SARIF constructs, suggesting that you shouldn't be doing this? for the same reasons that you shouldn't be injecting line numbers? 'divide by zero' here would presumably derive from the rule id. the spces and ^ location would derive from region data.

also, note that a viewer might usefully provide syntax coloring for this snippet, while still finding value in using the ^ notation to call out the specific location. so, with our new proposal what you'd do is specify the language, provide only the snippet of source, and then reconstruct the ^ + other formatting at runtime.

basically, all snippets are just that, snippets of a file. don't inject anything else into them. what this means is that if a tool itself has some formatted notion of a result, SARIF doesn't provide a place to flow this formatted content along.

int x = 0;
int z = y / x;
            ^ divide by zero

ghost commented 5 years ago

Sorry, I didn't explain my example clearly enough.

In my example, this is the snippet:

int x = 0;
int z = y / x;

Because it's plain text, and because the consumer has the region location, snippet location, and result message properties available, the consumer can easily render the snippet like this:

int x = 0;
int z = y / x;
            ^ divide by zero

michaelcfanning commented 5 years ago

Yes, completely agree, sorry for my confusion.

michaelcfanning commented 5 years ago

Open-ended enum
Describe principles for populating member with examples of non-trademarked language names
Add a non-normative appendix that provides a more exhaustive list

katrinaoneil commented 5 years ago

we also support the following languages:

ABAP ActionScript Apex COBOL ColdFusion

Also, would you consider something like JSP a separate "language" for the purposes of snippet colorization?

ghost commented 5 years ago

@katrinaoneil Yes. Similarly, the ASP.NET MVC "Razor" syntax.

kupsch commented 5 years ago

Attached is a spreadsheet of some research I did on programming languages and identifiers in use for them.

I took 12 sources of programming languages lists, including 3 that rank the top 50 or 100, editors, javascript syntax highlighters, lines of code counters, and giant lists of languages. The spreadsheet is sorted by the 3 ranking sites and then the number of references to the language in other sites. It also includes the id's that the references use for the language (if present) and the display name. I think that we would use commonly used id in SARIF to avoid an impedence mismatch with existing usage. the id, langName, and v_globs columns might be worth including in the appendix.

The list is quite long, but can be reasonably cut off when there are only two non-ranking site listings.

The column descriptions are below:

Tiobe2018-rank - rank in Tiobe 2018 index (https://www.tiobe.com/tiobe-index/)
ieee2018-rank - rank in IEEE 2018 index (https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages)
githutinfo2014-rank - rank in githut 2014 index (https://githut.info)
highlightjs - present in (https://highlightjs.org)
vim81 - present in vim 8.1
cloc - present in cloc (https://cloc.sourceforge.net)
loc - present in loc (https://github.com/cgag/loc)
tokei - present in tokei (https://github.com/Aaronepower/tokei)
prismjs2018 - present in PrismJs (https://prismjs.com/)
rosettacode - present in Rosetta Code (http://www.rosettacode.org/wiki/Category:Programming_Languages)
wikipedia - present in Wikipedia (https://en.wikipedia.org/wiki/List_of_programming_languages)
Visualstudio - present in Visual Studio (https://code.visualstudio.com/docs/languages/identifiers)
count - number of non-ranking sources listing language
id - id for the language
langName - Display name for the language
v_globs - glob patterns for files (from vim)
h_display - highlightjs's display name
h_id - highlightjs's primary id
v_id - vim's primary id
h_aliases - highlightjs's id aliases
v_aliases - visual studio's id aliases
p_aliases - PrismJS's id aliases
c_fileExt - cloc's file extensions for language
h_desc - description from highlightjs's
v_desc - description from vim
w_desc - description from wikipedia

languages.xlsx

kupsch commented 5 years ago

Here is new revision of the .xlsx and the .csv file (named .csv.txt to make GitHub happy). The .xlsx file is now correctly displays the couple of non-ascii characters (Excel's default is not uft-8). It also contains minor corrections. The column descriptions are below:

Tiobe2018-rank - rank in Tiobe 2018 index (https://www.tiobe.com/tiobe-index/)
ieee2018-rank - rank in IEEE 2018 index (https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages)
githutinfo2014-rank - rank in githut 2014 index (https://githut.info)
highlightjs - present in (https://highlightjs.org)
vim81 - present in vim 8.1
cloc - present in cloc (https://cloc.sourceforge.net)
loc - present in loc (https://github.com/cgag/loc)
tokei - present in tokei (https://github.com/Aaronepower/tokei)
prismjs2018 - present in PrismJs (https://prismjs.com/)
rosettacode - present in Rosetta Code (http://www.rosettacode.org/wiki/Category:Programming_Languages)
wikipedia - present in Wikipedia (https://en.wikipedia.org/wiki/List_of_programming_languages)
Visualstudio - present in Visual Studio (https://code.visualstudio.com/docs/languages/identifiers)
count - number of non-ranking sources listing language
id - id for the language
langName - Display name for the language
v_globs - glob patterns for files (from vim)
h_display - highlightjs's display name
h_id - highlightjs's primary id
v_id - vim's primary id
p_id - PrismJs's primary id
h_aliases - highlightjs's id aliases
v_aliases - visual studio's id aliases
p_aliases - PrismJS's id aliases
c_fileExt - cloc's file extensions for language
h_desc - description from highlightjs's
v_desc - description from vim
w_desc - description from wikipedia

languages.xlsx

languages.csv.txt

oasis-tcs / sarif-spec

Specify optional property file.sourceLanguage to guide syntax-driven colorization of snippets #286