oasis-tcs / sarif-spec

OASIS SARIF TC: Repository for development of the draft standard, where requests for modification should be made via Github Issues
https://github.com/oasis-tcs/sarif-spec
Other
166 stars 47 forks source link

Should we allow markdown in messages? #33

Closed ghost closed 6 years ago

ghost commented 6 years ago

Copied from https://github.com/sarif-standard/sarif-spec/issues/275, created by @lgolding:

... for example, bold, italic, and hyperlinks?

One possibility is to explicitly allow Markdown. Another is to add a property that specifies one of an enumerated list of allowed formats ("Markdown", "HTML"...) Keeping in mind:

ghost commented 6 years ago

Notes from TC con call, 2017/09/27:

Action: specify plaintext + some Markdown variant.

Luke: Is there a way to specify links to locations in the source code? (MikeFan: "publicly defined micro-format to refer to source code location")

michaelcfanning commented 6 years ago

@lgolding, have you identified a safe variant of markdown to propose as a standard? should we distinguish between plaintext vs. markdown properties or is it sufficient to note that any user-facing text may be markdown?

in re: embedded links, markdown supports these already. I propose that we define a new protocol that represents an a SARIF file reference (which can include an associated region/annotated code location). I will open a separate issue on this. I believe that answering the two previous questions will allow us to potentially close on this issue at next TC meeting.

ghost commented 6 years ago

@michaelcfanning I'm leaning towards GitHub-Flavored Markdown (GFM), because

It is no "safer" than any other Markdown variant, since it permits arbitrary HTML. You pointed me at a StackOverflow article that provided a Python-based solution for sanitizing HTML.

IMO we should distinguish between plain-text and Markdown properties. I don't think we should allow Markdown in (for example)

I think it makes sense in

ghost commented 6 years ago

Although GFM itself is not intrinsically safe, the reference implementation of the GFM parser has been designed with security in mind:

To ensure that the rendered Markdown in our website is fully compliant with the CommonMark spec, the new backend implementation for GFM parsing on GitHub is based on cmark, the reference implementation for CommonMark developed by John MacFarlane and many other fantastic contributors. ... The first Markdown parsers in C had a terrible security history: it was feasible to cause stack overflows (and sometimes even arbitrary code execution) simply by nesting particular Markdown elements sufficiently deep. The cmark implementation, just like our earlier parser Sundown, has been designed from scratch to be resistant to these attacks. The parsing algorithms and its AST-based output are thought out to gracefully handle deep recursion and other malicious document formatting.

https://githubengineering.com/a-formal-spec-for-github-markdown/

BUT:

Security

By default, the library will pass through raw HTML and potentially dangerous links (javascript:, vbscript:, data:, file:).

It is recommended that users either disable this potentially unsafe feature by using the option CMARK_OPT_SAFE (or --safe with the command-line program), or run the output through an HTML sanitizer to protect against XSS attacks.

https://github.com/github/cmark

Consumers of Markdown do so at their own risk, and need to take extra precautions:

While GFM supports a wide range of inputs, it’s worth noting that GitHub.com and GitHub Enterprise perform additional post-processing and sanitization after GFM is converted to HTML to ensure security and consistency of the website. [emphasis mine]

https://github.github.com/gfm/#introduction

michaelcfanning commented 6 years ago

I like the idea of limiting markdown to properties where it makes most sense. I wonder whether a code flow location, however, might not benefit from having an embedded link?

I suspect that coding this up would yield more observations/discussion. Generally, of course, we shouldn't accept any spec changes at all unless they've been explored in both production and consumption scenarios.

michaelcfanning commented 6 years ago

btw - additional issue to consider, how does a markdown message related to formatted messages?

lcartey commented 6 years ago

Following up from the last TC meeting.

The proposed changes to 3.10.1 create a distinction between brief messages (which should be plain text) and longer messages (which can be either plain text or markdown formatted):

There are two types of message properties:

  1. Brief messages intended to add information to an object. These messages SHALL be written in plain text and SHOULD conform to the guidelines for plain text messages (§3.10.2).
  2. Longer messages intended to provide a full explanation of a scan result or an analysis rule. These messages MAY be written in plain text, in which case they SHOULD conform to the guidelines for plain text messages, or they MAY be written in GitHub-Flavored Markdown (GFM) [GFM], in which case they SHOULD conform to the guidelines for messages with formatting (§3.10.3). Messages with formatting SHALL NOT be written in any other markup language.

However, as a direct producer our primary motivation for using markdown is to include inline links in our "brief" existing messages. I think the spec should be adjusted to avoid conflating brevity with formatting choice.

It seems unlikely that all consumers of SARIF will be able to use markdown in their "output" (whatever that may be). If a result only has a markdown formatted message, will the SARIF consumer need to explicitly convert that to plain text (if it only supports plain text output)? Or will producers always need to provide a plain-text alternative along side the formatted version?

fishoak commented 6 years ago

This is my submission I promised after yesterday's meeting.

Let me describe what we do by way of an example. This should clarify what I said at the meeting, and I hope make the case that some form of formatting would be very helpful.

The attached image shows a buffer overrun that CodeSonar found as rendered in a web client: kanbneffdpobgjkk

Let's start at the bit at the end that is labelled "Buffer Overrun". This is the primary warning location (we informally call this the endbox), and would correspond to the message property of a result object in Sarif. The other events (such as the two on line 2217) would correspond to elements of the locations property.

[@fishoak - to clarify: those other two events would correspond to elements of the result.codeFlows[i].locations array -- that this, they refer to locations within a code flow -- and not to elements of the result.locations array. Per the spec: "The locations array shall not contain more than one element unless the condition indicated by the result, if any, can only be corrected by making a change at every location specified in the array." -- @lgolding]

As you can see, the message there is fairly rich. It has a code font, paragraphs, and a bullet list. It also has references to other code locations, and links to other events. The formatting is not just eye candy; it's linked to the code. So if you hover over one of those tokens that are in the code font (e.g., newloc), this causes other locations in the code and in the messages associated with other locations to get highlighted, and there is a menu that gives you a bit of navigation functionality. Hovering over another event identifier will also highlight that event (or you can just click on it).

The analysis itself doesn't actually create this message. It generates an abstract representation expressed as ReML (this is the language I mentioned on the call; it stands for Result Markup Language). It actually isn't important to know what is in there for this discussion, but a snippet below gives you a flavor of what it is:

<msg_buf_overrun frames="4" nodes="39" stack_depth="2" collapsed_calls="8" control_points="1" dominating_control_points="1" visible_frames="2" visible_nodes="11" visible_stack_depth="1" visible_collapsed_calls="2" visible_control_points="1" visible_dominating_control_points="1" events="9" assignment_events="8" precond_events="0" pnodes="522" relevant_pnodes="120" indirect_calls="0" taintedness="0" rule="SATISFIED" dp_refinement_exact="CS_DISABLED" dp_refinement_approximate="CS_TIMEOUT" endpt_libmodel="strcpy" storage_class="heap" is_write="1" written_value_is_constant="0" is_total="1">
  <msg_buffer_ptr>
    <msg_value disposition="heap_address" cyclic="0">
      <msg_dev a="1" b="0" c="1">
        <msg_ap_param expr="newloc" number="1"/>
      </msg_dev>
        <msg_gen_eval temp="1">
          <msg_dev a="1" b="0" c="1">
            <msg_ap_add><msg_ap_return_summary procedure="malloc" expr="malloc(strlen(s))"/>
              <msg_ap_number value="1"/>
            </msg_ap_add>
         </msg_dev><link msg="m5">lexpgn.c:2217</link>
      </msg_gen_eval>

The prose then gets created by code that takes this term as input. This gets done late because the end user might want to see the results in a different natural language. For example: efpecichdkmcnick

What is more relevant however is that for automation purposes it can be retrieved as text or XML. Here's a tiny part of the XML:

<msg msg_id="10">This code writes past the end of the buffer pointed to by <code>newloc</code>.
  <ul>
    <li><code>newloc</code> evaluates to <code>malloc(strlen(s)) + 1</code>
      <subscript><link msg="m5">lexpgn.c:2217</link></subscript>.
    </li>

The XML schema admits only very limited markup. Here's the grammar:

text_markup = (code | plain | placeholder | link | text | explanation | on_path_legend | subscript | msg_static_warning_text | msg_pmd_static_warning_text)

ul = element ul { (li | text)* }
li = element li { (ul | text_markup)* }
code = element code { text_markup* }
plain = element plain { text_markup* }
placeholder = element placeholder { text_markup* }
paragraph = element paragraph { text_markup* }
explanation = element explanation { text_markup* }
on_path_legend = element on_path_legend { text_markup* }
subscript = element subscript { text_markup* }
msg_static_warning_text = element msg_static_warning_text {
    attribute wclass{text}
}

Of these, some are not relevant for this discussion, but you can see the formatting directives in there. If I were to rank those in importance I would say:

I concede you can fake paragraph breaks and bulleted lists with plaintext, but then you have a responsibility in the spec to state that newlines should be respected when the message is rendered.

Note that I haven't said anything about links, but that is only because the feeling of the group seems to be that they should be supported. I consider them more essential than markup. (I think that having links to the other location objects in the same result object is important, but I concede I haven't looked closely enough to see if that can be done, and it's orthogonal to formatting, so it's a discussion for another time.)

katrinaoneil commented 6 years ago

And here is what Foritfy does with respect to formatting of messages. Below is a screenshot of the issue description excerpt displayed in our issue auditing tool.

awb-screenshot

The XML behind this description looks like this (sorry for the tiny font, but if you click on it, it becomes bigger):

xml-screenshot

Our descriptions support a list of standard HTML tags, including bold/italic/underline, code, lists, links and a few others. However, we also have a list of custom tags that are used for substituting dynamic values. For example, <Replace key="PrimaryLocation.line"/> gets substituted with the actual line number where the issue is reported, while <Replace key="EnclosingFunction.name"/> gets substituted with the name of the function that contains the reported issue.

The same description can be seen in the browser: https://vulncat.hpefod.com/en/detail?id=desc.dataflow.java.xpath_injection#C%23%2FVB.NET%2FASP.NET. Note that the abstract that is displayed online is generic and is taken from the <AltParagraph/> tag.

ghost commented 6 years ago

Fixed in c443e9a.