Scaladoc should allow filtering html tags

adriaanm commented 6 years ago

Currently, we simply pass through any html tags from source to the documentation. @xuwei-k pointed out that this could be unsafe (say, someone puts a <script> tag with malicious content in a scaladoc comment and it slips by PR review).

To make Scaladoc safer by default, I propose we whitelist allowed tags, and warn/error/entity-encode (on) other tags (tbd).

We should also offer a flag to opt out of this behavior.

adriaanm commented 6 years ago

A good start for a whitelist would be: https://meta.stackexchange.com/questions/1777/what-html-tags-are-allowed-on-stack-exchange-sites (though we should add table-related ones).

Also, note that we emit diagrams/highlighted code using the Raw mechanism, so we should make sure not to break those :-)

martijnhoekstra commented 6 years ago

Another inspiration can be the whitelist that MediaWiki uses: https://meta.wikimedia.org/wiki/Help:HTML_in_wikitext

adriaanm commented 6 years ago

One place to put this logic would be here, as all raw html passes through: https://github.com/scala/scala/blob/v2.13.0-M5/src/scaladoc/scala/tools/nsc/doc/html/HtmlTags.scala#L47-L52

EDIT: actually, we should only escape in this usage of Raw: https://github.com/scala/scala/blob/v2.13.0-M5/src/scaladoc/scala/tools/nsc/doc/html/HtmlPage.scala#L122

NthPortal commented 6 years ago

It also might be worth looking at what GitHub allows in GFM

adriaanm commented 6 years ago

Ideally, we'd just disallow html and only use GFM in docs :-)

xuwei-k commented 6 years ago

Currently, we simply pass through any html tags from source to the documentation.

That's not quite correct. Maybe almost escape html tags. see following example.

https://github.com/scala/scala/pull/7183

source:

/**
 * <script>alert('escape')</script>
 *
 * {{{
 *
 * <script>alert('also escape')</script>
 * 
 * // <script>alert('not escape in comment')</script>
 * 
 * "<script>alert('not escape in string literal')</script>"
 *
 * }}}
 */
class A

generated html

xuwei-k commented 6 years ago

related scaladocs

Manifest

Iterable<A> -> Iterable

Regex#replaceAllIn

"July <NUMBER>" -> "July "

xuwei-k commented 6 years ago

https://github.com/scala/scala/blob/d4a0df8d02334f2f29d8946bce5a3e3bc28d7ee9/src/scaladoc/scala/tools/nsc/doc/base/CommentFactoryBase.scala#L120-L159

adriaanm commented 6 years ago

The other problem is that it isn't enough to filter tags (as @szeiger pointed out in our team meeting). Malicious code could also hide in attributes, which are then read by JS code in one of the frameworks Scaladoc relies on (or that are added after the fact). I worry about giving a false sense of security with incomplete solutions, and would suggest that html security cannot (fully) be the responsibility of scaladoc. We should of course implement and facilitate best practices.

ashawley commented 6 years ago

Piggy-backing on Adriaan here, there's no reason to let security concerns outweigh bug-fixing due diligence.

It is technically unsafe to let any tag pass through when publishing api docs. However, Scaladoc is a static HTML generator. There's nothing keeping someone from adding script tags after the docs are generated. Also, there's no reason someone couldn't create a hyperlink to a malicious web site. Granted, security features for Web browsers should be able to mitigate malicious sites through blocking, more than arbitrary malicious code, but it is an example of Adriaan's point about "false sense of security". I understand the real concern that a community project could have <script> tags inserted willy-nilly by a contributor. However, beyond that scenario, Scaladoc isn't a web application per se.

It's more important to emphasize the task of identifying what is actually broken here (e.g. failing to escape tags is clearly a bug) and fix each one carefully rather than feel there's a need to rush a wholesale fix for the sake of security.

szeiger commented 5 years ago

failing to escape tags is clearly a bug

It's hard to tell bugs from features considering that there is no real specification for scaladoc's markup language. Dotty is using CommonMark but I didn't find any documentation about how this interacts with the traditional tag-based syntax that scaladoc inherited from javadoc, either.

I think we need to start with a spec based on CommonMark that either describes the integration of existing syntactic features - if it is possible to make this work in a non-hacky way - or, alternatively, a new syntax for scaladoc's CommonMark extensions. The old syntax can still be supported behind a flag for migration (like Dotty already does).

This won't prevent HTML injection per se but it allows us to tell bugs from features. And since a CommonMark AST provides a clean separation of embedded HTML from other language features it is possible to implement filtering of HTML or even disable it entirely.

ashawley commented 5 years ago

For the record, I didn't mean to be so general when I said "failing to escape tags is clearly a bug". I was less than precise. It was in the context of Kenji Yoshida's original report of code blocks containing XML/HTML tags that were not getting escaped, which is an obvious defect that has already been fixed in https://github.com/scala/scala/pull/7183 (and should be fixed in 2.12?)

SethTisue commented 1 year ago

Does anyone understand the conditions under which HTML is passed though vs getting escaped?

An example of standard library Scaladoc that contains HTML tags is https://www.scala-lang.org/api/2.13.10/scala/concurrent/duration/Duration.html, in which the H2 tags in the original source are in fact rendered

But then I did a little experiment locally:

/**
  * Scaladoc test, can I use embedded HTML?
  * <H1>Here's a big header</H1>
  * <ul>
  *  <li>and a bullet list
  * </ul>
  *
  */
class C

and... well, some of the HTML was rendered, some wasn't:

🤷

numero-744 commented 1 year ago

xuwei-k posted this link

https://github.com/scala/scala/blob/d4a0df8d02334f2f29d8946bce5a3e3bc28d7ee9/src/scaladoc/scala/tools/nsc/doc/base/CommentFactoryBase.scala#L120-L159

@SethTisue It seems here that these html tags are replaced by wiki tags, and are not really kept

Hence h1 is considered to be a title and H1 is chased away

SethTisue commented 1 year ago

LOL, I just did a little test and yeah, that check is case sensitive and thus it matters whether you write h1 or H1 😱

I wonder if Scala 3's Scaladoc tool, which is a full reimplementation (though perhaps with some copy and paste from the old version? not sure), is any less haphazard in this area.

numero-744 commented 1 year ago

So it seems that <script> is always escaped. It is not always a good thing, for instance to add a Wavedrom diagram, which can be done with client-side rendering:

* <script type="WaveDrom">
* { "signal": [
*    {"name": "clk",     "wave": "p........."},
*   {"name": "valid"  , "wave": "0101..01.0"},
*   {"name": "ready"  , "wave": "x1x0.1x1.x"},
*   {"name": "payload", "wave": "x=x=..x==x","data":["D0","D1","D2","D3"]}
* ]}
* </script>

The output would look like this:

<!DOCTYPE html>
<html>
  <head>
    <title>Wavedrom test</title>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/wavedrom/2.6.8/skins/default.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/wavedrom/2.6.8/wavedrom.min.js"></script>
    <script type="text/javascript">
        function init() {
            WaveDrom.ProcessAll();
        }
        window.onload = init;
    </script>
  </head>
  <body>
    <p>Some diagram:</p>
    <script type="WaveDrom">
{ "signal": [
  {"name": "clk",     "wave": "p........."},
  {"name": "valid"  , "wave": "0101..01.0"},
  {"name": "ready"  , "wave": "x1x0.1x1.x"},
  {"name": "payload", "wave": "x=x=..x==x","data":["D0","D1","D2","D3"]}
]}
</script>
  </body>
</html>

SethTisue commented 1 year ago

As I mentioned on Gitter, in that case I'd suggest postprocessing Scaladoc's output to add or change it however you want.

scala / bug