Open adriaanm opened 6 years ago
A good start for a whitelist would be: https://meta.stackexchange.com/questions/1777/what-html-tags-are-allowed-on-stack-exchange-sites (though we should add table-related ones).
Also, note that we emit diagrams/highlighted code using the Raw mechanism, so we should make sure not to break those :-)
Another inspiration can be the whitelist that MediaWiki uses: https://meta.wikimedia.org/wiki/Help:HTML_in_wikitext
One place to put this logic would be here, as all raw html passes through: https://github.com/scala/scala/blob/v2.13.0-M5/src/scaladoc/scala/tools/nsc/doc/html/HtmlTags.scala#L47-L52
EDIT: actually, we should only escape in this usage of Raw
: https://github.com/scala/scala/blob/v2.13.0-M5/src/scaladoc/scala/tools/nsc/doc/html/HtmlPage.scala#L122
It also might be worth looking at what GitHub allows in GFM
Ideally, we'd just disallow html and only use GFM in docs :-)
Currently, we simply pass through any html tags from source to the documentation.
That's not quite correct. Maybe almost escape html tags. see following example.
https://github.com/scala/scala/pull/7183
/**
* <script>alert('escape')</script>
*
* {{{
*
* <script>alert('also escape')</script>
*
* // <script>alert('not escape in comment')</script>
*
* "<script>alert('not escape in string literal')</script>"
*
* }}}
*/
class A
related scaladocs
Iterable<A>
-> Iterable
"July <NUMBER>"
-> "July "
The other problem is that it isn't enough to filter tags (as @szeiger pointed out in our team meeting). Malicious code could also hide in attributes, which are then read by JS code in one of the frameworks Scaladoc relies on (or that are added after the fact). I worry about giving a false sense of security with incomplete solutions, and would suggest that html security cannot (fully) be the responsibility of scaladoc. We should of course implement and facilitate best practices.
Piggy-backing on Adriaan here, there's no reason to let security concerns outweigh bug-fixing due diligence.
It is technically unsafe to let any tag pass through when publishing api docs. However, Scaladoc is a static HTML generator. There's nothing keeping someone from adding script tags after the docs are generated. Also, there's no reason someone couldn't create a hyperlink to a malicious web site. Granted, security features for Web browsers should be able to mitigate malicious sites through blocking, more than arbitrary malicious code, but it is an example of Adriaan's point about "false sense of security". I understand the real concern that a community project could have <script>
tags inserted willy-nilly by a contributor. However, beyond that scenario, Scaladoc isn't a web application per se.
It's more important to emphasize the task of identifying what is actually broken here (e.g. failing to escape tags is clearly a bug) and fix each one carefully rather than feel there's a need to rush a wholesale fix for the sake of security.
failing to escape tags is clearly a bug
It's hard to tell bugs from features considering that there is no real specification for scaladoc's markup language. Dotty is using CommonMark but I didn't find any documentation about how this interacts with the traditional tag-based syntax that scaladoc inherited from javadoc, either.
I think we need to start with a spec based on CommonMark that either describes the integration of existing syntactic features - if it is possible to make this work in a non-hacky way - or, alternatively, a new syntax for scaladoc's CommonMark extensions. The old syntax can still be supported behind a flag for migration (like Dotty already does).
This won't prevent HTML injection per se but it allows us to tell bugs from features. And since a CommonMark AST provides a clean separation of embedded HTML from other language features it is possible to implement filtering of HTML or even disable it entirely.
For the record, I didn't mean to be so general when I said "failing to escape tags is clearly a bug". I was less than precise. It was in the context of Kenji Yoshida's original report of code blocks containing XML/HTML tags that were not getting escaped, which is an obvious defect that has already been fixed in https://github.com/scala/scala/pull/7183 (and should be fixed in 2.12?)
Does anyone understand the conditions under which HTML is passed though vs getting escaped?
An example of standard library Scaladoc that contains HTML tags is https://www.scala-lang.org/api/2.13.10/scala/concurrent/duration/Duration.html, in which the H2 tags in the original source are in fact rendered
But then I did a little experiment locally:
/**
* Scaladoc test, can I use embedded HTML?
* <H1>Here's a big header</H1>
* <ul>
* <li>and a bullet list
* </ul>
*
*/
class C
and... well, some of the HTML was rendered, some wasn't:
🤷
xuwei-k posted this link
@SethTisue It seems here that these html tags are replaced by wiki tags, and are not really kept
Hence h1
is considered to be a title and H1
is chased away
LOL, I just did a little test and yeah, that check is case sensitive and thus it matters whether you write h1
or H1
😱
I wonder if Scala 3's Scaladoc tool, which is a full reimplementation (though perhaps with some copy and paste from the old version? not sure), is any less haphazard in this area.
So it seems that <script>
is always escaped. It is not always a good thing, for instance to add a Wavedrom diagram, which can be done with client-side rendering:
* <script type="WaveDrom">
* { "signal": [
* {"name": "clk", "wave": "p........."},
* {"name": "valid" , "wave": "0101..01.0"},
* {"name": "ready" , "wave": "x1x0.1x1.x"},
* {"name": "payload", "wave": "x=x=..x==x","data":["D0","D1","D2","D3"]}
* ]}
* </script>
The output would look like this:
<!DOCTYPE html>
<html>
<head>
<title>Wavedrom test</title>
<script src="https://cdnjs.cloudflare.com/ajax/libs/wavedrom/2.6.8/skins/default.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/wavedrom/2.6.8/wavedrom.min.js"></script>
<script type="text/javascript">
function init() {
WaveDrom.ProcessAll();
}
window.onload = init;
</script>
</head>
<body>
<p>Some diagram:</p>
<script type="WaveDrom">
{ "signal": [
{"name": "clk", "wave": "p........."},
{"name": "valid" , "wave": "0101..01.0"},
{"name": "ready" , "wave": "x1x0.1x1.x"},
{"name": "payload", "wave": "x=x=..x==x","data":["D0","D1","D2","D3"]}
]}
</script>
</body>
</html>
As I mentioned on Gitter, in that case I'd suggest postprocessing Scaladoc's output to add or change it however you want.
Currently, we simply pass through any html tags from source to the documentation. @xuwei-k pointed out that this could be unsafe (say, someone puts a
<script>
tag with malicious content in a scaladoc comment and it slips by PR review).To make Scaladoc safer by default, I propose we whitelist allowed tags, and warn/error/entity-encode (on) other tags (tbd).
We should also offer a flag to opt out of this behavior.