russross / blackfriday

Blackfriday: a markdown processor for Go

blackfriday seems to double escape valid HTML #403

Open kevinburke opened 7 years ago

kevinburke commented 7 years ago

Let's say I have the following Markdown document:

<pre>quote: &#34;</pre>

&#34; is a valid HTML numeric character reference for a double quotation mark. Notably, it is what Go's html.EscapeString function produces for a " character.

If I run it through blackfriday, the ampersand is escaped again, yielding:

<p><pre>quote: &amp;#34;</pre></p>

It seems that blackfriday is escaping an ampersand that is part of a valid entity sequence. The equivalent would be if blackfriday turned &amp; into &amp;amp; (it does not do this).

The following sample program demonstrates the problem. I encountered this in real code when rendering code blocks with github.com/alecthomas/chroma and then compiling entire Markdown documents with blackfriday.

package main

import (
    "fmt"
    "html"

    blackfriday "gopkg.in/russross/blackfriday.v2"
)

func main() {
    in := html.EscapeString(`quote: "`)
    fmt.Println(string(blackfriday.Run([]byte("<pre>" + in + "</pre>"))))
}

Here's the relevant source from Go's html package, with the rationale in the comments:

var htmlEscaper = strings.NewReplacer(
    `&`, "&amp;",
    `'`, "&#39;", // "&#39;" is shorter than "&apos;" and apos was not in HTML until HTML5.
    `<`, "&lt;",
    `>`, "&gt;",
    `"`, "&#34;", // "&#34;" is shorter than "&quot;".
)

https://www.w3schools.com/html/html_entities.asp

kevinburke commented 7 years ago

Here is at least one specification documenting the correct treatment of entity references: http://spec.commonmark.org/0.28/#example-302

When I run their reference program on this source code:

<pre>quote: &quot;</pre>

I get

<pre>quote: &quot;</pre>

Running the same file through either v2 or v2-commonmark-testsuite yields the following result:

<p><pre>quote: &amp;quot;</pre></p>
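One way to catch this kind of output in a test is to scan rendered HTML for entity references whose ampersand has been escaped a second time. The sketch below is an assumption-laden illustration, not part of blackfriday: findDoubleEscaped and its regular expression are hypothetical, and the pattern would also flag documents that intentionally contain literal text such as &amp;quot;.

```go
package main

import (
	"fmt"
	"regexp"
)

// doubleEscaped matches an entity reference whose leading ampersand has been
// escaped again, such as "&amp;quot;" or "&amp;#34;". The pattern is an
// approximation chosen for this example, not a full entity grammar.
var doubleEscaped = regexp.MustCompile(`&amp;(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);`)

// findDoubleEscaped is a hypothetical helper (not a blackfriday API) that
// returns every double-escaped entity reference found in rendered HTML.
func findDoubleEscaped(rendered string) []string {
	return doubleEscaped.FindAllString(rendered, -1)
}

func main() {
	// Output reported above for blackfriday.
	fmt.Println(findDoubleEscaped("<p><pre>quote: &amp;quot;</pre></p>"))
	// Output reported above for the CommonMark reference program.
	fmt.Println(findDoubleEscaped("<pre>quote: &quot;</pre>"))
}
```

The first call reports the &amp;quot; sequence; the second finds nothing, since &quot; was passed through unchanged.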