golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/net/html: Tokenizer cannot round-trip <script> tag contents #7929

Open gopherbot opened 10 years ago

gopherbot commented 10 years ago

by martin@probst.io:

I'm not sure if this is a bug or working as intended according to the HTML5 parsing
algorithm, but it seems at least problematic from a user's perspective.

When parsing an HTML document that contains <script> tags, writing out the tokens
received will double escape any contained entities, thus <script> tags don't
round-trip through the tokenizer. See the attached patch, which adds two tests: one for
<script>"</script> (which leads to &#34; as the contents) and one for
<script>&#34;</script> (which leads to &amp;#34;).

That means re-parsing the output of tokenization adds more and more double escaping.
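
The compounding effect described above can be sketched with only the standard library. This is a simplified model, not the x/net/html code path: it assumes that each parse-and-print pass hands the <script> text back verbatim and then writes it out through html.EscapeString, which is what the report describes. The helper name reserialize is hypothetical.

```go
package main

import (
	"fmt"
	"html"
)

// reserialize models n parse-and-print passes over the text inside a
// <script> element. Assumption: each pass reads the text back verbatim
// (no unescaping) and prints it through html.EscapeString.
func reserialize(scriptText string, n int) string {
	for i := 0; i < n; i++ {
		scriptText = html.EscapeString(scriptText)
	}
	return scriptText
}

func main() {
	// Each additional pass escapes the output of the previous one.
	for n := 0; n <= 2; n++ {
		fmt.Printf("pass %d: %s\n", n, reserialize(`"`, n))
	}
}
```

Under that assumption, a single quote character becomes &#34; after one pass and &amp;#34; after two, matching the two test cases in the attached patch.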

There is a test for <style> just below the one I added that makes this look
intentional. But this is a real problem: using go.net/html to parse and re-serialize
documents breaks the documents.

Attachments:

  1. script_tags_test.diff (494 bytes)

bradfitz commented 10 years ago

Comment 1:

Labels changed: added repo-net.

Owner changed to @nigeltao.

Status changed to Accepted.

andybalholm commented 10 years ago

Comment 2:

I'm pretty sure that the problem isn't in the tokenization but in the printing.

evanj commented 4 years ago

The workaround I'm using is to use token.Data instead of token.String() for text tokens:

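// Presumably tokenType comes from Tokenizer.Next() and t from Tokenizer.Token().
var content string
if tokenType == html.TextToken {
  // Use the text as-is; String() would pass Data through EscapeString.
  content = t.Data
} else {
  content = t.String()
}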