golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.93k stars 17.66k forks source link

html/template: SGML processing Instructions escaped #30168

Open bep opened 5 years ago

bep commented 5 years ago

Go version go1.12beta2.

package main

import (
    "bytes"
    "html/template"
    "log"
    "strings"
)

func main() {
    log.SetFlags(0)
    tmpl, err := template.New("").Parse(`<?PITarget PIContent?>`)
    if err != nil {
        log.Fatal(err)
    }
    var b bytes.Buffer
    tmpl.Execute(&b, nil)

    s := b.String()
    if strings.Contains(s, "&lt;") {
        log.Fatal(b.String())
    }
}

The above prints &lt;?PITarget PIContent?>.

I would expect it to be left untouched, e.g. <?PITarget PIContent?>.

https://en.wikipedia.org/wiki/Processing_Instruction

kardianos commented 5 years ago

I note that html/template does not import the encoding/xml package https://godoc.org/html/template?imports because html5 is not xml.

bep commented 5 years ago

because html5 is not xml.

Note that XML is not mentioned in my bug report. According to W3C:

HTML is an application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language (SGML).

kardianos commented 5 years ago

https://en.m.wikipedia.org/wiki/Standard_Generalized_Markup_Language

HTML was theoretically an example of an SGML-based language until HTML 5, which browsers cannot parse as SGML for compatibility reasons.

HTML5 is not SGML either.

bep commented 5 years ago

HTML5 is not SGML either.

HTML 5 isn't mentioned in the html/template documentation. Are you suggesting that this issue should not be fixed?

I think it is safe to say that most browser implementations support SGML processing instructions, even if they don't use them. And since most browsers/parsers (except Go's) ignore them, they can be really useful.

katiehockman commented 5 years ago

/cc @mikesamuel for owners input

mikesamuel commented 5 years ago

I think it is safe to say that most browser implementations support SGML processing instructions

Per HTML5 rules, <?PITarget PIContent?> is treated as a bogus comment outside a foreign-content context.

After seeing <, the lexer is in the tag open state which says

U+003F QUESTION MARK (?)

This is an unexpected-question-mark-instead-of-tag-name parse error. Create a comment token whose data is the empty string. Reconsume in the bogus comment state.

which then appends content to the first > to the created comment node.

So HTML 5 may support PIs but using node type 8 (comment) not 7 (processing instruction). This can be seen by running:

let div = document.createElement('div');
div.innerHTML = '<?PI PI?>';
div.firstChild.nodeType === Node.COMMENT_NODE
// -> true

In contrast to the XML grammar, in HTML, <?PITarget > is a complete token.

Other constructs also enter the bogus comment state: <!foo>


It seems there are a few options:

Do nothing

Mentioned merely for completeness.

Recognize <?...> as a comment

html/template elides comments so to be consistent we would drop them from the output.

Retool to take a hybrid of XML and HTML as input.

The best way to move that forward might be to open an issue to collect use cases for preserving PIs in templates, and to flesh out a definition of the desired hybrid grammar.

As you point out, outputting &lt;?PI... is already buggy, so we probably needn't worry overly about backwards compatibility, but I can't speak for the Go maintainers.

mikesamuel commented 5 years ago

@kardianos

I note that html/template does not import the encoding/xml package

Fyi, html/template uses its own hand-written parser that does not assume that a template is a complete document fragment.

Open and close tags need not appear in the same template:

{{define "foo"}}
  <div>{{template "bar"}}
{{end}}
{{define "bar"}}
  </div>
{{end}}

And a template may not be in the same language as its caller:

{{define "html"}}
  <script>{{template "js"}}</script>
{{end}

{{define "js"}}
  alert('There are no <tags> in this <template>');
{{end}
kardianos commented 5 years ago

@mikesamuel Yes, sorry, that was indeed my point.

bep commented 5 years ago

html/template elides comments so to be consistent we would drop them from the output.

Please don't do that. Filtering out comments was probably also a bad idea originally, but I'm not going down that road. In contrast to comments, which I guess is mostly used to inform/remind people about a certain implementation detail, processing instructions have meanings to computers -- removing them would be a breaking thing.

Per HTML5 rules, <?PITarget PIContent?> is treated as a bogus comment outside a foreign-content context.

If html/templates is strictly HTML 5 that should be documented.

mikesamuel commented 5 years ago

@bep

If you want html/template to preserve things that look kind of like processing instructions, please make the case that that is worthwhile. This talk about processing instructions in HTML seems to be appealing to a standard that does not exist (see below).

You seem to be suggesting that using comments that look like PIs is a widespread practice; I have not encountered that in my work, which is why I asked for use cases. Please convince me that this is something developers need. I will need to see some numbers though.

If you do want <?...?> in output, you can do that in html/template today, the same way you can output a regular comment: https://play.golang.org/p/eOAB0UyYbYy . If that is bad ergonomically, please explain how and what template interpolations like <?PI {{.}}?> would mean.

Filtering out comments was probably also a bad idea originally

If you want to demonstrate this, please compare the cost of leaks due to developer comments unintentionally persisting in code to the cost of having to do something like the playground above for the few times you actually need a comment. To the best of my knowledge, the only repeatedly occurring use case for shipping comments in code is MS conditional compilation comments which were deprecated in 2015.

""" Important As of Internet Explorer 10, conditional comments are no longer supported by standards mode. """

I know of no use case for shipping comments to modern browsers that is not met by the playground above. Again, please convince me otherwise.

If html/templates is strictly HTML 5 that should be documented.

The documentation could be more specific, but I'm not just saying processing instructions are not part of HTML 5. I refer to HTML 5 because that's what modern browsers approximate. To my knowledge no version of HTML has ever supported processing instructions.

HTML 4 B3.3 SGML features with limited support says

"""

B.3.3 SGML features with limited support

SGML systems conforming to [ISO8879] are expected to recognize a number of features that aren't widely supported by HTML user agents. We recommend that authors avoid using all of these features.

...

B.3.6 Processing Instructions

...

Authors should be aware that many user agents render processing instructions as part of the document's text. """

Browsers do not and have not recognized processing instructions in HTML. What changed between HTML 4 and HTML 5 was how they chose not to parse processing instructions as Node.PROCESSING_INSTRUCTION.

bep commented 5 years ago

If you want youhtml/template to preserve things that look kind of like processing instructions, please make the case that that is worthwhile.

The html/template package is there mainly for security concerns, not to make opinionated adjustments to people's applications.

So:

I want the HTML parser to preserve anything that isn't a security concern. There are obvious use cases for using SGML processing instructions in HTML. I have a few of them myself, which is why I created this issue. I don't see why I need to detail them in this topic. The current behavior obviously needs to be fixed. And I cannot see how you can fix it by removing these processing instructions, as that would make a broken solution even more broken. So I suggest you do as very other HTML parsers on the planet: Leave them alone.

I will leave this thread now. Do as you please, but please consider my concerns above.

bep commented 5 years ago

Browsers do not and have not recognized processing instructions in HTML. What changed between HTML 4 and HTML 5 was how they chose not to parse processing instructions as Node.PROCESSING_INSTRUCTION.

I think you need to look at Go's HTML handling as very, very standard/basic and version agnostic. HTML 5 was released in 2014, Go 1.0 long, long before that.

mikesamuel commented 5 years ago

@bep

There are obvious use cases for using SGML processing instructions in HTML.

Great. I look forward to seeing them.

Browsers do not and have not recognized processing instructions in HTML. ...

I think you need to look at Go's HTML handling as very, very standard/basic and version agnostic. HTML 5 was released in 2014, Go 1.0 long, long before that.

How is Go's HTML handling relevant to what the HTML standard is?

akbyrd commented 1 week ago

Is it valid to use html/template with XML content? If so, stripping or escaping <?xml version="1.0" encoding="utf-8"?> is undesirable.

My understanding is that this type of processing instruction is also sometimes used in HTML. e.g. <?php echo "Hello, World!"; ?>.