golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.65k stars 17.61k forks source link

x/net/html: void element <link> has child nodes #10535

Open dvyukov opened 9 years ago

dvyukov commented 9 years ago

The following program crashes:

package main

import (
    "golang.org/x/net/html"
    "io/ioutil"
    "strings"
)

func main() {
    nodes, err := html.ParseFragment(strings.NewReader("<svg ><link >0"), nil)
    if err != nil {
        return
    }
    for _, n := range nodes {
        if err := html.Render(ioutil.Discard, n); err != nil {
            panic(err)
        }
    }
}
panic: html: void element <link> has child nodes

Render must be able handle Parse output. Or otherwise Parse must not accept the input as valid.

On commit 6f62f426de90c0ed6a55207b51476115fcb17237.

dvyukov commented 9 years ago

Similar crash for "<svg ><input >0"

dvyukov commented 9 years ago

Similar crash for `"0"``

dvyukov commented 9 years ago

the same for area element

andybalholm commented 9 years ago

A "correct" parse according to the HTML5 spec can produce a parse tree that is not well-formed. But what should we do about it in Render? My initial thought would be that if a void element has child nodes, we should just render it as if it weren't a void element. Of course that is garbage, but it is in keeping with the principle of garbage in, garbage out.

dvyukov commented 9 years ago

If an input is parsed, then I would expect it to be serialized into its original form. The next HTML5-complaint parser in the chain should be able to parse it successfully again, right?

andybalholm commented 9 years ago

That sounds good, but it is very far from the case with HTML5. HTML5 takes Postel's principle to the extreme: any random string of data can be parsed as an HTML5 document. (It probably won't validate, but it will parse.) There are places in the spec where parsers are allowed to return errors, but they also specify what the result should be if they don't. In all those cases, we chose to have the html package continue parsing instead of returning an error.

Having the rendered output exactly the same as the input is actually quite rare, because of things like capitalization and attribute quoting. <INPUT TYPE=TEXT> renders as <input type="text" />. But where things get really interesting is in situations like this where the parse tree is not well-formed. See http://godoc.org/golang.org/x/net/html#Render for more details.

dvyukov commented 9 years ago

Well, at least it should be parsed by net/html parser again and maybe by some subset of other parsers. I see two use cases:

  1. analyze the HTML in memory or 2. analyze, modify and serialize HTML And the second case does not work now for semi-invalid documents. Assuming that the original document is parsed successfully by the target HTML parser (e.g. a permissive browser parser), if we serialize an invalid document the same way it was in the original document, it still will be parsed by the target parser, so we did not make things worse. We just allow to modify and serialize semi-invalid documents.
andybalholm commented 9 years ago

I agree that returning an error from Render is probably not what we want to do in this case. And we should do what we reasonably can to make invalid trees render sensibly. (Actually I suspect that this tree is valid; does SVG allow children for those elements?)

But we don't want to fill the Render function with special-case heuristics that attempt to reconstruct the markup underlying invalid parse trees.

For your use case #2, it would be very desirable for Render to never return an error unless the destination Writer returns one. That would make its operation parallel to Parse, which accepts all possible inputs. @nigeltao, what do you think? Should we eliminate errors from Render, and just make it produce the least-wrong output practical?

nigeltao commented 9 years ago

I think this is WAI. Render's doc comment already says, "Rendering is done on a 'best effort' basis".

The HTML5 parsing algorithm (https://html.spec.whatwg.org/multipage/syntax.html#tree-construction) is an enormous nest of special cases. It prints at 130 pages. I don't think that it guarantees to be idempotent, or even self-consistent (e.g. "void elements" are separately listed in a separate document at http://www.w3.org/TR/html5/syntax.html#void-elements), and I can't see an obvious proof that it is either.

Given that, the Rendering algorithm is naive, and doesn't promise to render everything you can parse. In any case, the error in the original post is indeed accurate: the void element has child nodes, and it shouldn't.

You could argue that Render could return a special RenderingError, separate from I/O errors, that callers are free to ignore. But in general, the space of "semi-invalid documents" is ill-defined, and as I said earlier, it's not clear that the parser algorithm guarantees to produce a "valid document", so I don't think Render should never return errors (other than I/O errors).

You could argue, then that the parser shouldn't have produced a tree that was like that, but the parser is what it is, specified by a 130-page spec that's already too complicated. It would be nice if HTML5 was based on sensible, consistent foundations, but it isn't.

andybalholm commented 9 years ago

As I think about this more, I think the real issue is that we are applying the HTML void element list to SVG elements. I'm pretty sure that the input element that has children is an svg:input (though input doesn't mean anything that I know of in that namespace), not a regular HTML input. So maybe we should check the namespace before we check the void element list, since the concept of void elements doesn't apply to foreign content.