golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.5k stars 17.6k forks source link

x/net/html: incorrect handling nodes in the <head> section #23064

Open jdeng opened 6 years ago

jdeng commented 6 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go1.9.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64" GOBIN="" GOEXE="" GOHOSTARCH="amd64" GOHOSTOS="darwin" GOOS="darwin"

What did you do?

package main

import (
    "bytes"
    "fmt"
    "golang.org/x/net/html"
    "strings"
)

func main() {
    s := "<html><head><meta/><a/><style></style></head><body></body></html>"
    doc, _ := html.Parse(strings.NewReader(s))

    var buf bytes.Buffer
    if err := html.Render(&buf, doc); err == nil {
        fmt.Println(buf.String())
    }
}

What did you expect to see?

Same structure as the input

 <html><head><meta/><a/><style></style></head><body></body></html> 

What did you see instead?

Nodes from are moved into the part

 <html><head><meta/></head><body><a><style></style></a></body></html> 
jdeng commented 6 years ago

Adding the default handling code in parser.go/inHeadIM helped to solve this but I am not sure whether there are side effects.

    case StartTagToken:
        switch p.tok.DataAtom {
  ...
        case a.Head:
            // Ignore the token.
            return true
        default:
            // Ignore the token.
            return true
        }
namusyaka commented 6 years ago

The a element is not metadata content. So I think the behavior is not kindness but not wrong, because it is not treated as a child of the head element. I think there are two approaches to this problem.

  1. The first is to return a parse error for such illegal html, it's the ease to understand, but it will largely destroy backward-compatibility.
  2. The second is to make the document about Parse more kind and prompt warning.
jdeng commented 6 years ago

Thanks. I think it is wrong because it misplaces the following <style> node silently. Rejecting the HTML because of unrecognized nodes in the head section seems a little too strict and it will be a problem for quite some Microsoft office generated HTML docs.

namusyaka commented 6 years ago

Certainly, it is a problem that the style element is unintentionally placed. But that is due to the fact that the style element is handled as a child of the a element. I think your expectation value looks incorrect. Any another ideas?

The current implementation begins with parsing the head element when html and head elements are omitted. If we deal with this problem properly we have to be able to properly parse the HTML as follows.

"<head><a/><style></style>" and "<a/><style></style>"

namusyaka commented 6 years ago

I'll visit again after resolveing #23071.

namusyaka commented 5 years ago

Sorry for the delay and my confusing comment. While investigating this issue, I found a fact that the behavior is the same with chrome's one.

So please let me change my opinion. This is an expected behavior, I think.