golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.23k stars 17.47k forks source link

x/net/html: provide convenience methods FirstElementChild/NextElementSibling #15996

Open RayfenWindspear opened 8 years ago

RayfenWindspear commented 8 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.1 linux/amd64

  1. What operating system and processor architecture are you using (go env)?

GOARCH="amd64" GOBIN="" GOEXE="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOOS="linux" GOPATH="/home/user/go" GORACE="" GOROOT="/usr/lib/go-1.6" GOTOOLDIR="/usr/lib/go-1.6/pkg/tool/linux_amd64" GO15VENDOREXPERIMENT="1" CC="gcc" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0" CXX="g++" CGO_ENABLED="1"

  1. What did you do? If possible, provide a recipe for reproducing the error. A complete runnable program is good. A link on play.golang.org is best.

Here is a link to golang playground, but note that this library doesn't actually run on the playground. I am providing a playground link because I cannot format the code correctly here given the string formatting I am using.

https://play.golang.org/p/DrERE6s0tT

  1. What did you expect to see?

The last printed div is supposed to have children as per the html string.

  1. What did you see instead?

Instead it is parsed as a Text NodeType, and thus has no children, or even data.

RayfenWindspear commented 8 years ago

One other thing to note is that calling html.Render will reproduce the missing elements, despite them not being represented as Nodes.

kennygrant commented 8 years ago

Nodes also contain TextNodes for their content (including whitespace), which is what you're seeing here. You can test this with a simpler version of the html, with some whitespace before the last section tag, which becomes the last child node:

<section><div><div></div></div> </section>

Adding a test for section.LastChild.Type to your example shows it is of type TextNode (either with this simpler html or with your example). Removing the whitespace from your example means it works as you expect.

https://play.golang.org/p/uGfE2LTsjg

So I don't think this is a bug?

RayfenWindspear commented 8 years ago

Isn't whitespace supposed to be ignored as part of the HTML specification? I'm not sure if this is the right reference, but this portion of the HTML 5 spec is about space separated tokens, which I'm assuming should include tokenization of tags.

https://html.spec.whatwg.org/multipage/infrastructure.html#space-separated-tokens

RayfenWindspear commented 8 years ago

This may be a better resource. It specifically mentions inter-element-whitespace. It states that these Text nodes consisting of only whitespace must be ignored. The paragraph above defines inter-element-whitespace.

https://www.w3.org/TR/html/dom.html#inter-element-whitespace

With the html package behaving this way, it doesn't match the way the html is parsed into a DOM by a browser. If the browser had recognized these nodes as Text Nodes, there would be raw html bleeding through all over the web due to a few misplaced whitespaces. And there would be no way to access these children through either raw JS or jQuery.

kennygrant commented 8 years ago

Can you reproduce your expected behaviour in a browser using the DOM? If I try this in Chrome or Safari on your snippet or the simpler document (NB whitespace):

<html><head></head><body><section><div><div></div></div> </section></body></html>

using a similar query:

document.firstChild.lastChild.firstChild.lastChild

I get a textNode with whitespace content, which mirrors what this parser is doing. In your second link it does say they should parse whitespace as text nodes, but then says these should be ignored in the green box (which seems to contradict what browsers do). The DOM does have separate firstElementChild methods to ignore comments and whitespace, but firstChild etc do return white space.

I will leave it to the maintainers to comment further but I hope this helps a little.

RayfenWindspear commented 8 years ago

Interesting. I think I see what is going on here. It's not that the browser doesn't support this, it supplies it in a different way. Here is the functionality I have been expecting document.firstElementChild, which skips the empty Text Nodes as in the html specification. However, the net/html package only supplies the functionality of the form document.firstChild, which does not.

Given this realization, yes you are right to say that the package isn't malfunctioning. So in effect, this is actually a feature request rather than a bug report.

EDIT: Hah, I read your example and tried it before I finished reading your post and ended up discovering the same new methods.

RayfenWindspear commented 8 years ago

Chose not to edit previous post.

So the functionality I am looking for is easily provided by a for loop skipping over Text Nodes. Doing this manually provides me with the data that I am looking for as in this example.

https://play.golang.org/p/ZNKgZATPUI

Should I close the issue and go through the process to be able to add the functionality myself (I know pull requests on github are not granted), or is someone willing to add the feature for me?

Basically what is working for me amounts to this

`func firstElementChild(n html.Node) html.Node { for c := n.FirstChild; c != nil; c = c.NextSibling { if c.Type == html.ElementNode { return c } } return nil }

func nextElementSibling(n html.Node) html.Node { for s := n.NextSibling; s != nil; s = s.NextSibling { if s.Type == html.ElementNode { return s } } return nil }`

quentinmit commented 8 years ago

@RayfenWindspear You can use this issue for tracking that addition. I don't know how likely it is to be accepted, though, since as you demonstrated it is easy to implement yourself.

/cc @adg @nigeltao Are we likely to add these methods?

nigeltao commented 8 years ago

Yes, whitespace is ignored, but at rendering time, not at parse time. kennygrant gave a javascript example, but for a Python example, this program:

#!/usr/bin/python

import html5lib

h = '''<section id="job-snapshot-section">
    <h2>Job Snapshot</h2>
    <div class="section-body">

        <div class="snap-line">
            <strong><span id="CBBody_Snapshot__ctl10_SnapshotKey">Base Pay</span></strong>
            <span><span id="CBBody_Snapshot__ctl10_SnapshotValue">$150,000.00 - $160,000.00 /Year</span></span>
        </div>                                        
    </div>
</section>'''

doc = html5lib.parse(h)
for t in doc.itertext():
    print "[" + t + "]"

prints

[
    ]
[Job Snapshot]
[
    ]
[

        ]
[
            ]
[Base Pay]
[
            ]
[$150,000.00 - $160,000.00 /Year]
[
        ]
[                                        
    ]
[
]

because those (whitespace) text nodes are part of the DOM tree.

As for a FirstElementChild method, I suppose it's possible, although it's straightforward to provide that as a function (in your package). It doesn't need to be a method (in package html).

Or, if you want to loop only over element children, it's trivial:

for c := n.FirstChild; c != nil; c = c.NextSibling {
  if c.Type != html.ElementNode {
    continue
  }
  etc
}

so I'm not convinced yet for the need for FirstElementChild and NextElementSibling methods.