servo / html5ever

High-performance browser-grade HTML5 parser

Add a method in TreeSink trait which will be called when an element ends #149

Open nearsyh opened 9 years ago

nearsyh commented 9 years ago

For example, given HTML like:

<div>
<!-- some code -->
</div>

The function will be called when the parser reaches the end tag.

SimonSapin commented 9 years ago

I’ll have to look at the spec, but I’m not sure that’s well-defined given all of HTML’s error recovery (e.g. optional closing tags). What’s the use case?

nearsyh commented 9 years ago

This may not be necessary, but I think it makes some work easier. For example,

<div>
  <h1>test</h1>
  <h2>test</h2>
  test
</div>

Adding this function can make it easy to gather all text in the outermost div.

SimonSapin commented 9 years ago

You seem to be trying to do event-based parsing. Given how the tree builder can call append_before_sibling, remove_from_parent, or reparent_children, I’m not sure that really works. (You might get incorrect results.) You may have to collect the nodes in a tree data structure like https://github.com/SimonSapin/kuchiki before you can process them.
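To illustrate the "collect into a tree first" approach, here is a minimal sketch in Rust. Everything in it (`Tree`, `Node`, `NodeData`, `gather_div_text`) is a hypothetical stand-in written for this example; it is not html5ever's or kuchiki's API, and the method names only loosely mirror the tree-builder operations mentioned above. The point it tries to show is that text is gathered only after all mutations are done, so a late removal or move cannot invalidate a result that was already emitted.

```rust
// Hypothetical minimal arena tree; not html5ever's or kuchiki's types.
#[allow(dead_code)]
#[derive(Debug)]
enum NodeData {
    Element(String),
    Text(String),
}

#[derive(Debug)]
struct Node {
    data: NodeData,
    parent: Option<usize>,
    children: Vec<usize>,
}

#[derive(Debug, Default)]
struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    // Create a detached node and return its index in the arena.
    fn create(&mut self, data: NodeData) -> usize {
        self.nodes.push(Node { data, parent: None, children: Vec::new() });
        self.nodes.len() - 1
    }

    // Append `child` as the last child of `parent`, detaching it first.
    fn append(&mut self, parent: usize, child: usize) {
        self.remove_from_parent(child);
        self.nodes[child].parent = Some(parent);
        self.nodes[parent].children.push(child);
    }

    // Detach `child` from its current parent, if it has one.
    fn remove_from_parent(&mut self, child: usize) {
        let parent = self.nodes[child].parent.take();
        if let Some(p) = parent {
            self.nodes[p].children.retain(|&c| c != child);
        }
    }

    // Depth-first concatenation of all text below `id`.
    fn text_content(&self, id: usize, out: &mut String) {
        match &self.nodes[id].data {
            NodeData::Text(t) => out.push_str(t),
            NodeData::Element(_) => {
                for &c in &self.nodes[id].children {
                    self.text_content(c, out);
                }
            }
        }
    }
}

// Builds a tree shaped like the <div> example and gathers its text
// once the tree is complete.
fn gather_div_text() -> String {
    let mut tree = Tree::default();
    let div = tree.create(NodeData::Element("div".to_string()));
    let h1 = tree.create(NodeData::Element("h1".to_string()));
    let h2 = tree.create(NodeData::Element("h2".to_string()));
    let t1 = tree.create(NodeData::Text("one ".to_string()));
    let t2 = tree.create(NodeData::Text("two ".to_string()));
    let t3 = tree.create(NodeData::Text("three".to_string()));
    tree.append(div, h1);
    tree.append(h1, t1);
    tree.append(div, h2);
    tree.append(h2, t2);
    tree.append(div, t3);

    let mut out = String::new();
    tree.text_content(div, &mut out);
    out
}
```

Because `text_content` runs on the finished tree, a call to `remove_from_parent` made at any earlier point is already reflected in the output; an event-based consumer that had emitted the text at "end of element" time would have no way to take it back.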

Ms2ger commented 9 years ago

I suspect @hsivonen and @sideshowbarker have ideas about such APIs.

SimonSapin commented 9 years ago

I’m guessing that @nearsyh would like to have a "streaming" HTML parser, like SAX and StAX do for XML. So the question is, given the adoption agency algorithm and friends, is it possible to parse HTML incrementally by buffering less than the entire document? (Where parts of the document can be considered "done" and are not modified again.)

sideshowbarker commented 9 years ago

It’s possible to make a buffered SAX API. @hsivonen wrote one for use with the htmlparser he made (the same htmlparser the source of which is also used by gecko as its HTML parser). The sources for that SAX API are at http://hg.mozilla.org/projects/htmlparser/file/default/src/nu/validator/saxtree

However, the code for that API causes the entire HTML document it parses to be buffered; it doesn’t do it by using any strategies to buffer less than the entire document, in the way described in https://github.com/servo/html5ever/issues/149#issuecomment-120936252.

I don’t actually know how practical it would be to try to implement a SAX/SAX-like event-based spec-conforming HTML-parsing API that buffered less than an entire document at one time. I think you could get some of it just by buffering all tables, but beyond that I don’t know what the other partial-buffering strategies would be. But I’m certain @hsivonen could give some insight on it.

(BTW, while the buffered mode is the default for @hsivonen’s SAX API, it also provides a fully-streaming (non-buffered) mode as an option. That’s actually the mode which the validator.nu code uses. However, in that mode, any markup it runs into that would require non-streaming parsing behavior—i.e., adoption agency algorithm and friends—causes a non-recoverable parse error.)

sideshowbarker commented 9 years ago

See also https://github.com/inikulin/parse5/issues/26#issuecomment-113298544

sideshowbarker commented 9 years ago

See http://krijnhoetmer.nl/irc-logs/whatwg/20150714#l-117 for a discussion I had with @gsnedders (one of the html5lib devs) about this.

It seems the sad reality is that, as he notes there, the only way to do a streaming API for spec-conformant HTML parsing is to either buffer everything or admit fatal errors for cases that require non-streaming behavior.
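The "admit fatal errors" option can be sketched roughly as follows. This is a toy model under stated assumptions, not the htmlparser SAX code or anything in html5ever: `StreamingSink`, `StreamError`, `end_element`, and `reparent` are hypothetical names invented for this example, and nodes are plain integer ids. The idea is that once an element has been reported as finished, any later tree mutation that touches it cannot be honoured by a streaming consumer, so it is surfaced as an unrecoverable error instead.

```rust
use std::collections::HashSet;

// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum StreamError {
    // A mutation touched a node already reported as finished.
    NonStreamable { node: usize },
}

// Hypothetical streaming consumer: it forwards events immediately
// and remembers which nodes it has already declared "done".
#[derive(Default)]
struct StreamingSink {
    finished: HashSet<usize>,
    events: Vec<String>,
}

impl StreamingSink {
    // Report that `node` has ended; downstream may now act on it.
    fn end_element(&mut self, node: usize) {
        self.finished.insert(node);
        self.events.push(format!("end {}", node));
    }

    // A late move (as the adoption agency algorithm can require) is only
    // acceptable if the node has not been handed downstream yet.
    fn reparent(&mut self, node: usize, new_parent: usize) -> Result<(), StreamError> {
        if self.finished.contains(&node) {
            return Err(StreamError::NonStreamable { node });
        }
        self.events.push(format!("move {} under {}", node, new_parent));
        Ok(())
    }
}
```

The buffering alternative amounts to never calling `end_element` until the whole document has been parsed, which is why it degenerates into building the full tree.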

SimonSapin commented 9 years ago

My conclusion is that a conforming "streaming" (SAX-like) HTML parser is only doable with trade-offs that are not worth it. (Either buffering up to the entire document, or introducing fatal errors.)

@nearsyh, could you confirm that’s what you were trying to do?

SimonSapin commented 7 years ago

@nox how does TreeSink::pop relate to this?

nox commented 7 years ago

I am not sure we properly call pop in all circumstances, but I guess it could be piggybacked for this feature.

max-heller commented 4 months ago

> I am not sure we properly call pop in all circumstances, but I guess it could be piggybacked for this feature.

I also would like some way to tell when elements end and was hoping pop would help. Would it be straightforward to call it in all circumstances or are there blockers to that?