@nearsyh opened this issue 9 years ago
I’ll have to look at the spec, but I’m not sure that’s well-defined given all of HTML’s error recovery (e.g. optional closing tags). What’s the use case?
This may not be strictly necessary, but I think it would make some tasks easier. For example:
<div>
<h1>test</h1>
<h2>test</h2>
test
</div>
Adding such a function would make it easy to gather all of the text inside the outermost div.
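To sketch the kind of callback being asked for, something like the following could work; note that `ElementEndHandler` and `element_ended` are hypothetical names invented here for illustration and are not part of html5ever's API.

```rust
// Hypothetical illustration of an end-of-element callback; these names do
// not exist in html5ever and are only meant to show the desired shape.
trait ElementEndHandler {
    /// Called once the parser has closed an element, whether from an
    /// explicit end tag or from HTML's implicit/error-recovery closing.
    fn element_ended(&mut self, tag_name: &str, text_content: &str);
}

struct DivTextCollector {
    collected: Vec<String>,
}

impl ElementEndHandler for DivTextCollector {
    fn element_ended(&mut self, tag_name: &str, text_content: &str) {
        // For the <div> example above, this would receive all of the text
        // inside the outermost div once its end tag is reached.
        if tag_name == "div" {
            self.collected.push(text_content.to_string());
        }
    }
}
```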
You seem to be trying to do event-based parsing. Given how the tree builder can call `append_before_sibling`, `remove_from_parent`, or `reparent_children`, I’m not sure that really works (you might get incorrect results). You may have to collect the nodes into a tree data structure like https://github.com/SimonSapin/kuchiki before you can process them.
I suspect @hsivonen and @sideshowbarker have ideas about such APIs.
I’m guessing that @nearsyh would like to have a "streaming" HTML parser, like SAX and StAX provide for XML. So the question is, given the adoption agency algorithm and friends, is it possible to parse HTML incrementally while buffering less than the entire document? (Where parts of the document can be considered "done" and are not modified again.)
It’s possible to make a buffered SAX API. @hsivonen wrote one for use with the htmlparser he made (the same htmlparser whose source is also used by Gecko as its HTML parser). The sources for that SAX API are at http://hg.mozilla.org/projects/htmlparser/file/default/src/nu/validator/saxtree
However, that API buffers the entire HTML document it parses; it doesn’t use any strategy to buffer less than the whole document in the way described in https://github.com/servo/html5ever/issues/149#issuecomment-120936252.
I don’t actually know how practical it would be to try to implement a SAX/SAX-like event-based spec-conforming HTML-parsing API that buffered less than an entire document at one time. I think you could get some of it just by buffering all tables, but beyond that I don’t know what the other partial-buffering strategies would be. But I’m certain @hsivonen could give some insight on it.
(BTW, while the buffered mode is the default for @hsivonen’s SAX API, it also provides a fully-streaming (non-buffered) mode as an option. That’s actually the mode which the validator.nu code uses. However, in that mode, any markup it runs into that would require non-streaming parsing behavior—i.e., adoption agency algorithm and friends—causes a non-recoverable parse error.)
See http://krijnhoetmer.nl/irc-logs/whatwg/20150714#l-117 for a discussion I had with @gsnedders (one of the html5lib devs) about this.
It seems the sad reality is that, as he notes there, the only way to do a streaming API for spec-conformant HTML parsing is to either buffer everything or admit fatal errors for cases that require non-streaming behavior.
My conclusion is that a conforming "streaming" (SAX-like) HTML parser is only doable with trade-offs that are not worth it: either buffering up to the entire document, or introducing fatal errors.
@nearsyh, could you confirm that’s what you were trying to do?
@nox, how does `TreeSink::pop` relate to this?
I am not sure we properly call `pop` in all circumstances, but I guess it could be piggybacked for this feature.
I also would like some way to tell when elements end, and was hoping `pop` would help. Would it be straightforward to call it in all circumstances, or are there blockers to that?
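To make the piggybacking idea concrete, here is a standalone sketch of what forwarding `pop` to user code could look like. This is not html5ever's actual `TreeSink` trait; the names, signatures, and node-id scheme are invented for illustration, and the whole idea only works if `pop` is really called for every element.

```rust
use std::collections::HashMap;

// Illustrative sink-like helper: it accumulates text per open element and
// fires a user callback when the element is popped off the stack of open
// elements. Node ids and the callback signature are hypothetical.
struct EndTagNotifier<F: FnMut(usize, &str)> {
    text: HashMap<usize, String>,
    on_element_end: F,
}

impl<F: FnMut(usize, &str)> EndTagNotifier<F> {
    fn append_text(&mut self, node_id: usize, text: &str) {
        self.text.entry(node_id).or_default().push_str(text);
    }

    // Would be invoked from the sink's pop() hook, assuming the tree builder
    // calls pop() for every element it opens (including implicitly closed
    // ones), which is exactly the open question in this thread.
    fn pop(&mut self, node_id: usize) {
        if let Some(text) = self.text.remove(&node_id) {
            (self.on_element_end)(node_id, &text);
        }
    }
}
```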
For example, given some HTML code, the function would be called when the parser reaches the end tag.