google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0
5.16k stars 660 forks

any interest in modifying code for binary search of tag names and for tag_in ? #287

Closed kevinhendricks closed 9 years ago

kevinhendricks commented 9 years ago

Hi, I have modified gumbo-parser to expand the TagNames to include the full set of presentation elements in MathML and the full SVG tag list (according to the latest spec). Given the size of the new tag list, a binary search is needed to quickly assign GumboTag enums.
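For illustration, a name-to-enum lookup of the kind described above could be sketched as follows. This is a minimal sketch, not the code from the patch: `DemoTag`, `kDemoTagNames`, and `demo_tag_from_name` are hypothetical names, the table is heavily abbreviated, and the comparison is case-sensitive for brevity (a real HTML parser would need a case-insensitive match).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, abbreviated tag enum. Its order matches the sorted name
 * table below, so a table index doubles as the enum value. */
typedef enum {
  DEMO_TAG_A,
  DEMO_TAG_CIRCLE,
  DEMO_TAG_DIV,
  DEMO_TAG_MFRAC,
  DEMO_TAG_SVG,
  DEMO_TAG_TD,
  DEMO_TAG_UNKNOWN
} DemoTag;

/* Must stay sorted in strcmp order, otherwise bsearch gives wrong answers. */
static const char *const kDemoTagNames[] = {
  "a", "circle", "div", "mfrac", "svg", "td"
};

/* bsearch passes the key first and a pointer to the array element second. */
static int demo_compare_names(const void *key, const void *elem) {
  return strcmp((const char *)key, *(const char *const *)elem);
}

/* Maps a tag name to its enum value in O(log n) string comparisons. */
DemoTag demo_tag_from_name(const char *name) {
  size_t count = sizeof(kDemoTagNames) / sizeof(kDemoTagNames[0]);
  const char *const *hit = bsearch(name, kDemoTagNames, count,
                                   sizeof(kDemoTagNames[0]),
                                   demo_compare_names);
  return hit ? (DemoTag)(hit - kDemoTagNames) : DEMO_TAG_UNKNOWN;
}
```

The design relies on the enum and the name table being kept in the same sorted order, which is exactly the property that makes inserting new tags change existing enum values.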

But since we can now work with sorted lists of integers and complete sets of SVG and MathML tags, we can easily use binary search on sorted integer lists to detect whether a tag is a valid MathML tag (i.e. part of the presentation subset) or a valid SVG tag.

With this capability in place we can convert all of the calls to tag_in in parser.c and its related cousins to properly handle any C pre-processor determined list of tags (no more varargs needed) and to use binary search to determine if a tag is in that list.
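A membership test over a pre-sorted list of tag values, as proposed above, might look like the sketch below. The function name `demo_tag_in` and the list `kDemoTableTags` are hypothetical, and the integers stand in for GumboTag values; this is not the signature used in parser.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if `tag` occurs in `sorted_tags`, which must be in
 * ascending order. Classic half-open binary search over [lo, hi). */
bool demo_tag_in(int tag, const int *sorted_tags, size_t count) {
  size_t lo = 0, hi = count;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (sorted_tags[mid] == tag) return true;
    if (sorted_tags[mid] < tag) {
      lo = mid + 1;  /* target can only be in the upper half */
    } else {
      hi = mid;      /* target can only be in the lower half */
    }
  }
  return false;
}

/* Hypothetical pre-sorted list; the values stand in for enum constants. */
static const int kDemoTableTags[] = { 3, 8, 21, 34, 55 };
```

A caller would pass the list and its length, for example `demo_tag_in(tag, kDemoTableTags, 5)`, instead of a varargs sequence terminated by a sentinel.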

With this and the capability to quickly determine whether something is a valid SVG or MathML element, you could easily modify the parser to prevent the confusion that comes from mislabeling HTML tags as MathML or SVG tags, which has caused a number of issues.

I have completed part one of this where I modify src/tag.c, add src/tag.h, and update the gumbo.h tag enum or expand its contents to properly match the list in tag.c.

Before I go to the trouble of converting all of the varargs calls to tag_in and its relatives in parser.c to use pre-determined tag lists (pre-sorted), I was hoping someone would look at the part 1 patch and let me know if they are at all interested.

kevinhendricks commented 9 years ago

I am happy to e-mail the part-1 patch (using binary search and much larger recognized tag enum) to anyone who wants it for review as my own gumbo-parser fork on github has a number of other changes you will not want (working with xhtml parsing, error reporting, etc). Just let me know where to send it if anyone is interested.

kevinhendricks commented 9 years ago

Hi, I committed it to my fork so that the commit would show up and you could more easily decide if this is in any way the direction you would like to go.

https://github.com/kevinhendricks/gumbo-parser/commit/ae3b57af872a1eb9788f444a9f7de279f09e9b65

kevinhendricks commented 9 years ago

And here is what a parser modified to replace varargs calls with const int sorted lists for tag_in calls looks like. A similar approach can be used for the remaining use of varargs for the in_scope calls as well.

https://github.com/kevinhendricks/gumbo-parser/commit/4882c85ab0ba033d6559ee389e923b8c16732a21

That version has some of my changes for XHTML5 parsing but I would be happy to send a patch against a clean version from today's master if anyone is interested.

Kevin

nostrademons commented 9 years ago

I'm a little concerned about the impact of this on other language bindings, many of which provide enums for GumboTag that are defined in the scripting language. At the very least, the Python bindings would have to be updated. @craigbarnes, @rubys, @karlwestin, @rgrove, would changing the values of the GumboTag mess things up for the Lua, Ruby, Node bindings or for Sanitize? Any other bystanders have concerns? I do like the idea of the tag lists being more complete, but I don't want to make unnecessary work for the 10+ other language bindings, and I suspect that SVG/MathML parsing is a relatively niche use case compared to HTML and don't want to bloat the API for common usages.

I don't like the usage of numbered const int lists in the other patch. It's important that it be possible to follow along with the spec while reading the code, and this separates the declaration of the list from its usage. I would rather pass a bitset by value and construct it with some macro magic than have to jump between file locations to read every clause of the spec.
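One way to read the "bitset by value constructed with some macro magic" idea is sketched below. It is only an assumption about what such an interface could look like: `DemoTagSet`, `DEMO_BIT`, `DEMO_TAGSET`, and `demo_tagset_has` are hypothetical names, and a single 64-bit word limits this sketch to 64 distinct tags, so a larger tag enum would need a multi-word representation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, abbreviated tag enum (must stay below 64 entries here). */
typedef enum {
  DEMO_B, DEMO_I, DEMO_TABLE, DEMO_TD, DEMO_TR, DEMO_LAST
} DemoTag;

/* A set of tags packed into one word, cheap to pass by value. */
typedef struct { uint64_t bits; } DemoTagSet;

/* One bit per tag; sets are built by OR-ing bits at the call site. */
#define DEMO_BIT(tag) (UINT64_C(1) << (tag))
#define DEMO_TAGSET(expr) ((DemoTagSet){ (expr) })

/* Constant-time membership test. */
bool demo_tagset_has(DemoTagSet set, DemoTag tag) {
  return ((set.bits >> tag) & UINT64_C(1)) != 0;
}
```

With this shape the list stays next to the clause that uses it, for example `demo_tagset_has(DEMO_TAGSET(DEMO_BIT(DEMO_TD) | DEMO_BIT(DEMO_TR)), tag)`, and no ordering of the list is required.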

rgrove commented 9 years ago

It looks to me like Nokogumbo only relies on gumbo_normalized_tagname(), so I think it would be unaffected (and by extension Sanitize would be unaffected).

kevinhendricks commented 9 years ago

Understood about wanting to keep the lists right in the code. I did not think you could pass an initializer as an argument, so I decided to move them to the top of each routine to try to keep them as close as possible to their use point. I will try to see if there is any macro definition that would help in this case.

My use case is epub2 and epub3 so svg is common to expand cover images and mathml is required for the epub3 spec. I was able to fix your mismatched ns tag case where mathml is used in an html table using an extra if to check if that tag really was part of svg or mathml before making it part of those namespaces in handle_token. There are no mathml td or tr elements. So it is easy to detect.

So doing binary searches on a flat enum of tags is a viable solution. Our project Sigil will use that approach since we need xhtml parsing of non-void but self-closed tags as well and so must maintain our own version of gumbo anyway.

Thanks for looking at it.

Kevin

kevinhendricks commented 9 years ago

As it turns out, you can pass an initializer list as an argument in C99 if you cast it properly (C99 calls this a compound literal). In other words, you could do the following in an argument to a function that expects an int pointer in that position:

tag_in(token, (int []) { TAG1, TAG2, TAG3, TAG4 }, cnt);

and

nodetag_in(node, (int []) { TAG1, TAG2, TAG3, TAG4 }, cnt);

I will give that a try for our own version.
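As a small self-contained illustration of the compound-literal call style shown above, the sketch below also derives the element count at compile time so it does not have to be passed by hand. `demo_int_in` and `DEMO_LIST` are hypothetical names, and the lookup is a plain linear scan to keep the focus on the call syntax.

```c
#include <stdbool.h>
#include <stddef.h>

/* Linear membership test over an int list of known length. */
bool demo_int_in(int value, const int *list, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    if (list[i] == value) return true;
  }
  return false;
}

/* Expands to TWO arguments: a C99 compound literal holding the values,
 * and the number of elements in it (computed with sizeof, not evaluated). */
#define DEMO_LIST(...) \
  (const int[]){ __VA_ARGS__ }, \
  (sizeof((const int[]){ __VA_ARGS__ }) / sizeof(int))
```

A call then reads `demo_int_in(tag, DEMO_LIST(2, 5, 9))`. Compound literals are a C99 feature and are not part of standard C++, which matters if the parser ever needs to compile as C++.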

kevinhendricks commented 9 years ago

FWIW, here is the commit that puts the now-sorted tag lists back where they belong in the code, using the C99 initializer-as-argument approach:

https://github.com/kevinhendricks/gumbo-parser/commit/ea248f03bcbf6ce272e8269198add05dda00a138

nostrademons commented 9 years ago

I've thought a bit more about this in context of some of the other upcoming patches that will likely go in before 1.0.

I do want to extend the tag list to include common SVG and MathML tags. It'll make the parser more complete and a bit easier to use for future projects that wish to use this. Plus, a couple new tags (notably and ) have been added to the W3C spec with special handling, and so we can't get away from adding more tags to the enum without sacrificing spec compatibility.

However, I want to honor backwards-compatibility promises and make migration as easy as possible. A quick survey of the language bindings that I know of showed that D, Lua, C#, and Python were dependent upon the enum list. Most of these copied source code in directly (often, their FFI is structured as a C parser that automatically extracts the language binding information from a C header file), so they won't immediately break, but I'd still like to minimize arbitrary migration work. So there are a couple restrictions on how new tags can go in:

  1. They go at the end of the GumboTag enum. This means the integer values of enum constants in other languages won't suddenly change; worst-case, their GUMBO_TAG_UNKNOWN handling won't trigger when they expect it to. (Language/library bindings should consider unknown tags as >= GUMBO_TAG_UNKNOWN, to allow for future expansion.) This also means that binary search won't work, because we can't re-sort the enum as new tags are added. If we're particularly worried about performance we can use Ragel to build a finite state machine for matching tags; we already use this quite effectively for entity references, and it should be even faster than binary search.
  2. I'm holding off on merging any patches with new tags until 0.10, so we can honor the semantic versioning promise in the README. Likely included in 0.10: next/prev fields, template tag, fixes, and possibly new tags. It's API additions only, no changes, so minor version increment only. I'd like to do a bugfix-only release (0.9.3) before then so people can get the benefits of correctness fixes without the risk of API changes.
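The recommendation in point 1, that bindings treat anything at or beyond GUMBO_TAG_UNKNOWN as unknown, can be illustrated with the sketch below. The enum is a hypothetical, abbreviated copy of what an older binding might have extracted from the header; the function names are invented for the example.

```c
#include <stdbool.h>

/* Hypothetical enum as an older binding might have copied it. */
typedef enum {
  OLD_TAG_HTML,
  OLD_TAG_BODY,
  OLD_TAG_DIV,
  OLD_TAG_UNKNOWN
} OldBindingTag;

/* Robust check: values appended to the enum by a newer library version
 * compare >= UNKNOWN here, so the old binding degrades gracefully. */
bool binding_tag_is_unknown(int raw_tag) {
  return raw_tag >= OLD_TAG_UNKNOWN;
}

/* Fragile check: a newly appended tag value is neither a tag this
 * binding knows nor recognised as unknown. */
bool binding_tag_is_unknown_fragile(int raw_tag) {
  return raw_tag == OLD_TAG_UNKNOWN;
}
```

Under these assumptions, a value such as `OLD_TAG_UNKNOWN + 3` coming from a newer library is reported as unknown by the first function but slips past the second.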

@aroben's patch #286 is going in, so you probably don't want to do major surgery to any of the node_tag_in stuff until after the HTML_QN stuff goes in. It's not sufficient (in HTML5) to detect HTML vs. SVG vs. MathML in the tokenizer, because the same tag may be in a different namespace depending on where in the DOM it is. For example, is both an SVG tag and an HTML one; it's in the SVG namespace if it occurs within an <svg> element and in HTML otherwise. So the QualName stuff is necessary to write a correct parser, which is the priority for Gumbo.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/rubys"><img src="https://avatars.githubusercontent.com/u/4815?v=4" />rubys</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Regarding nokogumbo, I'm not overly worried about breaking changes. If you make such a change, I'll keep up. But since you are planning future releases, fragment parsing and error reporting items from the wish list are both of interest.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/kevinhendricks"><img src="https://avatars.githubusercontent.com/u/8493752?v=4" />kevinhendricks</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>FWIW, personally, I would not worry about downstream breakage at all. It is a trivial fix period. It is better to get it updated now imho. Hell it took me all of 5 seconds to update the gumboc.py code. </p> <p>And if you look at the patch, it does examine the current node of the tree in the DOM in parser.c in handle_token. It simply checks to see if a tag on the token could actually be part of svg or mathml namespaces only after <em>all</em> other tests are done before assigning it to be handled as foreign content. Svg and mathml are xml based namespaces and not html soup. 
They have a clearly defined set of tags period. The parser should never have created a math:td node in the failing testcase since td is not a legal tag in the svg or mathml namespaces. This is easy to check and prevents pure html tags from being pushed incorrectly into foreign content. As for tag overlap, there are only 5 tags that exist in common in the svg and html namespaces. They are "a, title, script, font, style". If you read the html5 spec it is written understanding that overlap. For example see how the spec checks the attributes of font to see if it is part of html or svg. The same goes with how title is used in the spec and script and style. The spec itself was written understanding those specific cases. Once they are handed, the tag name is enough to determine if it is legal in the svg or mathml namespaces. </p> <p>So would you please cite one instance in the spec where the tag itself is not enough information given how the conditions are written? I could not find any but I must be missing it. Even the integration points code properly handles cases in mathml and svg when an html tag may happen. The spec never says check mo in the mathml namespace, it simply says to check the tag is mo.</p> <p>And, fwiw, it is simply not good coding to use an uintptr type for an enum just so a int value can be stored in a void pointer in a gumbo vector to generate a self-growing list of what are effectively ints. There is no type safety in that. In C in general with all of its void pointers flying around there is no real type safety but making an enum a pointer type just to enable it to be cast to a void* and back is truly not a good idea.</p> <p>As for correctness vs speed, you are right, correctness should come first, Correctness should also trump any API concerns as well and, backwards compatibility should not be a concern until correctness is reached. 
All of that said, it makes no sense to ignore simple speedups the don't impact correctness.</p> <p>Thank you for at least considering the patch. I will of course split out the simple other fixes you want and pass them along as separate pull requests once I get a free moment. We (Sigil) will simply go with our fork of gumbo until these issues are addressed on your side in some way. I do encourage you to dismiss your backwards compatibility concerns on a pre 1.0 version of a library whose spec is constantly changing underneath you since the downstream fixes are trivial and simply not worth worrying about. Instead, I would focus on releasing early and often with quick improvements and bug fixs .... but that is of course your call not mine. </p> <p>Take care,</p> <p>Kevin</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/aroben"><img src="https://avatars.githubusercontent.com/u/917945?v=4" />aroben</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <blockquote> <p>The parser should never have created a math:td node in the failing testcase since td is not a legal tag in the svg or mathml namespaces.</p> </blockquote> <p>I don't think this is quite right. I believe a <code>math:td</code> node is actually what the HTML parsing spec expects in this case. You can see this in the <a href="http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%0A%3Ctable%3E%3Ctd%3E%3C%2Ftable%3E%0A%3Cmath%3E%3Ctd%3E">Live DOM Viewer</a>, which represents elements in the HTML namespace with uppercase tag names and elements in other namespaces with lowercase tag names. 
Or you can see it in <a href="http://jsbin.com/furagujepe/1">this JSBin</a>.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/kevinhendricks"><img src="https://avatars.githubusercontent.com/u/8493752?v=4" />kevinhendricks</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Again, td does not exist in the namespace for mathml. Check out its DTD (<a href="http://www.w3.org/Math/DTD/Overview.html">http://www.w3.org/Math/DTD/Overview.html</a>). The html5 spec only uses Presentation MathML which is even a subset of that DTD. If you look closely you will see Presentation Mathml has it own table elements mtr, and mtd. So unless you are rewriting MathML's DTD on the fly, the tag "math:td" simply does not exist and can never be valid mathml.</p> <blockquote> <p>On Feb 10, 2015, at 9:26 AM, Adam Roben notifications@github.com wrote:</p> <p>The parser should never have created a math:td node in the failing testcase since td is not a legal tag in the svg or mathml namespaces.</p> <p>I don't think this is quite right. I believe a math:td node is actually what the HTML parsing spec expects in this case. You can see this in the Live DOM Viewer, which represents elements in the HTML namespace with uppercase tag names and elements in other namespaces with lowercase tag names. Or you can see it in this JSBin.</p> <p>— Reply to this email directly or view it on GitHub.</p> </blockquote> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/aroben"><img src="https://avatars.githubusercontent.com/u/917945?v=4" />aroben</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>I definitely believe that it results in invalid MathML. But it's possible to create all kinds of non-conforming documents using HTML. 
It's how browsers behave, and so it's how the spec is written.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/kevinhendricks"><img src="https://avatars.githubusercontent.com/u/8493752?v=4" />kevinhendricks</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Hi,</p> <p>That is true in the html namespace not in the svg or mathml namespace where the parser spec falls back to parsing using xhtml rules. Putting a mathml prefix on a td tag inside of the math tag in no way makes it a valid mathml tag of any kind. There are places you can mix in valid html tags into both svg and mathml namespace (but still valid html tags) which is why the spec has all of the "integration point" rules.</p> <p>Kevin</p> <blockquote> <p>On Feb 10, 2015, at 10:01 AM, Adam Roben notifications@github.com wrote:</p> <p>I definitely believe that it results in invalid MathML. But it's possible to create all kinds of non-conforming documents using HTML. It's how browsers behave, and so it's how the spec is written.</p> <p>— Reply to this email directly or view it on GitHub.</p> </blockquote> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/gsnedders"><img src="https://avatars.githubusercontent.com/u/176218?v=4" />gsnedders</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p><code><math><td></code> should definitely create an element whose local name is <code>td</code> in the namespace <code>http://www.w3.org/1998/Math/MathML</code>. 
The td token is parsed according to the rules for parsing tokens in foreign content, and it's not one of the tokens that causes one to break out of foreign content.</p> <p>The integration points exist to support pre-existing structures that already exist (like <code>annotation-xml</code> in MathML/SVG); they have nothing to do with invalid elements.</p> <p>MathML and SVG within HTML can create invalid elements in the MathML/SVG namespaces, much like how their XML serializations can do so (at least with a non-validating parser).</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/aroben"><img src="https://avatars.githubusercontent.com/u/917945?v=4" />aroben</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <blockquote> <p>That is true in the html namespace not in the svg or mathml namespace where the parser spec falls back to parsing using xhtml rules.</p> </blockquote> <p>Can you point me to the part of the spec that says to use XHTML rules? I've read a decent bit of the parsing spec but definitely not all of it so I probably just missed it.</p> <blockquote> <p>Putting a mathml prefix on a td tag inside of the math tag in no way makes it a valid mathml tag of any kind.</p> </blockquote> <p>I'm not trying to claim that this situation results in a "valid mathml tag". My only point is that Gumbo's behavior in this situation matches the behavior of browsers (as demonstrated using the Live DOM Viewer and JSBin), and also matches my (likely flawed) understanding of the spec. 
If Gumbo really matches both browsers and the spec then it seems like it's doing the right thing.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/kevinhendricks"><img src="https://avatars.githubusercontent.com/u/8493752?v=4" />kevinhendricks</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Hi,</p> <blockquote> <p>On Feb 10, 2015, at 10:13 AM, Geoffrey Sneddon notifications@github.com wrote:</p> <math><td> should definitely create an element whose local name is td in the namespace http://www.w3.org/1998/Math/MathML. The td token is parsed according to the rules for parsing tokens in foreign content, and it's not one of the tokens that causes one to break out of foreign content. </blockquote> <p>But that namespace has no local tag called td and the mathml dtd disallows it specifically.</p> <p>So that is certainly news to me. Where exactly did you find that out? Would you please provide some reference in the spec for that? </p> <p>If that is really correct, then there is no real polyglot version of html5 code that would be safe in both html5 and xhtml5 worlds if they used any of this nonsense. 
You simply can not stuff any old html valid tags inside true xhtml svg and mathml worlds and expect it to work in any sane way outside of the integration point regions (such as fomatting tags after a <desc> tag and the like) and hope to create valid code</p> <blockquote> <p>The integration points exist to support pre-existing structures that already exist (like annotation-xml in MathML/SVG); they have nothing to do with invalid elements.</p> </blockquote> <p>From my reading I thought they determined the allowable escape points where valid html namespace tags can be used inside the mathml and svg namespaces.</p> <blockquote> <p>MathML and SVG within HTML can create invalid elements in the MathML/SVG namespaces, much like how their XML serializations can do so (at least with a non-validating parser).</p> </blockquote> <p>That is not how I read the spec but ... if that is the case. The epub3/epub2 world must fork gumbo just to keep some bit of sanity since most of the underlying tools are xml parsers and not html5 parser and we need a valid xhtml serialization to work with. </p> <p>Again, I would love to see some documentation anyplace that says it is valid to have a math:td tag when the mathml dtd explicitly forbids it, even inside the nonsense that is the html tag soup spec.</p> <p>Kevin</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/kevinhendricks"><img src="https://avatars.githubusercontent.com/u/8493752?v=4" />kevinhendricks</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Hi,</p> <blockquote> <p>That is true in the html namespace not in the svg or mathml namespace where the parser spec falls back to parsing using xhtml rules.</p> <p>Can you point me to the part of the spec that says to use XHTML rules? 
I've read a decent bit of the parsing spec but definitely not all of it so I probably just missed it.</p> </blockquote> <p>You can see it when it automatically allows non-void self-closing tags to be immediately popped and acknowledged. It is using the xtml parsing rules there not the html parsing rules there.</p> <blockquote> <blockquote> <p>Putting a mathml prefix on a td tag inside of the math tag in no way makes it a valid mathml tag of any kind.</p> </blockquote> <p>I'm not trying to claim that this situation results in a "valid mathml tag".</p> </blockquote> <p>Then why do you assign it to the mathml namespace? It doesn't exists by the very definition of that namespace.</p> <blockquote> <p>My only point is that Gumbo's behavior in this situation matches the behavior of browsers (as demonstrated using the Live DOM Viewer and JSBin), and also matches my (likely flawed) understanding of the spec. If Gumbo really matches both browsers and the spec then it seems like it's doing the right thing.</p> </blockquote> <p>That is an argument I can't fault. If all browsers treat it that way then I guess gumbo probably should too ... <em>but</em> isn't that kind of using the cart to justify the horse and visa-versa. Browsers are just implementing the html5 parsing spec as they read and understood it. Obviously, we both read the same spec and came to widely differing opinions as to what it was saying ... and if you look at the official syntax rules in the spec, it is as clear a "spec" as I have ever seen. 
Most specs I have dealt with in the past leave much more room for multiple interpretations that want to make you pull your hair out.</p> <p>There are many places in the spec, that I would like to change including how <script /> and <title /> are treated and what a mess the spec makes of them even when the intent of the user is quite clear while at the same time the spec seems to allow almost pure garbage to be acceptable.</p> <p>I need more tags to be recognized, I need to be able to identify true mathml and svg elements and have gumbo miraculously create a DOM tree that when serialized creates a valid xhtml document (ie. a valid xhtml serialization of html5). Since there are already around 150 recognized tags, adding svg and mathml tags brings us to around 255 tags. Given its size, I do not want to search through tag name string lists linearly to assign enums, and I don't want to pass varargs around with up to 79 elements in them and then linearly walk them looking for matches.</p> <p>Thus my changes. I understand they are not right for gumbo. I closed my pull request as soon as I saw the official response.</p> <p>That said .. IMHO gumbo really quickly needs to do the following:</p> <ol> <li>ignore all backwards compatibility issues as they are meaningless and easy to fix and will just hinder rapid development</li> <li>fix the known parsing bugs - especially those that cause gumbo to assert and abort first</li> <li>expand the set of recognized tags so we don't have lots of unknowns flying around</li> <li>incorporate some efficient way to deal with tag searches (with or without namespaces) and with no varargs given the increase in tag size (needed for number 3 above)</li> <li>decide on an error interface (any interface at all as long as it has line numbers, column numbers and some error msg) and get it out there for people to play with and use</li> <li>get fragment parsing working</li> </ol> <p>... 
and my personal wish list</p> <ol> <li>add an editor interface that will allow the gumbo node tree to be directly edited after parsing (I have some rough code that will do this that I am working on. Of course, it uses non-public interfaces same as my error reporting code does).</li> </ol> <p>It was an enjoyable discussion! I have learned something that html soupiness has even invaded the solid dtd specc'd world of svg and mathml allowing some strange way of adding tags to namespaces when the dtd says they don't exist!</p> <p>Take care,</p> <p>Kevin</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/gsnedders"><img src="https://avatars.githubusercontent.com/u/176218?v=4" />gsnedders</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>HTML doesn't defer to XML in any way for parsing the text/html serialisation format (including within foreign content). Note both HTML and XML parsers can create invalid trees. See the section entitled "Validating and Non-Validating Processors" in the XML spec. A non-validating XML parser has no problem parsing:</p> <pre><code class="language-xml"><?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE foo [ <!ELEMENT foo (#PCDATA)> ]> <bar>Hello, world!</bar></code></pre> <p>As such, I can write a document as follows:</p> <pre><code class="language-xml"><!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/mathml2.dtd"> <math xmlns="http://www.w3.org/1998/Math/MathML"> <foo/> <html:td xmlns:html="http://www.w3.org/1999/xhtml"/> </math></code></pre> <p>This is invalid, per the DTD. 
But a non-validating XML parser has no issue with parsing this, and will create a tree with a <code>foo</code> element in the MathML namespace followed by a <code>td</code> element in the HTML namespace, both within the <code>math</code> root element in the MathML namespace.</p> <p>Note that per the DTD even the following is invalid:</p> <pre><code class="language-xml"><!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/mathml2.dtd"> <math:math xmlns:math="http://www.w3.org/1998/Math/MathML"> </math:math></code></pre> <p>…because DTDs don't have any concept of namespaces, you can't bind the namespaces to arbitrary prefixes.</p> <p>As such, just because the (non-validating) XML parser produces a tree doesn't mean it's valid. (Almost all XML parsers are non-validating.) The same is true of HTML. Plenty of inputs to the HTML parser create trees that are invalid, but the parser creates them nevertheless.</p> <p><code><math><td></code> is an example of such an invalid tree — but one created by the parser nevertheless. If you look at the tree construction section (if you're willing to take on trust from me that that sequence produces three tokens: a start tag token whose tag name is "math", a start tag token whose tag name is "td", and an EOF token; if not, you can follow that through the tokenizer section!), you'll see the first token goes through a ton of states triggering numerous parse errors ("The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.") while it builds a tree like:</p> <pre><code>| html | head | body</code></pre> <p>Only at this point, now in the "in body" insertion mode, does it actually start to properly process the token. 
It hits the 'start tag whose tag name is "math"' case, which inserts a <code>math:math</code> element, such that the tree looks like the following:</p> <pre><code>| html | head | body | math math</code></pre> <p>We then get to the <code>td</code> element. Then looking at the 'tree construction dispatcher' we can see that we 'process the token according to the rules given in the section for parsing tokens in foreign content'. This falls through to the generic 'any other start tag' case, and then we 'insert a foreign element for the token, in the same namespace as the adjusted current node'. As such, we insert a <code>math:td</code> leaving us with the tree as follows:</p> <pre><code>| html | head | body | math math | math td</code></pre> <p>And the EOF token changes nothing. The fact the parser created the tree makes no statement about its validity (it is, for several reasons, invalid).</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/gsnedders"><img src="https://avatars.githubusercontent.com/u/176218?v=4" />gsnedders</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Also note that creating a conformance checker, that checks validity constraints such as the content model of elements, is a fairly significant project. If you're willing to require the input to the valid HTML, you can likely just use the <a href="https://validator.github.io/validator/">Nu Markup Checker</a> (in Java, but the only existing HTML conformance checker) to parse HTML, check it is valid, then serialize it as XML. 
Actually converting arbitrary input to valid HTML is an even larger project.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/kevinhendricks"><img src="https://avatars.githubusercontent.com/u/8493752?v=4" />kevinhendricks</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Hi,</p> <p>Interesting to say the least! I thought gumbo would attempt to build as valid a dom tree as possible from the soup. When faced with a Math start tag in the middle of a table and then a tr, I would hope it closes the math tag (pops it) immediately since the td can not exist in that namespace. You are saying the spec actually says "don't pop the mathml start tag" and happily add children to it even though you know it is will result in an tree that has non-existent elements/ns combinations. My solution enforced the first approach (which is how I interpreted the spec) but you are saying the spec clearly says my interpretation is wrong and not the correct way to build the tree.</p> <p>Isn't that it in a nutshell? </p> <p>If so, then I agree, you must carry along the namespace ids when doing testing so a "pseudo" math:td element should not match as true against a true td outside of the mathml sphere of influence (so to speak).</p> <p>Thanks for driving that home. I have studied the html5 syntax rules many times and did not pick that up, probably because I have lived in an xml / xhtml world with very strict rules.</p> <p>The reason I like gumbo is that it survives when handed garbage even better that BS4 and Tidy can do (without hacking out the html5 parts at the same time). In an ebook editor, we often get neophytes editing xhtml and ending up with crap that needs to be auto-cleaned without losing content to the extent possible. 
Tidy was killing us in that regard, but the html5 spec seemed to greatly help that and so will gumbo.</p> <p>Thanks!</p> <p>Kevin</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/kevinhendricks"><img src="https://avatars.githubusercontent.com/u/8493752?v=4" />kevinhendricks</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Hi,</p> <p>Luckily, we (the ebook world) have the official epubcheck (versions for 2 and 3) and flightcrew (for epub2) that deal with actual validity checking. Our biggest need is to take a poorly edited xhtml document (we are an ebook editing project and people editing an ebook often make some big mistakes!) and get it well-formed enough to actually load in a Qt QWebView widget to show a preview of what the code does in some sane way without losing any markup. Since the QWebEngine and QWebView are basically browser engines, they should interpret the trees much like gumbo builds them.</p> <p>Thanks,</p> <p>Kevin</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/nostrademons"><img src="https://avatars.githubusercontent.com/u/16583?v=4" />nostrademons</a> commented <strong> 9 years ago</strong> </div> <div class="markdown-body"> <p>Gumbo's scope & mission, as a project, are to provide an implementation of the HTML5 parsing algorithm, available as a pure C library, that prioritizes correctness > simplicity > features > performance. It is intended to serve as a <em>base</em> for other libraries and tools, and not the tool itself.</p> <p>I chose this largely because there are a number of alternatives out there that fill other niches and make different trade-offs. If you don't have tag soup or the possibility of invalid tags, use an XML parser. If you want correctness and performance but don't care about simplicity, use <a href="http://www.webkit.org/">Webkit</a> or <a href="http://www.chromium.org/blink">Blink</a>. If you want performance and simplicity but don't care about correctness or source positions, use a callback-based parser like <a href="http://www.netsurf-browser.org/projects/hubbub/">Hubbub</a> or <a href="https://github.com/servo/html5ever">html5ever</a>. If you're working in Java there's <a href="https://about.validator.nu/htmlparser/">validator.nu</a>, or in Python there's <a href="https://github.com/html5lib">html5lib</a>.</p> <p>A lot of the more controversial design decisions follow from that. Gumbo does not attempt to validate content models or perform other conformance-checking functions because, ironically, it would make it impossible to build a conformance checker, linter, or refactoring tool if it did. 
If Gumbo "corrected" the HTML automatically, then such a tool would never see the invalid HTML, and wouldn't be able to report or correct it. Similarly, it wouldn't be possible to do things like write MapReduces to report the prevalence of invalid HTML constructs on the web, or collect statistics on how often users input them. If I were to fold all of these into the error-reporting mechanism of the parser, then Gumbo becomes a monolithic framework like Webkit, and it's unlikely that it would serve <em>anyone's</em> precise needs.</p> <p>Similarly, Gumbo does not provide a mutable parse tree because there is no way for it to do so and still satisfy all of the use-cases where it may be helpful. In particular, memory management becomes very difficult when any node may be pulled out of the parse tree and held onto as a reference, or when new nodes may be created and inserted. Who owns the new & removed nodes? In a scripting language, the nodes should be on the language's heap so they can be garbage-collected. In a browser, you may want to reference count them, or they may be attached to the page's lifetime, or you may want to tie them into your JS engine's GC. In a command-line refactoring tool, you may just want to put everything on an arena and free them all at once.</p> <p>The intent here is for individual tool authors to write the behavior that they need, using Gumbo to solve the specific problem of "the HTML5 parsing algorithm is a pain in the neck to implement and needs to be robust to all sorts of invalid input". Gumbo takes care of the parsing and gives you back the same parse tree as a browser will (along with source locations and parse flags to indicate <em>what</em> the parser did to get it to that state); it's up to you to deal with any incompatibilities between what a browser does and your particular use case. 
This is also why backwards-compatibility is so important: with a number of tools built on top of Gumbo's API, any change to it may break a substantial amount of code.</p> <p>I don't know much about the epub2 and epub3 specs, but it sounds like your best bet is to use Gumbo to convert the tag soup inputted by authors into a serialized XHTML form, and then use existing XML-based tools to verify the DOM and feed it into Qt webviews. We're willing to make any fixes necessary to make Gumbo into a better HTML5 parser, but making it into a conformance checker or having it generate a parse tree that is not what browsers would generate is explicitly out of scope for it.</p> </div> </div> <script src="https://cdn.jsdelivr.net/npm/jquery@3.5.1/dist/jquery.min.js"></script> <script src="/githubissues/assets/js.js"></script> <script src="/githubissues/assets/markdown.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/highlight.min.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/languages/go.min.js"></script> <script> hljs.highlightAll(); </script> </body> </html>