Closed kevinhendricks closed 9 years ago
I am happy to e-mail the part-1 patch (using binary search and much larger recognized tag enum) to anyone who wants it for review as my own gumbo-parser fork on github has a number of other changes you will not want (working with xhtml parsing, error reporting, etc). Just let me know where to send it if anyone is interested.
Hi, I committed it to my fork so that the commit would show up and you could more easily decide if this is in any way the direction you would like to go.
https://github.com/kevinhendricks/gumbo-parser/commit/ae3b57af872a1eb9788f444a9f7de279f09e9b65
And here is what a parser modified to replace varargs calls with const int sorted lists for tag_in calls looks like. A similar approach can be used for the remaining use of varargs for the in_scope calls as well.
https://github.com/kevinhendricks/gumbo-parser/commit/4882c85ab0ba033d6559ee389e923b8c16732a21
That version has some of my changes for XHTML5 parsing but I would be happy to send a patch against a clean version from today's master if anyone is interested.
Kevin
I'm a little concerned about the impact of this on other language bindings, many of which provide enums for GumboTag that are defined in the scripting language. At the very least, the Python bindings would have to be updated. @craigbarnes, @rubys, @karlwestin, @rgrove, would changing the values of the GumboTag mess things up for the Lua, Ruby, Node bindings or for Sanitize? Any other bystanders have concerns? I do like the idea of the tag lists being more complete, but I don't want to make unnecessary work for the 10+ other language bindings, and I suspect that SVG/MathML parsing is a relatively niche use case compared to HTML and don't want to bloat the API for common usages.
I don't like the usage of numbered const int lists in the other patch. It's important that it be possible to follow along with the spec while reading the code, and this separates the declaration of the list from its usage. I would rather pass a bitset by value and construct it with some macro magic than have to jump between file locations to read every clause of the spec.
It looks to me like Nokogumbo only relies on gumbo_normalized_tagname()
, so I think it would be unaffected (and by extension Sanitize would be unaffected).
Understood about wanting to keep the lists right in the code. I did not think you could pass an initializer as an argument, so I decided to move them to the top of each routine to try to keep them as close as possible to their use point. I will try to see if there is any macro definition that would help in this case.
My use case is epub2 and epub3 so svg is common to expand cover images and mathml is required for the epub3 spec. I was able to fix your mismatched ns tag case where mathml is used in an html table using an extra if to check if that tag really was part of svg or mathml before making it part of those namespaces in handle_token. There are no mathml td or tr elements. So it is easy to detect.
So doing binary searches an a flat enum of tags is a viable solution. Our project Sigil will use that approach since we need xhtml parsing of non-void but self-closed tags as well and so must maintain our own version of gumbo anyway.
Thanks for looking at it.
Kevin
As it turns out you can pass an initializer list in an argument in C99 if you cast it properly.. In other words, you could do the following in an argument to a fuction that expects an int pointer in that position:
tag_in(token, (int []) { TAG1, TAG2, TAG3, TAG4 }, cnt);
and
nodetag_in(node, (int []) { TAG1, TAG2, TAG3, TAG4 }, cnt);
I will give that a try for our own version.
FWIW, here is the commit to putting back the now sorted taglists back where they belong oin the code using the C99 initializer as argument approach:
https://github.com/kevinhendricks/gumbo-parser/commit/ea248f03bcbf6ce272e8269198add05dda00a138
I've thought a bit more about this in context of some of the other upcoming patches that will likely go in before 1.0.
I do want to extend the tag list to include common SVG and MathML tags. It'll make the parser more complete and a bit easier to use for future projects that wish to use this. Plus, a couple new tags (notably
However, I want to honor backwards-compatibility promises and make migration as easy as possible. A quick survey of the language bindings that I know of showed that D, Lua, C#, and Python were dependent upon the enum list. Most of these copied source code in directly (often, their FFI is structured as a C parser that automatically extracts the language binding information from a C header file), so they won't immediately break, but I'd still like to minimize arbitrary migration work. So there are a couple restrictions on how new tags can go in:
@aroben's patch #286 is going in, so you probably don't want to do major surgery to any of the node_tag_in stuff until after the HTML_QN stuff goes in. It's not sufficient (in HTML5) to detect HTML vs. SVG vs. MathML in the tokenizer, because the same tag may be in a different namespace depending on where in the DOM it is. For example,
Regarding nokogumbo, I'm not overly worried about breaking changes. If you make such a change, I'll keep up. But since you are planning future releases, fragment parsing and error reporting items from the wish list are both of interest.
FWIW, personally, I would not worry about downstream breakage at all. It is a trivial fix period. It is better to get it updated now imho. Hell it took me all of 5 seconds to update the gumboc.py code.
And if you look at the patch, it does examine the current node of the tree in the DOM in parser.c in handle_token. It simply checks to see if a tag on the token could actually be part of svg or mathml namespaces only after all other tests are done before assigning it to be handled as foreign content. Svg and mathml are xml based namespaces and not html soup. They have a clearly defined set of tags period. The parser should never have created a math:td node in the failing testcase since td is not a legal tag in the svg or mathml namespaces. This is easy to check and prevents pure html tags from being pushed incorrectly into foreign content. As for tag overlap, there are only 5 tags that exist in common in the svg and html namespaces. They are "a, title, script, font, style". If you read the html5 spec it is written understanding that overlap. For example see how the spec checks the attributes of font to see if it is part of html or svg. The same goes with how title is used in the spec and script and style. The spec itself was written understanding those specific cases. Once they are handed, the tag name is enough to determine if it is legal in the svg or mathml namespaces.
So would you please cite one instance in the spec where the tag itself is not enough information given how the conditions are written? I could not find any but I must be missing it. Even the integration points code properly handles cases in mathml and svg when an html tag may happen. The spec never says check mo in the mathml namespace, it simply says to check the tag is mo.
And, fwiw, it is simply not good coding to use an uintptr type for an enum just so a int value can be stored in a void pointer in a gumbo vector to generate a self-growing list of what are effectively ints. There is no type safety in that. In C in general with all of its void pointers flying around there is no real type safety but making an enum a pointer type just to enable it to be cast to a void* and back is truly not a good idea.
As for correctness vs speed, you are right, correctness should come first, Correctness should also trump any API concerns as well and, backwards compatibility should not be a concern until correctness is reached. All of that said, it makes no sense to ignore simple speedups the don't impact correctness.
Thank you for at least considering the patch. I will of course split out the simple other fixes you want and pass them along as separate pull requests once I get a free moment. We (Sigil) will simply go with our fork of gumbo until these issues are addressed on your side in some way. I do encourage you to dismiss your backwards compatibility concerns on a pre 1.0 version of a library whose spec is constantly changing underneath you since the downstream fixes are trivial and simply not worth worrying about. Instead, I would focus on releasing early and often with quick improvements and bug fixs .... but that is of course your call not mine.
Take care,
Kevin
The parser should never have created a math:td node in the failing testcase since td is not a legal tag in the svg or mathml namespaces.
I don't think this is quite right. I believe a math:td
node is actually what the HTML parsing spec expects in this case. You can see this in the Live DOM Viewer, which represents elements in the HTML namespace with uppercase tag names and elements in other namespaces with lowercase tag names. Or you can see it in this JSBin.
Again, td does not exist in the namespace for mathml. Check out its DTD (http://www.w3.org/Math/DTD/Overview.html). The html5 spec only uses Presentation MathML which is even a subset of that DTD. If you look closely you will see Presentation Mathml has it own table elements mtr, and mtd. So unless you are rewriting MathML's DTD on the fly, the tag "math:td" simply does not exist and can never be valid mathml.
On Feb 10, 2015, at 9:26 AM, Adam Roben notifications@github.com wrote:
The parser should never have created a math:td node in the failing testcase since td is not a legal tag in the svg or mathml namespaces.
I don't think this is quite right. I believe a math:td node is actually what the HTML parsing spec expects in this case. You can see this in the Live DOM Viewer, which represents elements in the HTML namespace with uppercase tag names and elements in other namespaces with lowercase tag names. Or you can see it in this JSBin.
— Reply to this email directly or view it on GitHub.
I definitely believe that it results in invalid MathML. But it's possible to create all kinds of non-conforming documents using HTML. It's how browsers behave, and so it's how the spec is written.
Hi,
That is true in the html namespace not in the svg or mathml namespace where the parser spec falls back to parsing using xhtml rules. Putting a mathml prefix on a td tag inside of the math tag in no way makes it a valid mathml tag of any kind. There are places you can mix in valid html tags into both svg and mathml namespace (but still valid html tags) which is why the spec has all of the "integration point" rules.
Kevin
On Feb 10, 2015, at 10:01 AM, Adam Roben notifications@github.com wrote:
I definitely believe that it results in invalid MathML. But it's possible to create all kinds of non-conforming documents using HTML. It's how browsers behave, and so it's how the spec is written.
— Reply to this email directly or view it on GitHub.
<math><td>
should definitely create an element whose local name is td
in the namespace http://www.w3.org/1998/Math/MathML
. The td token is parsed according to the rules for parsing tokens in foreign content, and it's not one of the tokens that causes one to break out of foreign content.
The integration points exist to support pre-existing structures that already exist (like annotation-xml
in MathML/SVG); they have nothing to do with invalid elements.
MathML and SVG within HTML can create invalid elements in the MathML/SVG namespaces, much like how their XML serializations can do so (at least with a non-validating parser).
That is true in the html namespace not in the svg or mathml namespace where the parser spec falls back to parsing using xhtml rules.
Can you point me to the part of the spec that says to use XHTML rules? I've read a decent bit of the parsing spec but definitely not all of it so I probably just missed it.
Putting a mathml prefix on a td tag inside of the math tag in no way makes it a valid mathml tag of any kind.
I'm not trying to claim that this situation results in a "valid mathml tag". My only point is that Gumbo's behavior in this situation matches the behavior of browsers (as demonstrated using the Live DOM Viewer and JSBin), and also matches my (likely flawed) understanding of the spec. If Gumbo really matches both browsers and the spec then it seems like it's doing the right thing.
Hi,
On Feb 10, 2015, at 10:13 AM, Geoffrey Sneddon notifications@github.com wrote:
Hi, I have modified gumbo-parser to expand the TagNames to include the full set of presentation elements in MathML and the full SVG tag list. (according to the latest spec). Given the size of the new tag list, a binary search is needed to quickly assign GumboTag enums.
But since we now can work with sorted lists of integers and complete sets of SVG and MathML tags we can easily use binary search on sorted integer lists to detect if a tag is a valid MathML tag (ie. part of the presentation subset) or a valid SVG tag.
With this capability in place we can convert all of the calls to tag_in in parser.c and its related cousins to properly handle any C pre-processor determined list of tags (no more varargs needed) and to use binary search to determine if a tag is in that list.
With this and the capability to quickly determine if something is a valid SVG or MathML element, you could easily modify the parser to properly prevent the confusion that comes when mislabling html tags as mathml or svg tags. which caused a number of issues.
I have completed part one of this where I modify src/tag.c, add src/tag.h, and update the gumbo.h tag enum or expand its contents to properly match the list in tag.c.
Before I go to the trouble of converting all of the varargs calls to tag_in and its relatives in parser.c to use pre-determined tag lists (pre-sorted), I was hoping someone would look at the part 1 patch and let me know if they are at all interested.