rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0

Finish the fragment parser implementation #97

Closed: stevecheckoway closed this 6 years ago

stevecheckoway commented 6 years ago

It's not enough to parse a fragment based on a known tag and namespace. The five pieces of information required are

  1. the context element tag name;
  2. the context element tag namespace;
  3. the value of the context element's encoding attribute (when the element is a MathML annotation-xml element);
  4. the quirks mode of the host document; and
  5. the form element pointer.

The encoding attribute of an annotation-xml context element determines if the content should be parsed as HTML or as foreign elements. See https://html.spec.whatwg.org/multipage/parsing.html#html-integration-point
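
As a rough illustration of the rule in the linked section, the check could be sketched in Ruby as below. The method name `html_integration_point?` and its arguments are made up for this example; they are not part of nokogumbo or Gumbo.

```ruby
# Hypothetical helper, not nokogumbo API: decides whether a context element
# is an HTML integration point as described in the HTML Standard.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
SVG_NS = "http://www.w3.org/2000/svg"

def html_integration_point?(name, namespace, encoding = nil)
  case namespace
  when MATHML_NS
    # Only annotation-xml qualifies, and only with an HTML-ish encoding.
    return false unless name == "annotation-xml" && encoding
    # The comparison is ASCII case-insensitive.
    %w[text/html application/xhtml+xml].include?(encoding.downcase(:ascii))
  when SVG_NS
    %w[foreignObject desc title].include?(name)
  else
    false
  end
end
```

With this sketch, an annotation-xml context whose encoding is `text/html` would have its content parsed as HTML, while one with no encoding attribute (or any other value) would have it parsed as foreign elements.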

In addition to specific public and system identifiers, a broken DOCTYPE declaration can put the document in quirks mode. libxml2 has no way to record the force-quirks flag (https://html.spec.whatwg.org/multipage/parsing.html#force-quirks-flag). Fortunately, the quirks mode plays very little role in parsing.

Finally, if the fragment context is a form element (or has one as an ancestor), then <form> and </form> tags (among other things) are parse errors and the tags are ignored.
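
A minimal sketch of how the form element pointer might be derived from the context element, walking up the ancestor chain and including the context element itself. The `Node` struct and `form_element_pointer` are stand-ins invented for this example, not the real Nokogiri or Gumbo types.

```ruby
# Hypothetical stand-in for a DOM node; real code would use Nokogiri nodes.
Node = Struct.new(:name, :namespace, :parent)

HTML_NS = "http://www.w3.org/1999/xhtml"

# Returns the nearest HTML form element at or above the context element,
# or nil when there is none.
def form_element_pointer(context)
  node = context
  while node
    return node if node.name == "form" && node.namespace == HTML_NS
    node = node.parent
  end
  nil
end
```

When this returns a non-nil node, the fragment parser would start with its form element pointer already set, which is what causes a nested `<form>` start tag to be ignored.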

stevecheckoway commented 6 years ago

@craigbarnes This has some fixes to quirks mode detection that I probably should have split out. The public/system identifiers should always be compared case-insensitively to the strings in the list. The only difference is that some of them need only match as a prefix, while the rest must match exactly.
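
To make the exact-versus-prefix distinction concrete, here is a small Ruby sketch. The constant and method names are invented for illustration, and the identifier lists are a deliberately abbreviated subset of the ones in the HTML Standard, not the full tables.

```ruby
# Abbreviated, illustrative subsets of the quirks-mode public identifiers.
# Both lists are lowercased once so every comparison is case-insensitive.
EXACT_QUIRKS_PUBLIC_IDS = [
  "-//W3O//DTD W3 HTML Strict 3.0//EN//",
  "-/W3C/DTD HTML 4.0 Transitional/EN",
  "HTML",
].map { |s| s.downcase(:ascii) }.freeze

PREFIX_QUIRKS_PUBLIC_IDS = [
  "-//IETF//DTD HTML 2.0//",
  "-//W3C//DTD HTML 3.2 Final//",
].map { |s| s.downcase(:ascii) }.freeze

# Hypothetical helper: true when the public identifier alone triggers quirks mode.
def quirks_public_id?(public_id)
  return false if public_id.nil?
  id = public_id.downcase(:ascii)
  EXACT_QUIRKS_PUBLIC_IDS.include?(id) ||
    PREFIX_QUIRKS_PUBLIC_IDS.any? { |prefix| id.start_with?(prefix) }
end
```

Lowercasing both sides up front means the exact and prefix lists share one comparison path, so neither can accidentally be matched case-sensitively.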

craigbarnes commented 6 years ago

@stevecheckoway Thanks for the heads up. I still need to merge a few of your other patches before this one, I think.