kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0
678 stars 33 forks source link

handle_in_foreign_content whatwg spec causes fragment parsing loop infinitely when a mathml integration point exists at key points in the fragment #29

Closed kevinhendricks closed 1 year ago

kevinhendricks commented 1 year ago

I have been updating sigil-gumbo and ran into this as well.

The "solution" appears to be to replace this snippet from parser.c in handle_in_foreign_content:

    /* Parse error */
    parser_add_parse_error(parser, token);

 GumboNode *current_node;
    while ((current_node = get_current_node(parser)) && !(
                is_mathml_integration_point(current_node) ||
                is_html_integration_point(current_node) ||
                current_node->v.element.tag_namespace == GUMBO_NAMESPACE_HTML
    )) {
        if (!pop_current_node(parser)) break;
    }

    parser->_parser_state->_reprocess_current_token = true;
    return false;
  }

with this:

    /* Parse error */
    parser_add_parse_error(parser, token);

    while( !(is_mathml_integration_point(get_current_node(parser)) ||
             is_html_integration_point(get_current_node(parser)) ||
             get_current_node(parser)->v.element.tag_namespace == GUMBO_NAMESPACE_HTML) ) {

        pop_current_node(parser);
    }
     return handle_html_content(parser, token);
  }

This prevents the reprocessing from ending up back in handle_in_foreign_content causing the infinite loop stuck on a mathml integration point that will never get popped.

Instead or reprocessing it uses handle_html_content and returns it return value.

It effectively just uses the same approach as lower down in that same routine as shown here:

    // We can't call handle_token directly because the current node is still in
    // the SVG namespace, so it would re-enter this and result in infinite
    // recursion.
    return handle_html_content(parser, token) && is_success;

With these changes in place sigil-gumbo now passes all of the html5lib tests (including fragment tests) for tree-construction in TODAYS html5lib testing master (once they fixed that silly template mistake they made).

Hope this helps.

kovidgoyal commented 1 year ago

That still causes two tests to fail

======================================================================
FAIL: test_foreign-fragment_2 (test.html5lib_adapter.ConstructionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kovid/work/html5-parser/test/html5lib_adapter.py", line 164, in test_func
    return self.implementation(inner_html, html, expected, errors, test_name)
  File "/home/kovid/work/html5-parser/test/html5lib_adapter.py", line 207, in implementation
    self.ae(expected, output, error_msg + '\n')
AssertionError: '| <font>\n|   color=""\n| "X"' != '| <font>\n|   color=""'
  | <font>
- |   color=""
?             -
+ |   color=""- | "X" : 

Test name:
foreign-fragment

Input:
<font color></font>X

Expected:
| <font>
|   color=""
| "X"

Received:
| <font>
|   color=""

======================================================================
FAIL: test_foreign-fragment_3 (test.html5lib_adapter.ConstructionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kovid/work/html5-parser/test/html5lib_adapter.py", line 164, in test_func
    return self.implementation(inner_html, html, expected, errors, test_name)
  File "/home/kovid/work/html5-parser/test/html5lib_adapter.py", line 207, in implementation
    self.ae(expected, output, error_msg + '\n')
AssertionError: '| <svg font>\n| "X"' != '| <svg font>'
- | <svg font>
?             -
+ | <svg font>- | "X" : 

Test name:
foreign-fragment

Input:
<font></font>X

Expected:
| <svg font>
| "X"

Received:
| <svg font>
kevinhendricks commented 1 year ago

I am not seeing any failures with my local sigil-gumbo version. I see you found and fixed your remaining issue be serializing the node tail if present.