servo / html5ever

High-performance browser-grade HTML5 parser

html5ever parses document in unexpected way #513

Closed kod-kristoff closed 1 year ago

kod-kristoff commented 1 year ago

I noticed that html5ever parses this document in an unexpected way:

<html>
 <head></head>
 <body>
  <p>one
   <p>two</p>
   three
  </p>
 </body>
</html>

When I run the following example:

use markup5ever_rcdom as rcdom;
use rcdom::{NodeData, RcDom};
use html5ever::tendril::TendrilSink;

fn main() {
    let source = "<html><head></head><body><p>one<p>two</p>three</p></body></html>";
    let dom: RcDom =
        html5ever::driver::parse_document(RcDom::default(), Default::default()).one(source);

    // Do some processing
    let doc = &dom.document;
    let root = &doc.children.borrow()[0];
    print_tree(root, 0);

    if !dom.errors.is_empty() {
        println!("\nParse errors:");
        for err in dom.errors.iter() {
            println!("    {}", err);
        }
    }
}

fn print_tree(node: &rcdom::Handle, level: usize) {
    let padding = format!("{empty: >width$}", empty = "", width = level);
    match &node.data {
        // Only the element name is needed here; ignore the other fields.
        NodeData::Element { name, .. } => {
            println!(
                "{padding}<{}> num_children={}",
                &name.local,
                node.children.borrow().len()
            );
            for child in node.children.borrow().iter() {
                print_tree(child, level + 1);
            }
            println!("{padding}</{}>", &name.local);
        }
        NodeData::Text { contents } => println!("{padding}{}", contents.borrow().as_ref()),
        // Skip other node kinds (comments, doctypes, etc.) instead of panicking.
        _ => {}
    }
}

This outputs

<html> num_children=2
 <head> num_children=0
 </head>
 <body> num_children=4
  <p> num_children=1
   one
  </p>
  <p> num_children=1
   two
  </p>
  three
  <p> num_children=0
  </p>
 </body>
</html>

Parse errors:
    Unexpected token
    No <p> tag to close

but I expected "three" to be contained in the outer <p> element, like below:

<html> num_children=2
 <head> num_children=0
 </head>
 <body> num_children=1
  <p> num_children=3
   one
   <p> num_children=1
    two
   </p>
   three
  </p>
 </body>
</html>

roelandvanbatenburg commented 1 year ago

https://stackoverflow.com/a/12015809/1014666 explains why you cannot nest paragraphs.
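
The reported tree follows from two rules of the HTML parsing algorithm: a <p> start tag implicitly closes any <p> that is still open, and a </p> end tag with no open <p> is a parse error that produces an empty <p> element. The sketch below is a toy model of only those two rules, not html5ever's actual tree builder. The names Token, Node, and build_body are hypothetical and exist only for illustration, and the model assumes the body contains nothing but <p> tags and text.

```rust
/// Simplified tokens: only what the example document's <body> contains.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    StartP,
    EndP,
    Text(String),
}

/// A direct child of <body> in this toy model (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    /// A <p> element holding its text children.
    P(Vec<String>),
    /// Text placed directly inside <body>.
    Text(String),
}

/// Builds the children of <body> using the two rules described above.
/// Assumption: only <p>, </p>, and text tokens occur.
fn build_body(tokens: &[Token]) -> Vec<Node> {
    let mut body: Vec<Node> = Vec::new();
    // True while the most recently pushed <p> is still open.
    let mut open_p = false;

    for token in tokens {
        match token {
            Token::StartP => {
                // Rule 1: an open <p> is implicitly closed, so the new <p>
                // becomes a sibling rather than a child.
                open_p = true;
                body.push(Node::P(Vec::new()));
            }
            Token::EndP => {
                if !open_p {
                    // Rule 2: a stray </p> is a parse error; an empty <p>
                    // is inserted and closed immediately.
                    body.push(Node::P(Vec::new()));
                }
                open_p = false;
            }
            Token::Text(text) => {
                if open_p {
                    // While a <p> is open it is always the last node pushed.
                    if let Some(Node::P(children)) = body.last_mut() {
                        children.push(text.clone());
                    }
                } else {
                    body.push(Node::Text(text.clone()));
                }
            }
        }
    }
    body
}
```

Feeding it the tokens for <p>one<p>two</p>three</p> yields four body children: a <p> containing "one", a <p> containing "two", the bare text "three", and an empty <p>. That is the same shape as the html5ever output in the report, and the stray final </p> corresponds to the "No <p> tag to close" parse error. If nesting is really needed, a container such as <div> can hold paragraphs, since <div> is not implicitly closed by a <p> start tag.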