untitaker / html5gum

A WHATWG-compliant HTML5 tokenizer and tag soup parser

html5gum

html5gum is a WHATWG-compliant HTML tokenizer.

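use std::fmt::Write;
use html5gum::{Tokenizer, Token};

// Input with extra whitespace inside the start tag.
let html = "<title   >hello world</title>";
let mut new_html = String::new();

// Reading from a string slice cannot fail, so `infallible()` lets us
// iterate over tokens directly instead of over `Result`s.
for token in Tokenizer::new(html).infallible() {
    match token {
        // Tag names and text are raw bytes, hence the lossy UTF-8 conversion.
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", String::from_utf8_lossy(&hello_world)).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

// The re-serialized output no longer contains the stray whitespace.
assert_eq!(new_html, "<title>hello world</title>");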

What a tokenizer does and what it does not do

html5gum fully implements section 13.2.5 (Tokenization) of the WHATWG HTML spec: it can tokenize HTML documents, and it passes html5lib's tokenizer test suite. Since it is just a tokenizer, this means:

With those caveats in mind, html5gum can pretty much ~parse~ tokenize anything that browsers can.

The Emitter trait

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

See the custom_emitter example for what this looks like in practice.
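
To make the idea of hooking into token creation more concrete, here is a minimal, self-contained sketch of the general pattern. It does not use html5gum at all: `ToyEmitter`, `toy_tokenize`, `TagCounter` and `TextCollector` are hypothetical names invented for this illustration, and the toy tokenizer only understands `<name>`, `</name>` and plain text. The real Emitter trait has a different and larger set of methods, so refer to the crate documentation for its actual signatures.

```rust
/// Hypothetical, heavily simplified stand-in for an emitter-style trait.
/// This is NOT html5gum's `Emitter` trait; it only illustrates the pattern.
trait ToyEmitter {
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    fn text(&mut self, text: &str);
}

/// Emitter that only counts start tags and never builds a token value.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
}

impl ToyEmitter for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.start_tags += 1;
    }
    fn end_tag(&mut self, _name: &str) {}
    fn text(&mut self, _text: &str) {}
}

/// Emitter that keeps text content and ignores all tags.
#[derive(Default)]
struct TextCollector {
    text: String,
}

impl ToyEmitter for TextCollector {
    fn start_tag(&mut self, _name: &str) {}
    fn end_tag(&mut self, _name: &str) {}
    fn text(&mut self, text: &str) {
        self.text.push_str(text);
    }
}

/// Toy tokenizer (assumption: no attributes, comments, or entities).
/// It reports what it finds to the emitter instead of returning tokens.
fn toy_tokenize<E: ToyEmitter>(input: &str, emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            match after_lt.find('>') {
                Some(end) => {
                    let inner = after_lt[..end].trim();
                    if let Some(name) = inner.strip_prefix('/') {
                        emitter.end_tag(name.trim());
                    } else {
                        emitter.start_tag(inner);
                    }
                    rest = &after_lt[end + 1..];
                }
                None => {
                    // Unterminated tag: treat the remainder as text.
                    emitter.text(rest);
                    rest = "";
                }
            }
        } else {
            // Plain text runs until the next '<' or the end of input.
            let end = rest.find('<').unwrap_or(rest.len());
            emitter.text(&rest[..end]);
            rest = &rest[end..];
        }
    }
}

fn main() {
    let mut counter = TagCounter::default();
    toy_tokenize("<p>a</p><p>b</p>", &mut counter);
    println!("start tags: {}", counter.start_tags);

    let mut collector = TextCollector::default();
    toy_tokenize("<title   >hello world</title>", &mut collector);
    println!("text: {}", collector.text);
}
```

The point of the design is that the tokenizer decides when something is recognized, while the emitter decides what, if anything, is stored. A counter like `TagCounter` never allocates a token, and `TextCollector` keeps only the data it cares about.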

Other features

Alternative HTML parsers

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

Etymology

Why is this library called html5gum?

License

Licensed under the MIT license, see ./LICENSE.