rhdunn / cainteoir-engine

The Cainteoir Text-to-Speech core engine
http://reecedunn.co.uk/cainteoir/
GNU General Public License v3.0

HTML processing should use a HTML to XML parser #31

Open rhdunn opened 11 years ago

rhdunn commented 11 years ago

Due to HTML quirks, the processing for HTML and XHTML content (including HTML without xmlns, but with an XML processing instruction) should:

  1. Use the xmlreader class to read the HTML tags, specifying the HTML entities;
  2. Pass the correct implicit close tag flag to the tags that require it (meta, img, br, etc.);
  3. Use the correct implied tag rules;
  4. Map the HTML, SVG and MathML tags to the correct namespaces.
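As a rough illustration of steps 2 to 4, the lookups involved could be sketched as below. This is only a sketch under stated assumptions: `is_void_element`, `implies_close` and `namespace_for` are hypothetical helper names, not part of the Cainteoir API; tag names are assumed to be lowercased already; and the implied-close rules cover only a small subset of what HTML defines.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helpers for illustration only; none of these names exist in Cainteoir.

// Step 2: tags that never have an end tag, so the reader must close them implicitly.
// Assumes `name` has already been lowercased.
inline bool is_void_element(const std::string &name)
{
    static const std::unordered_set<std::string> void_elements = {
        "area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr",
    };
    return void_elements.count(name) != 0;
}

// Step 3: does opening `next` imply that the still-open `open` element is closed?
// Only a small subset of the HTML implied end tag rules is shown here.
inline bool implies_close(const std::string &open, const std::string &next)
{
    if (open == "li") return next == "li";
    if (open == "dt" || open == "dd") return next == "dt" || next == "dd";
    if (open == "p") return next == "p" || next == "div" || next == "ul" || next == "ol";
    return false;
}

// Step 4: choose the namespace for a tag. The svg and math root tags switch
// namespace; any other tag inherits its parent's namespace, defaulting to XHTML.
inline std::string namespace_for(const std::string &name, const std::string &parent_ns)
{
    if (name == "svg") return "http://www.w3.org/2000/svg";
    if (name == "math") return "http://www.w3.org/1998/Math/MathML";
    return parent_ns.empty() ? "http://www.w3.org/1999/xhtml" : parent_ns;
}
```

With these lookups, a reader that sees `<li>` while another `li` is open would emit the missing `</li>` first, and a `circle` inside `svg` would keep the SVG namespace by passing its parent's namespace down.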

After this, the HTML markup will be in a form that can be processed as XML using the generic XML content processor via CSS.

This requires the current XML reader to be reworked to support extensions.

The current html_reader will be renamed to xhtml_reader, and a new html_reader will be implemented by extending the current xml_reader in the xmlreader.hpp file. This allows the HTML to XML conversion to be tested on its own, via a tidy test application akin to HTML Tidy. The tests for this should reside in the tests/html/tidy directory.
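One possible shape for the extension point is a virtual hook that the HTML reader overrides. The sketch below is an assumption about the design, not the actual xmlreader.hpp interface: `xml_reader_sketch`, `html_reader_sketch`, `write_open_tag` and `closes_implicitly` are all hypothetical names, and the void tag list is cut down to three entries.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical base class standing in for the reworked XML reader.
class xml_reader_sketch
{
public:
    virtual ~xml_reader_sketch() = default;

    // Write an open tag as a tidy-style tool would emit it in XML form.
    std::string write_open_tag(const std::string &name, bool explicit_self_close = false) const
    {
        bool empty = explicit_self_close || closes_implicitly(name);
        return "<" + name + (empty ? "/>" : ">");
    }

protected:
    // Extension hook: plain XML never closes a tag implicitly.
    virtual bool closes_implicitly(const std::string &) const { return false; }
};

// Hypothetical HTML reader that extends the XML reader through the hook.
class html_reader_sketch : public xml_reader_sketch
{
protected:
    bool closes_implicitly(const std::string &name) const override
    {
        // Reduced list for illustration; a real reader would cover every void tag.
        static const std::unordered_set<std::string> void_tags = {"br", "img", "meta"};
        return void_tags.count(name) != 0;
    }
};
```

A tidy-style test could then feed the same tag to both readers and compare the output: the XML reader leaves `<br>` as an open tag, while the HTML reader writes it as `<br/>`.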
