sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

Using entities in XML and MathML #2144

Open Omikhleia opened 1 month ago

Omikhleia commented 1 month ago

This relates to MathML, but raises some more interesting points regarding the "general" parsing of XML (#2111)...

Context

MathML in SIL-XML, with a formula obtained from an external source...

<document>
<mathml mode="display">
  <mrow>
    <mrow>
      <mo>&lang;</mo>
      <mrow>
        <mi>&psi;</mi>
        <mspace width="0.17em" />
        <mrow>
....

SILE (well, actually our lxp parser) fails with: ! undefined entity at math-showcase/mathml/joe10.xml
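
The failure is not specific to SILE: expat itself rejects a reference to an entity that no DTD declares. Here is a minimal reproduction, written against Python's `xml.parsers.expat` binding only because it wraps the same expat library. The helper name `try_parse` is made up for illustration, and lxp's exact message format may differ.

```python
import xml.parsers.expat
from xml.parsers.expat import errors

def try_parse(xml_text):
    """Return None on success, or expat's error message on failure."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        # err.code is expat's numeric error code; map it back to its message.
        return errors.messages[err.code]
    return None

# An HTML/MathML entity that nothing declares.
print(try_parse("<mo>&lang;</mo>"))
# One of the five entities predefined by XML itself.
print(try_parse("<mo>&lt;</mo>"))
```

Only the predefined XML entities (`&lt;`, `&gt;`, `&amp;`, `&apos;`, `&quot;`) parse without a declaration.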

Workarounds

How can we possibly support MathML formulas that use HTML/MathML entities? The list of those is quite big. Did I say "big"? (the latter even has a discussion on phi / varphi etc.)...

So...

  1. One can replace all (HTML) entities in the original MathML file, either by their symbol or by their &#xXXXX; code point... but it's cumbersome and tedious in any reasonable workflow...
  2. One can hack the inputter so as to search-and-replace entities before XML parsing... but it's crazy performance-wise and sounds rather dumb (having to substitute strings in a whole document, before parsing it? No way!)
  3. One can add a DOCTYPE to the document, such as:
    <!DOCTYPE document [
      <!ENTITY times "&#x00D7;">
      <!ENTITY lang "&#x27E8;">
      <!ENTITY psi "&#x03C8;">
     ...
    ]>

    ... But it's also crazy and cumbersome.

  4. One can hack the inputter to stuff that big DTD automatically at the top of the content before parsing... But that's not ideal performance-wise either (having lua-expat parse the same in-text DTD again and again...)
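
To make workaround 4 concrete, here is a rough sketch of generating the DOCTYPE from a table and prepending it before parsing. It uses Python's expat binding rather than lua-expat. The `ENTITIES` excerpt and the helper names `build_doctype` and `parse_text` are hypothetical, and the sketch assumes the source has neither an XML declaration nor a DOCTYPE of its own.

```python
import xml.parsers.expat

# Hypothetical excerpt of the entity table; the real HTML/MathML list is far longer.
ENTITIES = {"times": 0x00D7, "lang": 0x27E8, "psi": 0x03C8}

def build_doctype(root_name, entities):
    """Render an internal DTD subset declaring every entity in the table."""
    decls = "\n".join(
        f'  <!ENTITY {name} "&#x{cp:04X};">'
        for name, cp in sorted(entities.items())
    )
    return f"<!DOCTYPE {root_name} [\n{decls}\n]>\n"

def parse_text(xml_text):
    """Concatenate all character data reported by expat."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return "".join(chunks)

source = "<document><mo>&lang;</mo><mi>&psi;</mi></document>"
# Assumption: no XML declaration and no existing DOCTYPE in `source`.
print(parse_text(build_doctype("document", ENTITIES) + source))
```

The drawback mentioned above is visible here: the whole generated subset is parsed again for every document, however few entities that document actually uses.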

A real solution?

The key point here is to enforce NotStandalone, and provide a SkippedEntity handler that does the replacements with a table... Extensible, flexible, clever performance-wise, and still allowing explicit DTD entity declaration as override.
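
A minimal sketch of that idea follows, again with Python's expat binding rather than lua-expat, so it shows the mechanism and is not the SILE implementation. `UseForeignDTD(True)` is my assumption for the "not standalone" step: it makes expat behave as if an external subset existed, so undeclared entities are reported as skipped instead of being fatal. The `FALLBACK_ENTITIES` excerpt and the function name are hypothetical.

```python
import xml.parsers.expat

# Hypothetical excerpt of a fallback table (entity name -> replacement text).
FALLBACK_ENTITIES = {"times": "\u00D7", "lang": "\u27E8", "psi": "\u03C8"}

def parse_with_fallback(xml_text, table=FALLBACK_ENTITIES):
    """Collect character data, resolving undeclared entities from `table`."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Pretend an external DTD subset exists, so that an undeclared entity
    # is handed to the skipped-entity handler instead of aborting the parse.
    parser.UseForeignDTD(True)

    def on_skipped_entity(name, is_parameter_entity):
        # Anything missing from the table is still treated as an error.
        if is_parameter_entity or name not in table:
            raise ValueError(f"unknown entity: &{name};")
        chunks.append(table[name])

    parser.CharacterDataHandler = chunks.append
    parser.SkippedEntityHandler = on_skipped_entity
    parser.Parse(xml_text, True)
    return "".join(chunks)

print(parse_with_fallback("<mrow><mo>&lang;</mo><mi>&psi;</mi></mrow>"))
```

The table is only consulted for entities the parser could not resolve, so nothing is substituted or re-parsed up front. As far as I can tell from expat's lookup order, an entity declared in the document's own DOCTYPE is expanded before the handler is ever called, which matches the override behaviour described above. One caveat of this sketch: expat does not report skipped entities inside attribute values, so those would need separate handling.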

But of course, we don't want to do this for any random XML document. Those might have their own entities, not the HTML ones... And some of the ideas mentioned in #2111 (dedicated XML inputters, possibly with other schema-based rules on space handling etc.) are perhaps more sound than ever...

Any opinions on the topic, before I start hacking like a madman? ;)

(EDIT: Fixed the SkippedEntity code example)