lutaml / lutaml-model

LutaML Model is the Ruby data modeler part of the LutaML data modeling suite. It supports creating serialization object models (XML, YAML, JSON, TOML) and mappings to and from them.

Parsing HTML entities in XML using Nokogiri as adapter #154

Open suleman-uzair opened 1 week ago

suleman-uzair commented 1 week ago

The Nokogiri gem doesn’t handle HTML entities other than &, <, >, ", and '. The rest of the entities are ignored or replaced, but they are valid input in MathML.

This issue came up while parsing MathML in https://github.com/plurimath/mml/pull/2#:~:text=I%20added%20%22Ox%22%20as%20dependency%20because%20the%20Nokogiri%20gem%20doesn%E2%80%99t%20handle%20HTML%20entities%20other%20than%20%26%2C%20%3C%20%2C%3E%20%2C%20%22%20%2C%20and%20%27.

@ronaldtse @HassanAkbar should we consider Ox for this issue or is this implementable in Lutaml-Model?

ronaldtse commented 1 week ago

I believe Nokogiri supports only formal XML entities. However, for MathML to be built on XML, it should support XML entities?

Why do we have to use any HTML entities when we can use the character codes?

suleman-uzair commented 6 days ago

> Why do we have to use any HTML entities when we can use the character codes?

@ronaldtse, we do not need to use HTML entities, but MathML editors (MathJax, for example) do support HTML entities, and some examples also contain HTML entities (&sum; and &prod;, for example). Also, &micro; is available in the prefixes.yaml file in UnitsDB for HTML reference, which is used for MathML conversion in Unitsml-Ruby.

ronaldtse commented 6 days ago

I see, so this is purely for supporting bad XML (bad MathML editors): MathML that contains HTML entities.

When Plurimath parses HTML or MathML, sure, it can accept HTML entities. But when it outputs MathML, there is no reason for it to output HTML entities, which are unsupported in XML.

I don’t know how we can make Nokogiri support them. As far as I remember, the Nokogiri HTML parser is needed.
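The output-side point above (emit character codes rather than HTML entities) could be sketched roughly as follows. This is only an illustration, not Plurimath or Lutaml-Model code: the method name `numeric_character_references` is hypothetical, and it assumes the text has already been escaped for `&`, `<`, and `>` and is UTF-8 encoded.

```ruby
# Hypothetical helper (not part of Plurimath, Lutaml-Model, or Nokogiri).
# Replaces every non-ASCII character in already-escaped XML text with a
# hexadecimal numeric character reference, which any XML parser accepts.
def numeric_character_references(text)
  text.gsub(/[^[:ascii:]]/) { |char| format("&#x%X;", char.ord) }
end

puts numeric_character_references("<mo>\u2211</mo>")
# => <mo>&#x2211;</mo>
```

Whether this step is needed at all depends on the output encoding: a UTF-8 XML document can also carry the raw characters directly.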

opoudjis commented 6 days ago

HTML entities have caused me issues in the past, because they will turn up in markup and they are not guaranteed to be supported by Nokogiri at all. I did indeed need to use the Nokogiri HTML parser in Metanorma, and when Nokogiri forced me to stop doing so, I instead converted all HTML entities in Metanorma Asciidoc to XML entities in preprocessing: https://github.com/metanorma/metanorma-iso/issues/666

And HTML entities will turn up in markup. Declining to support them in reading documents is not an option.
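A minimal sketch of the preprocessing approach described in this thread: rewrite named HTML entities as numeric character references before the string reaches the XML parser. The names `HTML_ENTITY_CODEPOINTS` and `html_entities_to_numeric` are hypothetical, the table covers only the three entities mentioned above, and this is not necessarily how Metanorma or Lutaml-Model implement it.

```ruby
# Hypothetical preprocessing helper (not part of Lutaml-Model or Nokogiri).
# Illustrative subset only; a real table would cover the full HTML named
# character reference list.
HTML_ENTITY_CODEPOINTS = {
  "sum"   => 0x2211, # N-ARY SUMMATION
  "prod"  => 0x220F, # N-ARY PRODUCT
  "micro" => 0x00B5, # MICRO SIGN
}.freeze

# The five entities predefined by XML itself are left untouched.
XML_PREDEFINED_ENTITIES = %w[amp lt gt quot apos].freeze

def html_entities_to_numeric(xml)
  xml.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |entity|
    name = Regexp.last_match(1)
    next entity if XML_PREDEFINED_ENTITIES.include?(name)

    codepoint = HTML_ENTITY_CODEPOINTS[name]
    # Unknown names pass through unchanged so the parser can report them.
    codepoint ? format("&#x%X;", codepoint) : entity
  end
end

puts html_entities_to_numeric("<mo>&sum;</mo><mi>&micro;</mi><mo>&lt;</mo>")
# => <mo>&#x2211;</mo><mi>&#xB5;</mi><mo>&lt;</mo>
```

One caveat of a plain string substitution like this: it also rewrites entity-like text inside CDATA sections and comments, so a real implementation would need to decide how to treat those.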