glossarist / iev-data

Switch to plurimath/html2math #149

Open skalee opened 3 years ago

skalee commented 3 years ago

In order to reduce code duplication in projects, extract logic to another gem. It looks like the most up-to-date version is here: https://github.com/metanorma/stepmod-utils/blob/728bd50bf609afd6c7ef0a6848f45a8419a57819/lib/stepmod/utils/html_to_asciimath.rb.


Extracted from #144:

@skalee we have copied the 'fake math conversion' code to here: https://github.com/metanorma/stepmod-utils/blob/728bd50bf609afd6c7ef0a6848f45a8419a57819/lib/stepmod/utils/html_to_asciimath.rb

And this is probably the time to extract this 'fake math conversion' functionality into a separate gem under the Plurimath umbrella. Can you help with that? Thanks.

skalee commented 3 years ago

@ronaldtse If you happen to have a test suite, or a technical description of the input format, that would be very helpful.

skalee commented 3 years ago

I'll do my best, but it won't be very reliable. For example https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-02-13 — I can probably detect and convert j<i>b</i>, but I won't detect lone j which also happens in the definition.

ronaldtse commented 3 years ago

@skalee unfortunately I don't have a set of compiled examples. There are definitely enough examples from the IEV, and I know that @w00lf has encountered some formulas that required work on top of the original code, perhaps he has some specs/examples to provide.

skalee commented 3 years ago

Thanks! I've extracted some from IEV. If @w00lf has a set of troublesome examples, it would be great to check them too.

skalee commented 3 years ago

@ronaldtse Short follow-up:

I'm doing fine with converting HTML math expressions to AsciiMath. It's certainly doable and I've already developed a tool which supports many features they use in IEV.

The difficult part is telling HTML math apart from rich text. It's easy for a human but not necessarily for a computer. Detecting numbers isn't reliable, because they may be used in different contexts. Detecting operators isn't reliable, because a minus can be confused with a dash. Detecting <i> isn't reliable, because this tag isn't used exclusively for math (e.g. in https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=845-32-051). And so on. I believe that I can come up with some heuristics and I'm working on that now, but they may fail to detect even some of the simplest formulas.
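
To illustrate the ambiguity, here is a minimal Ruby sketch of one possible heuristic. It is not the converter's actual rule: the `looks_like_math?` helper, its regular expression, and the sample strings are all assumptions made up for this example.

```ruby
# Hypothetical heuristic: treat an <i> element as math only when it wraps
# a single Latin letter or a single named HTML entity.
SINGLE_SYMBOL_ITALIC = %r{<i>(?:[A-Za-z]|&[A-Za-z]+;)</i>}

# Returns true when the fragment contains at least one italic single symbol.
def looks_like_math?(html)
  html.match?(SINGLE_SYMBOL_ITALIC)
end

looks_like_math?("j<i>b</i>")                  # => true
looks_like_math?("see <i>luminous flux</i>")   # => false (italic phrase, not math)
looks_like_math?("the imaginary unit j")       # => false (but the lone j is math)
```

The last call shows the limitation: a symbol that carries no markup at all is invisible to any rule based on tags.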

But perhaps it isn't needed at all? Perhaps we can keep HTML math as rich text, and IEC will gradually convert them to formulas during their ongoing work on these concepts? I know that it will take years. The question is if they really need anything more than that. And we need rich text conversion from HTML to AsciiDoc anyway.

ronaldtse commented 3 years ago

But perhaps it isn't needed at all? Perhaps we can keep HTML math as rich text, and IEC will gradually convert them to formulas during their ongoing work on these concepts? I know that it will take years. The question is if they really need anything more than that. And we need rich text conversion from HTML to AsciiDoc anyway.

We have the following agreement with the IEV team on semantic enrichment:

Given that it is very difficult to bring semantic enrichment to 100%, I think best effort is acceptable.

We have to further consider that any "units" used in the IEV should also be converted into semantic units, i.e. UnitsML.

For now let's delegate the decision on what "good enough" in math means here to you, since you are knee deep in this 😉

skalee commented 3 years ago

Then I guess heuristics will do.

skalee commented 3 years ago

I'm pretty sure that some concepts need to be fixed, otherwise we'll end up with nasty false positives. One example is https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-03-30, this fragment precisely: forefinger(<b><i>V</i></b>). Because there is no space before (, it looks like a function call. I hope that I'll be able to provide a list of required fixes at some point in the future.

ronaldtse commented 3 years ago

In this case the heuristic could know that “forefinger” is too long for a math symbol, but it’s by no means a great rule. Let us also report this to IEC.

skalee commented 3 years ago

Length checks will not work. There are formulas which would be broken this way, for example this one in https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=112-01-13:

dim(refractive index <i>n</i> = <i>c</i><sub>0</sub>/<i>c</i>) = (LT<sup>–1</sup>)<sup>0</sup>
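
A minimal Ruby sketch of why a length rule fails here. The `prose_by_length?` helper and the threshold of 4 are assumptions chosen only for illustration; no such rule is taken from the existing code.

```ruby
# Hypothetical rule: a fragment is prose if it contains any alphabetic run
# longer than MAX_SYMBOL_LENGTH characters (the threshold is arbitrary).
MAX_SYMBOL_LENGTH = 4

def prose_by_length?(fragment)
  text = fragment.gsub(/<[^>]+>/, "") # drop HTML tags, keep their content
  text.scan(/[A-Za-z]+/).any? { |word| word.length > MAX_SYMBOL_LENGTH }
end

# Correctly rejected: "forefinger" is not a function name.
prose_by_length?("forefinger(<b><i>V</i></b>)") # => true

# Wrongly rejected: this is a formula, but "refractive" and "index" are long.
prose_by_length?("dim(refractive index <i>n</i> = <i>c</i><sub>0</sub>/<i>c</i>) = (LT<sup>–1</sup>)<sup>0</sup>") # => true
```

Both fragments get the same answer, so the rule cannot separate the false positive from the real formula.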

skalee commented 3 years ago

@ronaldtse What should we do with lone Greek letters which aren't part of longer mathematical formulas, as in the following example (https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=103-07-03)?

the angular frequency is <i>&omega;</i>.

  1. Should they be converted to stem:[omega]?
     the angular frequency is stem:[omega].
  2. Should they be left in HTML entity syntax, which is recognized in AsciiDoc (see https://docs.asciidoctor.org/asciidoc/latest/subs/replacements/)?
     the angular frequency is _&omega;_.
  3. Or should they be converted to a regular Greek letter?
     the angular frequency is _ω_.
  4. Or maybe there is some other idea how to handle them?

ronaldtse commented 3 years ago

dim(refractive index <i>n</i> = <i>c</i><sub>0</sub>/<i>c</i>) = (LT<sup>–1</sup>)<sup>0</sup>

True, this probably will require manual conversion.

What to do with lone Greek letters which aren't part of longer mathematical formulas

They should be converted to normal stem:[xxx]. In the future we may further enrich them.
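
As a rough Ruby sketch of that rule, assuming the lone letter always arrives as a named entity wrapped in <i>: the `greek_to_stem` helper and the shortened `GREEK_NAMES` list are hypothetical and cover only a handful of letters.

```ruby
# Illustrative subset of Greek letter names that are spelled the same way
# as HTML entities and as AsciiMath symbols.
GREEK_NAMES = %w[alpha beta gamma delta epsilon theta lambda mu pi sigma phi omega].freeze

# Matches an italicised Greek entity such as <i>&omega;</i>.
LONE_GREEK = /<i>&(#{GREEK_NAMES.join('|')});<\/i>/

# Replaces each italicised Greek entity with an AsciiDoc stem macro.
def greek_to_stem(html)
  html.gsub(LONE_GREEK) { "stem:[#{Regexp.last_match(1)}]" }
end

greek_to_stem("the angular frequency is <i>&omega;</i>.")
# => "the angular frequency is stem:[omega]."
```

Entities outside the list are left untouched, so they can be handled by a later rule or reported.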

skalee commented 3 years ago

@ronaldtse What should we do if a given formula cannot be represented in AsciiMath, typically due to unsupported symbols? For example in 103-03-01:

Note 3 to entry: Notation H(<i>x</i>) is also used. Notation &thetasym;(<i>t</i>) is used for the unit step function of time. Notation &upsih;(<i>x</i>) has also been used.

| HTML     | LaTeX      | AsciiMath |
|----------|------------|-----------|
| thetasym | vartheta   | vartheta  |
| upsih    | varUpsilon | ???       |

Fall back to MathML, perhaps? Use LaTeX or Unicode? Or do you have a better idea? Opening a feature request in AsciiMath may work too in the long run.
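
One of these options, sketched in Ruby: look the entity up in an AsciiMath table first and fall back to a Unicode character when no name exists. The `convert_entity` helper, both tables, and the choice of Unicode as the fallback are assumptions for illustration only; nothing has been decided yet.

```ruby
# Entities with a known AsciiMath name (illustrative subset).
ASCIIMATH_NAMES = { "thetasym" => "vartheta", "omega" => "omega" }.freeze

# Entities without an AsciiMath name, mapped to their Unicode character.
# U+03D2 is GREEK UPSILON WITH HOOK SYMBOL.
UNICODE_FALLBACK = { "upsih" => "\u03D2" }.freeze

# Returns [notation, text] so the caller can see which path was taken.
def convert_entity(name)
  if ASCIIMATH_NAMES.key?(name)
    [:asciimath, ASCIIMATH_NAMES[name]]
  elsif UNICODE_FALLBACK.key?(name)
    [:unicode, UNICODE_FALLBACK[name]]
  else
    [:unsupported, "&#{name};"] # leave the entity as-is for manual review
  end
end

convert_entity("thetasym") # => [:asciimath, "vartheta"]
convert_entity("upsih")    # => [:unicode, "ϒ"]
```

Returning the notation alongside the text makes it easy to collect a list of symbols that needed a fallback.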

skalee commented 3 years ago

Perhaps a better question would be: Given that AsciiMath is generally preferred but unsuitable for more complicated formulas, which syntax should be supplemental: LaTeX or MathML?

skalee commented 3 years ago

Short follow-up. The current plan is:

  1. Convert HTML math to AsciiMath with our converter.
  2. Convert AsciiMath to MathML with AsciiMath gem.
  3. Convert it back to AsciiMath.

While it sounds odd, there is a rationale for that.

ad 1. HTML math is sequential in its nature, AsciiMath is sequential too, MathML is more structural. It's far easier to convert HTML math to AsciiMath than to MathML. My almost-complete-converter to AsciiMath is simpler and smaller than my work-in-progress-converter to MathML.

ad 2. However, AsciiMath does not support some of the features used in MathML, especially special characters which need to be written in Unicode rather than using their English names composed of ASCII characters. That's why some HTML math formulas cannot be represented as AsciiMath in the easy-to-edit form. Or maybe I'm wrong and stem:[ϒ] is okay — but this is not English Y nor Greek upsilon, this is "ϒ Greek upsilon with hook symbol".

ad 3. However, AsciiMath is easier for users, and we want to have AsciiMath when possible. That's why we'll try to convert it back to AsciiMath and use some other notation when it's impossible.
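
The control flow of steps 2 and 3 could look roughly like the Ruby sketch below. The converters are passed in as callables, so no particular gem API is assumed; `normalize_formula`, `UnsupportedFormula`, and the two stub lambdas are hypothetical names, and MathML as the fallback notation is only one of the options discussed above.

```ruby
# Raised by a converter when a formula has no AsciiMath representation.
class UnsupportedFormula < StandardError; end

# Step 2: AsciiMath -> MathML. Step 3: MathML -> AsciiMath.
# Falls back to the intermediate MathML when step 3 is impossible.
def normalize_formula(asciimath, to_mathml:, to_asciimath:)
  mathml = to_mathml.call(asciimath)
  begin
    { syntax: :asciimath, source: to_asciimath.call(mathml) }
  rescue UnsupportedFormula
    { syntax: :mathml, source: mathml }
  end
end

# Stub converters standing in for the real ones, for demonstration only.
fake_to_mathml = ->(src) { "<math><mi>#{src}</mi></math>" }
fake_to_asciimath = lambda do |mathml|
  symbol = mathml[%r{<mi>(.*)</mi>}, 1]
  raise UnsupportedFormula, "no ASCII name for #{symbol}" unless symbol.ascii_only?
  symbol
end

normalize_formula("omega", to_mathml: fake_to_mathml, to_asciimath: fake_to_asciimath)
# => {:syntax=>:asciimath, :source=>"omega"}
normalize_formula("ϒ", to_mathml: fake_to_mathml, to_asciimath: fake_to_asciimath)
# => {:syntax=>:mathml, :source=>"<math><mi>ϒ</mi></math>"}
```

Keeping the converters injectable means the same flow works whichever gem ends up doing the actual conversion.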

ronaldtse commented 3 years ago

@skalee fully agree with the statements. Steps 2-3 will normalize the AsciiMath so it’s good.

ronaldtse commented 2 years ago

This task will depend on the plurimath gem: https://github.com/plurimath/plurimath/issues/2