metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma
MIT License
1 stars 2 forks source link

It seems to add unnecessary white space after math to stem conversion #89

Open skalee opened 3 years ago

skalee commented 3 years ago

Some weird and unnecessary (?) space is added before and after stem[...]:

https://github.com/metanorma/reverse_adoc/blob/99ec3a1164289f9fc2736d770406d9b09e20f787/lib/reverse_adoc/converters/math.rb#L14

For example:

require "reverse_adoc"
s = "Converted <math><mi>x</mi></math>, but some odd spacing was added."
ReverseAdoc.convert(s, mathml2asciimath: true).strip
# => "Converted stem:[x] , but some odd spacing was added."
ReverseAdoc.convert(s, mathml2asciimath: false).strip
"Converted stem:[<math><mi>x</mi></math>] , but some odd spacing was added."

I have strong suspicion that this is an intentional tradeoff to avoid gibberish in situations like something<math></math> which would be otherwise converted to somethingstem:[]. But for short inline formulas this is somewhat harmful, as illustrated above. That space before a comma is not only ugly, but may also cause a line break.

I have two ideas how to fix the issue:

  1. Using a different delimiter, perhaps zero width space, or zero width non-break space, or zero width joiner. Note that these aren't actually spaces, they belong to formatting (Cf) rather than space (Zs) category. I'm not sure of implications. See: https://jkorpela.fi/chars/spaces.html
  2. Some smart detection whether space is really needed or not. If this is difficult at parse time, then we can temporarily insert some private-use characters or sentinels, and replace them during post-processing.

@opoudjis Thoughts?

ronaldtse commented 3 years ago

@skalee you are right:

I have strong suspicion that this is an intentional tradeoff to avoid gibberish in situations like something which would be otherwise converted to somethingstem:[].

This was a fix made by @w00lf .

I am not sure whether a zero-width something before stem:[...] will work.

I think smart detection works better and cleaner.

skalee commented 3 years ago

It works:

[1] pry(main)> require "asciidoctor"
=> true
[3] pry(main)> s = "Some math\uFEFFstem:[x]\uFEFF."
=> "Some mathstem:[x]."
[4] pry(main)> Asciidoctor.convert(s)
=> "<div class=\"paragraph\">\n<p>Some math\\$x\\$.</p>\n</div>"
[5] pry(main)> out = Asciidoctor.convert(s)
=> "<div class=\"paragraph\">\n<p>Some math\\$x\\$.</p>\n</div>"
[6] pry(main)> out.split("\uFEFF")
=> ["<div class=\"paragraph\">\n<p>Some math", "\\$x\\$", ".</p>\n</div>"]
opoudjis commented 3 years ago

ZWJ is clever, FWIW.

ronaldtse commented 3 years ago

Then let’s go for it. Thanks @skalee !

skalee commented 3 years ago

~Note that using ZWJ will result with odd-looking sequences of ZWJ and a regular space or ZWJ and line feed in plenty of cases. (Though I can sanitize that in post-processing if needed.)~

EDIT: Scratch that. We have that already, it just needs a few tweaks.

https://github.com/metanorma/reverse_adoc/blob/99ec3a1164289f9fc2736d770406d9b09e20f787/lib/reverse_adoc/cleaner.rb#L22-L24

skalee commented 3 years ago

The more I think about this one, the more doubts I have.

Concepts will be edited in Paneron. I'm worried that injecting such non-printable characters may turn out to be a poor idea. Unless Paneron will be clever enough to hide ZWJs from users.

On the other hand, removing excessive white spaces from around stem macro will be much easier than I initially thought.

cc @strogonoff

hmdne commented 1 month ago

We could apply a similar logic to what we have here:

https://github.com/metanorma/coradoc/blob/main/lib/coradoc/reverse_adoc/converters/a.rb#L27-L34

Ie. add spaces only when they are really needed.