jgm / pandoc

Universal markup converter
https://pandoc.org

Feature request: convert \mathbf symbols in latex math to unicode representations instead of capitalizing #5741

Open kzvi opened 5 years ago

kzvi commented 5 years ago

Current behavior:

$ pandoc -t plain <<< '$A \mathbf x = \lambda \mathbf x$'
AX = λX

Desired behavior:

$ pandoc -t plain <<< '$A \mathbf x = \lambda \mathbf x$'
A𝐱 = λ𝐱

This would be useful because it makes the output a more accurate / recognizable representation of the formula. Pandoc already does this for \mathbb symbols. Unicode has characters specifically for this purpose listed here.
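
A minimal sketch of the mapping being requested, for illustration only. It assumes nothing from pandoc or texmath: `toMathBold` is a hypothetical helper name, and the only facts relied on are the start code points of the bold runs in Unicode's Mathematical Alphanumeric Symbols block (bold capital A at U+1D400, bold small a at U+1D41A, bold digit zero at U+1D7CE).

```haskell
import Data.Char (chr, ord, isAsciiUpper, isAsciiLower, isDigit)

-- | Hypothetical helper: map an ASCII letter or digit to its
-- Mathematical Bold counterpart. Anything else (Greek letters,
-- operators, spaces) passes through unchanged.
toMathBold :: Char -> Char
toMathBold c
  | isAsciiUpper c = shift 'A' 0x1D400  -- MATHEMATICAL BOLD CAPITAL A
  | isAsciiLower c = shift 'a' 0x1D41A  -- MATHEMATICAL BOLD SMALL A
  | isDigit c      = shift '0' 0x1D7CE  -- MATHEMATICAL BOLD DIGIT ZERO
  | otherwise      = c
  where
    -- Keep the character's offset within its ASCII run and move it
    -- to the start of the corresponding bold run.
    shift base start = chr (start + ord c - ord base)
```

With this, `map toMathBold "x"` yields the single character U+1D431 (𝐱), while characters such as `λ` and `=` are left as they are.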

mb21 commented 5 years ago

Interesting proposal... any idea what the font support for those characters is on Windows and Linux? (I can see them fine on macOS)....

By the way, this issue probably belongs in https://github.com/jgm/texmath

alerque commented 5 years ago

They display just fine for me in the browser and terminal on Linux. This sounds like a good way to preserve more data between formats. Unless ASCII output is desired, using the proper Unicode representations that exist for exactly these meanings seems like the right thing to do.

jgm commented 5 years ago

Texmath has Text.TexMath.Writers.Pandoc, which pandoc uses for the default math translations. This has

renderStr :: TextType -> String -> Inline
renderStr tt s =
  case tt of
       TextNormal       -> Str s
       TextBold         -> Strong [Str s]
       TextItalic       -> Emph   [Str s]
       TextMonospace    -> Code nullAttr s
       TextSansSerif    -> Str s
       TextDoubleStruck -> Str $ toUnicode tt s
       TextScript       -> Str $ toUnicode tt s
       TextFraktur      -> Str $ toUnicode tt s
       TextBoldItalic    -> Strong [Emph [Str s]]
       TextSansSerifBold -> Strong [Str s]
       TextBoldScript    -> Strong [Str $ toUnicode tt s]
       TextBoldFraktur   -> Strong [Str $ toUnicode tt s]
       TextSansSerifItalic -> Emph [Str s]
       TextSansSerifBoldItalic -> Strong [Emph [Str s]]

So as you can see a toUnicode transformation is done for e.g. fraktur, but for boldface, the pandoc Strong element is used.

If we changed this, it would change not just plain output but all formats, e.g. in HTML you'd get unicode boldface characters instead of a <strong> tag. That might not be so bad, if font support is consistent enough.

Another alternative would be to change the plain writer's handling of Strong, so that it uses unicode boldface when possible, instead of capitalizing. This would affect not just math but everything.

A third possibility would be to change the handling of Strong in the plain writer, but only in math contexts.
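
To make the first alternative concrete, here is a rough sketch of what the change to the bold case could look like. It is not the texmath source: `TextType` and `Inline` below are cut-down stand-ins for the real types, and `renderStrCurrent`, `renderStrOption1`, and `boldChar` are hypothetical names chosen for the illustration.

```haskell
import Data.Char (chr, ord, isAsciiUpper, isAsciiLower, isDigit)

-- Stand-ins for the real types; only the cases needed here are modelled.
data TextType = TextNormal | TextBold | TextItalic
  deriving (Eq, Show)

data Inline = Str String | Strong [Inline] | Emph [Inline]
  deriving (Eq, Show)

-- Behaviour as in the excerpt above: bold becomes a Strong element.
renderStrCurrent :: TextType -> String -> Inline
renderStrCurrent tt s = case tt of
  TextNormal -> Str s
  TextBold   -> Strong [Str s]
  TextItalic -> Emph [Str s]

-- First alternative: bold becomes a bare Str of Unicode bold characters,
-- so every output format sees those characters instead of Strong.
renderStrOption1 :: TextType -> String -> Inline
renderStrOption1 TextBold s = Str (map boldChar s)
renderStrOption1 tt       s = renderStrCurrent tt s

-- Hypothetical helper: shift ASCII letters and digits into the
-- Mathematical Bold runs (U+1D400, U+1D41A, U+1D7CE).
boldChar :: Char -> Char
boldChar c
  | isAsciiUpper c = chr (0x1D400 + ord c - ord 'A')
  | isAsciiLower c = chr (0x1D41A + ord c - ord 'a')
  | isDigit c      = chr (0x1D7CE + ord c - ord '0')
  | otherwise      = c
```

The trade-off is visible in the types: once the result is a bare `Str`, a writer such as HTML no longer has a `Strong` node to turn into a tag.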

kzvi commented 5 years ago

I think that there is a sense in which translating bold math variables into <strong> tags is semantically incorrect, since <strong> is designed to mean "important" whereas <b> is designed to mean "bold".

jgm commented 5 years ago

Yes, that's right technically. But as a practical matter, the point of this module is to convert math into something that will render in EVERY format pandoc supports. Pandoc's data model has a Strong constructor, so by using that we can get decent results in every format.

But we could change things in any of the three ways outlined above.

hftf commented 5 years ago

Let’s step back a moment to see the big picture. Task: Convert math notation to another format. There are three scenarios, depending on the type of format. In order of increasing “lossiness”:

  1. Formats with specific markup for math notation. Examples:
    • LaTeX, MathJax, KaTeX: $\lambda \mathbf x$
    • MathML, OMML: <math><mi>λ</mi><mi mathvariant="bold">x</mi></math>
  2. Formats with only generic markup, which can approximate some math notation. Examples:
    • HTML: λ<strong>x</strong>
    • Markdown: λ**x**
  3. No markup. In plain text, generic markup is not an option, so Unicode is the last resort: λ𝐱

Each type of format also supports the capabilities below it.

Some generic markup formats support embedding specific markup via plugins, but that capability is not really part of the format itself. Example: HTML or Markdown + MathJax: $\lambda \mathbf x$.


With that in mind, let’s evaluate your three options.

Your third option is best. It would require a new function akin to renderStr that emits Unicode but does not try to emit any markup like Strong or Code, since they are unrepresentable in plain text.
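
A sketch of what such a markup-free function might look like, under loud assumptions: `TextType` here is a cut-down stand-in, `renderStrPlain`, `styleStarts`, and `restyle` are hypothetical names, and only two styles whose Unicode runs are contiguous (bold and monospace) are handled. Styles with irregular runs, such as italic (whose small h lives at U+210E outside the block), are simply left unstyled in this sketch.

```haskell
import Data.Char (chr, ord, isAsciiUpper, isAsciiLower, isDigit)

-- Cut-down stand-in for the real TextType.
data TextType = TextNormal | TextBold | TextItalic | TextMonospace
  deriving (Eq, Show)

-- | Start code points (capital A, small a, digit zero) of a style's
-- runs in the Mathematical Alphanumeric Symbols block.
styleStarts :: TextType -> Maybe (Int, Int, Int)
styleStarts TextBold      = Just (0x1D400, 0x1D41A, 0x1D7CE)
styleStarts TextMonospace = Just (0x1D670, 0x1D68A, 0x1D7F6)
styleStarts _             = Nothing  -- not handled in this sketch

-- | Text-only counterpart to renderStr: returns a String, never markup.
renderStrPlain :: TextType -> String -> String
renderStrPlain tt s = case styleStarts tt of
  Nothing     -> s
  Just starts -> map (restyle starts) s

-- Move ASCII letters and digits into the given runs; leave the rest.
restyle :: (Int, Int, Int) -> Char -> Char
restyle (upper, lower, digit) c
  | isAsciiUpper c = chr (upper + ord c - ord 'A')
  | isAsciiLower c = chr (lower + ord c - ord 'a')
  | isDigit c      = chr (digit + ord c - ord '0')
  | otherwise      = c
```

Because the result type is `String` rather than `Inline`, there is no way for this function to emit Strong or Code by accident.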

Your first option is okay. I could imagine some users wanting the choice (a command-line switch?) to “abuse” semantic tags like <strong>. Fonts with Unicode math support are now widespread, but it depends on the document’s audience (mobile, for author’s use only, etc.). Either default seems fine.

Your second option, if I understand correctly, proposes to make all strongly emphasized text appear 𝐛𝐨𝐥𝐝 by abusing Unicode math characters. I reject this outright as even more semantically untenable than abusing HTML tags. This is the job of “fancy text generators” for teens’ tweets, not Pandoc.


However, the main culprit of this issue and many others (#3518 #3766 etc.) is the actively confusing status quo that plain output uses Project Gutenberg conventions (including converting emphasis to uppercase). This is not what most people nowadays assume “plain text” means – it should act like the mere loss of formatting caused by pasting formatted (“rich”) text into a basic textbox.

Two years ago I started writing a proposal to revamp plain output, but didn’t finish. If you want to take up my dormant case, here are my notes. As @jgm said, it would be best to start a mailing list thread about this proposal. Sorry for straying off topic and not putting effort into my own crusade.

In summary, to quote your second option very selectively, I would advocate to drastically “change the plain writer’s handling of Strong… instead of capitalizing” – but in a separate issue, and not quite in the way you suggested.

alerque commented 5 years ago

@hftf I think your format analysis should be amended. Exact hits on relevant Unicode is not a last resort and should take priority even over your no.2 approximation formats.

  1. Native math format support
  2. Relevant Unicode where available
  3. Visual approximations
  4. Plain approximations

jgm commented 5 years ago

@hftf to be clear, all three of my options are only meant to affect rendering of math. It won't affect all strongly emphasized text.

hftf commented 5 years ago

Thank you both for the clarifications.

@alerque Sorry if the meaning of my list was unclear. It shows which formats are “more capable.” For example, type 2 is capable of both generic markup and Unicode/plain text, but 3 is only capable of Unicode/plain text. A function intended for type 2 is of little use for 3 since it lacks that capability. In terms of priority, however, I can agree with your list.

@jgm I must be confused since your comment said “This would affect not just math but everything.”

jgm commented 5 years ago

I must be confused since your comment said “This would affect not just math but everything.”

No, I was the one who was confused. You are right. Yes, I agree, better to stay away from that option!

To summarize the two options that remain:

Option 1: Change Text.TexMath.Writers.Pandoc.renderStr so that instead of a Strong element, it uses unicode boldface characters. This would change not just plain output but all formats, e.g. in HTML you'd get unicode boldface characters instead of a <strong> tag. That might not be so bad, if font support is consistent enough.

Option 2: Change handling of Strong in plain writer, but only in math contexts. This would localize the change to plain.
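
A rough sketch of how Option 2 could stay local to plain output, assuming the plain writer already has the list of inlines produced from a math expression in hand. Everything here is a stand-in: `Inline` models only two constructors, `plainText` imitates the capitalizing convention for Strong, and `boldMath` and `renderMathPlain` are hypothetical names.

```haskell
import Data.Char (chr, ord, toUpper, isAsciiUpper, isAsciiLower, isDigit)

-- Cut-down stand-in for pandoc's Inline type.
data Inline = Str String | Strong [Inline]
  deriving (Eq, Show)

-- Stand-in for the plain writer's existing convention:
-- Strong text is capitalized.
plainText :: Inline -> String
plainText (Str s)     = s
plainText (Strong xs) = map toUpper (concatMap plainText xs)

-- Rewrite applied only to inlines that came from math:
-- a Strong wrapping a single Str becomes Unicode bold characters.
boldMath :: Inline -> Inline
boldMath (Strong [Str s]) = Str (map boldChar s)
boldMath (Strong xs)      = Strong (map boldMath xs)
boldMath x                = x

-- Plain rendering of math-derived inlines; ordinary text keeps
-- going through plainText alone and is unaffected.
renderMathPlain :: [Inline] -> String
renderMathPlain = concatMap (plainText . boldMath)

-- Hypothetical helper: shift ASCII letters and digits into the
-- Mathematical Bold runs (U+1D400, U+1D41A, U+1D7CE).
boldChar :: Char -> Char
boldChar c
  | isAsciiUpper c = chr (0x1D400 + ord c - ord 'A')
  | isAsciiLower c = chr (0x1D41A + ord c - ord 'a')
  | isDigit c      = chr (0x1D7CE + ord c - ord '0')
  | otherwise      = c
```

In this sketch, `renderMathPlain [Str "A", Strong [Str "x"]]` gives `A` followed by U+1D431 (𝐱), whereas `plainText (Strong [Str "x"])` on ordinary text still gives `X`.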

@hftf you've raised another issue: should we depart from the current "Project Gutenberg" conventions for plain output? I can see a case for this, and it might be possible, for example, to add a +gutenberg extension, or perhaps a gutenberg target, and make plain plainer. But this should be put on the tracker as a separate issue, I think.

hftf commented 5 years ago

Would you like to raise that issue? I’m awfully busy now, but could try to work on it in a few months.

I think this issue also raises some related issues, so I’m not sure how many should be filed.

  1. Should a plain writer be decoupled from Markdown writer? (closely related to plain revamp)
  2. Should the codebase ever need to capitalize running text? (I exclude fake small caps¹ and normalizing symbols like entity or attribute names².) This type of capitalization is an anomaly in the codebase as it is only used in plain to mimic two Project Gutenberg conventions (viz. strong emphasis and level-1 headings). I’m not convinced users want this behavior – evidence shows many surprised reactions.

____
¹ See writers for CommonMark, Markdown or plain, FB2, and Ms.
² See usage of toUpper etc.