jgm / pandoc

Universal markup converter
https://pandoc.org

TeXMath's "toUnicode" function is only applied for MathML output #2137

Closed: GongYiLiao closed this issue 9 years ago

GongYiLiao commented 9 years ago

The following styles in TeXMath's styleOps:

-- Note: cal and scr are treated the same way, as unicode is lacking such two different sets for those.
styleOps :: M.Map String ([Exp] -> Exp)
styleOps = M.fromList
          [ ("\\mathrm",     EStyled TextNormal)
          , ("\\mathup",     EStyled TextNormal)
          , ("\\mbox",       EStyled TextNormal)
          , ("\\mathbf",     EStyled TextBold)
          , ("\\boldsymbol", EStyled TextBold)
          , ("\\mathbfup",   EStyled TextBold)
          , ("\\mathit",     EStyled TextItalic)
          , ("\\mathtt",     EStyled TextMonospace)
          , ("\\texttt",     EStyled TextMonospace)
          , ("\\mathsf",     EStyled TextSansSerif)
          , ("\\mathsfup",   EStyled TextSansSerif)
          , ("\\mathbb",     EStyled TextDoubleStruck)
          , ("\\mathcal",    EStyled TextScript)
          , ("\\mathscr",    EStyled TextScript)
          , ("\\mathfrak",   EStyled TextFraktur)
          , ("\\mathbfit",   EStyled TextBoldItalic)
          , ("\\mathbfsfup", EStyled TextSansSerifBold)
          , ("\\mathbfsfit", EStyled TextSansSerifBoldItalic)
          , ("\\mathbfscr",  EStyled TextBoldScript)
          , ("\\mathbffrak", EStyled TextBoldFraktur)
          , ("\\mathbfcal",  EStyled TextBoldScript)
          , ("\\mathsfit",   EStyled TextSansSerifItalic)
          ]

are applied only if Pandoc's --mathml output option is used; when the --mathjax option is specified, the styleOps above are not applied.

Since the conversion between LaTeX commands and Unicode mathematical symbols is already supported in TeXMath, it would be a plus for Pandoc if it could convert LaTeX commands to Unicode symbols when the --mathjax or --katex option is specified for html5 output.
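To make the request concrete, here is a minimal sketch of what such a conversion amounts to for two of the styles. It is not TeXMath's toUnicode; the names Style, styleBases, styleChar and styleString are hypothetical and used only for illustration. It relies on the fact that the bold italic and bold fraktur letters sit in contiguous runs of Unicode's Mathematical Alphanumeric Symbols block, so a fixed offset is enough. Other styles (script, fraktur, double-struck) have gaps, for example ℋ and ℝ live elsewhere, and would need a lookup table instead.

```haskell
import Data.Char (chr, ord, isAsciiLower, isAsciiUpper)

-- | Hypothetical subset of styles: only those whose A-Z and a-z code points
-- are contiguous, so a plain offset works.
data Style = BoldItalic | BoldFraktur
  deriving (Eq, Show)

-- | Code points of the styled capital A and the styled small a.
styleBases :: Style -> (Int, Int)
styleBases BoldItalic  = (0x1D468, 0x1D482)
styleBases BoldFraktur = (0x1D56C, 0x1D586)

-- | Map one character to its styled counterpart. Characters outside A-Z,
-- a-z and (for bold italic only) small Greek alpha..omega are returned
-- unchanged.
styleChar :: Style -> Char -> Char
styleChar sty c
  | isAsciiUpper c = shift upperBase 'A'
  | isAsciiLower c = shift lowerBase 'a'
  | sty == BoldItalic && c >= '\x3B1' && c <= '\x3C9' = shift 0x1D736 '\x3B1'
  | otherwise = c
  where
    (upperBase, lowerBase) = styleBases sty
    -- Keep the distance from the start of the plain range, move the base.
    shift base start = chr (base + ord c - ord start)

-- | Apply the mapping to every character of a string.
styleString :: Style -> String -> String
styleString = map . styleChar
```

With these definitions, putStrLn (styleString BoldItalic "z") prints 𝒛 and putStrLn (styleString BoldFraktur "z") prints 𝖟, the two characters used in the test document later in this thread.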

mpickering commented 9 years ago

I'm a bit confused as to what you're asking for. Are you asking for \mathcal{H} to be rendered to the unicode character ℋ? I'm unsure this will produce better results than mathjax's native rendering, but I don't have much practical experience with mathjax.

jgm commented 9 years ago

I think it's possible to use MathJax with options to render mathml in the document. So, you can use --mathml and include a link to the mathjax script with appropriate options.

I don't want to mess with the latex we're passing to MathJax's latex math renderer.

lierdakil commented 9 years ago

From my experience, mathjax doesn't make a distinction between unicode characters and tex ones, typesetting both with its math fonts anyway. So I suppose this request is more about source readability than anything else. That said, I don't think it's a good idea.


GongYiLiao commented 9 years ago

In the case that the output format is docx, Microsoft Word definitely won't process those LaTeX commands well; thus, converting those LaTeX commands into Unicode symbols would be a better solution. Actually, in Text.TeXMath.Writers.Pandoc, the function renderStr already does this kind of work in some cases, such as TextBoldFraktur:

renderStr :: TextType -> String -> Inline
renderStr tt s =
  case tt of
       TextNormal       -> Str s
       TextBold         -> Strong [Str s]
       TextItalic       -> Emph   [Str s]
       TextMonospace    -> Code nullAttr s
       TextSansSerif    -> Str s
       TextDoubleStruck -> Str $ toUnicode tt s
       TextScript       -> Str $ toUnicode tt s
       TextFraktur      -> Str $ toUnicode tt s
       TextBoldItalic    -> Strong [Emph [Str s]]
       TextSansSerifBold -> Strong [Str s]
       TextBoldScript    -> Strong [Str $ toUnicode tt s]
       TextBoldFraktur   -> Strong [Str $ toUnicode tt s]
       TextSansSerifItalic -> Emph [Str s]
       TextSansSerifBoldItalic -> Strong [Emph [Str s]]

A customized filter may work for this, but I think a solution that converts all LaTeX symbol commands (not operator commands like \int) into Unicode symbols would be useful.

lierdakil commented 9 years ago

TeXMath has a separate output format for docx. --mathjax and other math options have no effect when converting to docx. Math options are only relevant for HTML-based output, namely Docbook, EPUB, FB2 and HTML. Finally, mathjax only makes sense in HTML output, since it needs javascript to work. I don't think I understand what you are asking for and why you need it.

If you want us to fix a particular problem, please describe this problem and provide a minimal example showing it. If you lack particular functionality, please describe your use-case. Forgive me if this sounds rude, but at the moment it seems like you are tilting at windmills. No offence.


GongYiLiao commented 9 years ago

Here's the markdown clip for testing:

# Math Tests 

## Test 1

### Without any ```LaTeX``` commands converted to Unicode symbols 

$$ e = \int_\mathbb{R} f(x | \theta) \circ g(\mathbfit{z} |\mathbfit{\eta}) dx $$

## With ```\mathbfit{*}``` commands converted to Unicode symbols before fed to ```Pandoc``` 

$$ e = \int_\mathbb{R} f(x | \theta) \circ g(𝒛|𝜼) dx $$ 

## Test 2

### Without any ```LaTeX``` commands converted to Unicode symbols 

$$ \mathsf{e} = \mathbffrak{z} $$

## With ```\mathbffrak{z}``` command converted to Unicode symbols before fed to ```Pandoc``` 

$$ \mathsf{e} = 𝖟 $$

Below is a screenshot of the output generated with the options -s -S --mathjax -t html5 (in which case the LaTeX code is not processed by Pandoc and TeXMath):

[image: pandoc_mathjax_screenshot] https://cloud.githubusercontent.com/assets/129343/7517553/531cda94-f49b-11e4-8f6e-fce41841d72f.png

In Test 1's first case (no conversion), MathJax does not recognize the \mathbfit command, so it shows warnings in red and displays "z" and "η" incorrectly (they should be in boldface). In the second case of both tests (converted to Unicode before being fed to Pandoc), there is no problem at all.

Another screenshot shows the output generated with the options -s -S -t docx (in which case the LaTeX code is partially processed by Pandoc and TeXMath):

[image: pandoc_docx_screenshot] https://cloud.githubusercontent.com/assets/129343/7517750/7902a1a2-f49c-11e4-87e7-c7510d227fb8.png

In Test 1's first case, the \mathbfit command is simply ignored (or does not work), so "z" and "η" are not in boldface. In Test 2, both cases are displayed correctly, since \mathbffrak is handled well by TeXMath's renderStr function.

From the two outputs above, we can see that mathematical letter symbols (not operator symbols) are displayed inconsistently across output targets, even though the markdown source is exactly the same.

Thus, it seems helpful to have a unified solution for processing those LaTeX commands that have corresponding standardized Unicode letter symbols. For me, a customized filter should work, but I think a unified solution in Pandoc would benefit all users.
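The preprocessing step used for the second case of each test (rewriting the macros before the text reaches Pandoc) could be sketched roughly as follows. The names rewriteMacro and boldItalicTable are hypothetical, the table covers only the two arguments from the test document, and nested braces are not handled. A real pandoc filter would presumably operate on the math elements of the document AST rather than on raw text; this is only a plain-text approximation.

```haskell
import Data.List (isPrefixOf)

-- | Hypothetical lookup table: macro argument text -> Unicode replacement.
-- Only the two arguments that occur in the test document are listed.
boldItalicTable :: [(String, String)]
boldItalicTable =
  [ ("z",     "\x1D49B")  -- mathematical bold italic small z
  , ("\\eta", "\x1D73C")  -- mathematical bold italic small eta
  ]

-- | Replace every occurrence of \macro{arg} whose argument is found in the
-- table. Unknown arguments and unterminated macros are copied through
-- unchanged, so the pass never loses input.
rewriteMacro :: String -> [(String, String)] -> String -> String
rewriteMacro macro table = go
  where
    opener = '\\' : macro ++ "{"
    go [] = []
    go input@(c:rest)
      | opener `isPrefixOf` input
      , (arg, '}':after) <- break (== '}') (drop (length opener) input)
      , Just uni <- lookup arg table
      = uni ++ go after
      | otherwise = c : go rest
```

For example, rewriteMacro "mathbfit" boldItalicTable applied to the formula in Test 1 replaces \mathbfit{z} and \mathbfit{\eta} and leaves \int, \mathbb{R} and \theta untouched, which matches the intent of converting letter symbols but not operators.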

nkalvi commented 9 years ago

@GongYiLiao Pardon me if I didn't understand the issue properly; I was going to try what @jgm suggested (and http://docs.mathjax.org/en/latest/mathml.html#mathjax-mathml-support) so I started with your example as 2137.txt:

pandoc 2137.txt -s -S --mathjax=http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML -t html5 --mathml -o 2137.html

this is what I get:

[image: 2137]

Am I missing something?

lierdakil commented 9 years ago

Thank you. Now it's much clearer.

From what I gather, these are two separate problems.

The first is that mathjax doesn't support \mathbfit, \mathbffrak and possibly other commands that TeXMath supports, at least partially. I don't think this is TeXMath's problem, strictly speaking, since it's MathJax that lacks support for those, but we certainly could implement some sort of workaround for this, or at least document this somewhere.

The second is that \mathbfit doesn't work as expected in docx output. This looks like a genuine bug in TeXMath.

Your proposed solution to both these problems is to preprocess such macros into corresponding unicode symbols. I don't think this is the best possible solution, but it's certainly on the easier side. Personally, I would prefer to address these separately, if possible.


jgm commented 9 years ago

@nkalvi if you use the --mathjax option, pandoc will include tex math. What I was suggesting is using the --mathml option, and manually including the relevant mathjax link in your HTML header. (Or pass in the entire HTML link element using -V math=....)

nkalvi commented 9 years ago

@jgm I think that's what I did, and the result seems to be what @GongYiLiao expected (without specifying additional configurations). Could you please check the command line I had?

jgm commented 9 years ago

I think this issue should be migrated over to jgm/texmath. @GongYiLiao - can you open an issue there with your test case and examples?

The problem with \mathbfit in OMML is clear enough: OMML writer, line 99:

       TextBoldItalic    -> [sty "i"]

I can't recall the details of OMML right now, but either this is just a typo on my part, or there's some kind of limitation that prevents you from having bold italics.
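For what it's worth, a sketch of what the non-typo version might look like, under the assumption (to be checked against the Office Open XML specification) that the m:val attribute of OMML's m:sty element accepts "p", "b", "i" and "bi". The TextType below is a local stand-in with only four constructors, and styValue and styElement are hypothetical helper names, not the writer's actual functions.

```haskell
-- | Local stand-in for the TextType constructors relevant here.
data TextType = TextNormal | TextBold | TextItalic | TextBoldItalic
  deriving (Eq, Show)

-- | Assumed value of the m:val attribute of <m:sty> for each style.
styValue :: TextType -> String
styValue TextNormal     = "p"   -- plain
styValue TextBold       = "b"   -- bold
styValue TextItalic     = "i"   -- italic
styValue TextBoldItalic = "bi"  -- bold italic, instead of falling back to "i"

-- | Render the style property as XML text, for inspection only.
styElement :: TextType -> String
styElement tt = "<m:sty m:val=\"" ++ styValue tt ++ "\"/>"
```

If that assumption holds, bold italic would get its own value rather than sharing "i" with plain italic, which would explain why "z" and "η" lose their boldface in the docx screenshot.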

jgm commented 9 years ago

@nkalvi you have both --mathjax and --mathml. These are not to be used together (pandoc should probably issue a warning or error if you try). Remove --mathml from the command line and include an appropriate link element in your header.

nkalvi commented 9 years ago

Thanks @jgm, since I didn't get any errors/warnings and the output looked 'acceptable' I thought it was allowed.

jgm commented 9 years ago

Moved to jgm/texmath#76