jgm / pandoc

Universal markup converter
https://pandoc.org

Support LaTeX environments in Markdown -> HTML conversion #1938

Closed juliangilbey closed 9 years ago

juliangilbey commented 9 years ago

The following piece of LaTeX-enriched markdown:

This is some math.

\begin{aligned}
x&=1\label{eq:1}\\
y&=2
\end{aligned}

End of math. \eqref{eq:1}

converts beautifully to LaTeX with pandoc -f markdown -t latex. However, when converting to html5, even with the --mathjax option, I can't figure out any way to persuade pandoc to maintain the aligned environment or the \eqref, despite the fact that MathJax can handle these.

Any suggestions?

Thanks!

jgm commented 9 years ago

  1. Math in pandoc needs to be inside $..$ or $$..$$ delimiters. Your example worked for latex/pdf output because pandoc passes through raw tex to these formats (but not to HTML).
  2. Labels and references don't work with pandoc math.
  3. It occurs to me that it might make sense to pass through raw latex environments to HTML in the special case where --mathjax is used. This would solve your problem nicely.

timtylin commented 9 years ago

Actually, surround the \begin{aligned}...\end{aligned} block with $$..$$ delimiters, and surround \eqref{eq:1} with inline $..$ delimiters. Output to HTML with MathJax. Should work.

timtylin commented 9 years ago

Oh, and make sure you enable standalone mode (-s).

nkalvi commented 9 years ago

Very good!

This

This is some math.

$$
\begin{aligned}
x&=1\label{eq:1}\\
y&=2
\end{aligned}
$$

End of math. $\eqref{eq:1}$

with

pandoc math.txt -t html -s -o test.html --mathjax="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"

Displays as this in Safari (as mentioned above, labels don't work)

[Screenshot, 2015-02-11: rendered output in Safari]

timtylin commented 9 years ago

Ah, the (???) is actually a recent MathJax bug.

See: https://github.com/mathjax/MathJax/issues/1020

nkalvi commented 9 years ago

It does work as expected with the following changes (@timtylin, the bug you mentioned is limited to multi-line labels, and there's a workaround for it):

  1. Change aligned to align
  2. Add the MathJax configuration needed for equation numbering to the HTML header

Modified source:

This is some math.

$$
\begin{align}
x&=1\label{eq:1}\\
y&=2
\end{align}
$$

End of math. $\eqref{eq:1}$

Addition to HTML header:

  <script type="text/x-mathjax-config">
    MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "all"} } });
  </script>

Pandoc command:

pandoc math.txt -t html -s -o test.html --mathjax="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" -H mathjax-header-include.txt

Result: [screenshot, 2015-02-11]

juliangilbey commented 9 years ago

Duh. I meant to originally post the following:

This is some math.

\begin{align}
x&=1\label{eq:1}\\
y&=2
\end{align}

End of math. \eqref{eq:1}

since, of course, the aligned environment only works within a $$...$$ section (and this works fine with pandoc). The align environment could be replaced by align*, equation, gather and so on. The align environment does automatic equation numbering, which is very nice, and mathjax can handle this.

juliangilbey commented 9 years ago

(I would reopen this but I don't know how to - would it be better to open a new issue with the correct report?)

nkalvi commented 9 years ago

I'm not sure why you'd want to reopen it. Wouldn't the method suggested above work for you?

jgm commented 9 years ago

I think it might be worth preserving the suggestion I made, that raw LaTeX blocks should be passed through to HTML when --mathjax is used.

nkalvi commented 9 years ago

It looks like that's what Pandoc is doing - am I wrong? Here's the output:

<body>
<p>This is some math.</p>
<p><span class="math">\[
\begin{align}
x&amp;=1\label{eq:1}\\
y&amp;=2
\end{align}
\]</span></p>
<p>End of math. <span class="math">\(\eqref{eq:1}\)</span></p>
</body>

Pardon me if I'm not getting it - I'm quite new to Pandoc and LaTeX. It'd be helpful if you could post what the desired output is.

jgm commented 9 years ago

% pandoc -f markdown -t native -t html --mathjax
\begin{aligned}
x = 1 & y = 2\\
\end{aligned}
^D

(output is empty)

Note: I'm talking about raw latex environments that are not included in $$ delimiters.

nkalvi commented 9 years ago

Thanks for clarifying.

juliangilbey commented 9 years ago

Ah, I see - that does work (putting stuff in $$...$$). It's a shame that the raw environments are not passed through, but it's not a show-stopper. It's also weird that \eqref's have to be placed in $...$ signs.

nkalvi commented 9 years ago

$...$ is 'in line' with the specs :) http://docs.mathjax.org/en/latest/options/tex2jax.html

inlineMath: [['\(','\)']] Array of pairs of strings that are to be used as in-line math delimiters. The first in each pair is the initial delimiter and the second is the terminal delimiter. You can have as many pairs as you want. For example,

inlineMath: [ ['$','$'], ['\(','\)'] ] would cause tex2jax to look for $...$ and \(...\) as delimiters for inline mathematics. (Note that the single dollar signs are not enabled by default because they are used too frequently in normal text, so if you want to use them for math delimiters, you must specify them explicitly.)
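
The delimiter behaviour quoted above can be illustrated with a toy scanner. This is a minimal sketch in Python, not MathJax's actual tex2jax implementation; `find_inline_math`, `DEFAULT`, `WITH_DOLLARS` and the sample sentence are hypothetical names made up for the illustration. It suggests why single dollar signs are off by default: ordinary prices in prose get picked up as math.

```python
import re


def find_inline_math(text, delimiters):
    """Return the source found between each (opening, closing) delimiter pair.

    Toy model only: pairs are scanned one after another with a lazy regex,
    which is much cruder than what a real preprocessor does.
    """
    spans = []
    for opening, closing in delimiters:
        pattern = re.escape(opening) + r"(.+?)" + re.escape(closing)
        spans.extend(match.group(1) for match in re.finditer(pattern, text))
    return spans


# Hypothetical configurations mirroring the two inlineMath settings quoted above.
DEFAULT = [(r"\(", r"\)")]
WITH_DOLLARS = [("$", "$"), (r"\(", r"\)")]

sample = r"Pay $5 now or $10 later; also \(x^2\)."

print(find_inline_math(sample, DEFAULT))
print(find_inline_math(sample, WITH_DOLLARS))
```

With only the default pair, the scanner reports just the `x^2` span. Once `$` is enabled, the text between the two prices is also reported as math, which is the kind of false positive the documentation warns about.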

juliangilbey commented 9 years ago

Yes, but \eqref is a text-mode LaTeX command: it's a reference to an equation number, not a piece of mathematics. :-)

nkalvi commented 9 years ago

Aha, you're right. MathJax doesn't require $...$ for references outside inline math. So rewriting it as follows (since the bare \eqref would be stripped from the HTML output) produces the desired output:

End of math. <span>\\eqref{eq:1}</span>

Now I understand even better jgm's suggestion about passing raw LaTeX blocks.

juliangilbey commented 9 years ago

The problem is actually not solved by $$\begin{equation}...\end{equation}$$ and similar: while this is parsed correctly by mathjax when it is converted to html5, when the markdown is converted to latex, the resulting LaTeX is again $$\begin{equation}...\end{equation}$$, on which LaTeX barfs, as \begin{equation} starts by trying to enter math mode, so LaTeX throws up an error.

So it looks like the only solution is to allow pandoc to pass LaTeX blocks and \eqref{...} etc. to HTML raw when --mathjax is specified. (Perhaps with some other command line parameter to control this behaviour, eg -f markdown+raw_tex?)

bpj commented 9 years ago

One could easily write a filter which recognises the .math class on codeblocks and codespans and converts the code text to a RawBlock or RawInline with the right format label and wraps it in the right <span class="math"> w/o <p> for HTML output.
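
The idea described above can be sketched as a transformation on pandoc's JSON AST. This is a hedged illustration in Python, not bpj's filter (which is written in Perl). It assumes the JSON layout used by newer pandoc versions, where the document is an object with a "blocks" list and elements look like {"t": ..., "c": ...}; the names `make_hook` and `convert` are hypothetical, and the sketch adds no math delimiters.

```python
import html
import json


def make_hook(target):
    """Build a json object_hook that rewrites .math code elements for `target`."""

    def hook(obj):
        tag = obj.get("t")
        if tag not in ("CodeBlock", "Code"):
            return obj
        # Assumed shape: [[identifier, classes, key-value pairs], text]
        (_ident, classes, _keyvals), text = obj["c"]
        if "math" not in classes:
            return obj
        raw_tag = "RawBlock" if tag == "CodeBlock" else "RawInline"
        if target == "latex":
            # LaTeX output: hand the code text over untouched.
            return {"t": raw_tag, "c": ["latex", text]}
        # HTML output: escape the text and wrap it in a math span.
        span = '<span class="math">' + html.escape(text, quote=False) + "</span>"
        return {"t": raw_tag, "c": ["html", span]}

    return hook


def convert(ast_json, target):
    """Parse a JSON AST string and rewrite tagged code elements while decoding."""
    return json.loads(ast_json, object_hook=make_hook(target))


# Small hand-written document standing in for pandoc's JSON output.
doc = json.dumps({
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [
        {"t": "CodeBlock",
         "c": [["", ["math"], []], "\\begin{align}\nx&=1\n\\end{align}"]},
        {"t": "CodeBlock", "c": [["", ["python"], []], "print(1)"]},
    ],
})

print(json.dumps(convert(doc, "html"), indent=1))
```

A real filter would read the AST from standard input, take the target format from its first command-line argument, and write the result to standard output; that wiring is left out here so the transformation itself stays visible. Only code tagged with the math class is touched, so ordinary code blocks survive unchanged.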

nkalvi commented 9 years ago

@juliangilbey Could you please give examples of input and output? When I tried some samples with http://johnmacfarlane.net/pandoc/try/ and http://www.tlhiv.org/ltxpreview/ it seems to work fine.

Markdown input:

$$
 \frac{1}{\displaystyle 1+
   \frac{1}{\displaystyle 2+
   \frac{1}{\displaystyle 3+x}}} +
 \frac{1}{1+\frac{1}{2+\frac{1}{3+x}}}
$$

Output from pandoc:

\[
 \frac{1}{\displaystyle 1+
   \frac{1}{\displaystyle 2+
   \frac{1}{\displaystyle 3+x}}} +
 \frac{1}{1+\frac{1}{2+\frac{1}{3+x}}}
\]

LaTeX preview: [screenshot, 2015-02-13]

bpj commented 9 years ago

The filter I suggested is here:

https://gist.github.com/baf84ac52dd47205e5cb

Requires perl and some (listed) CPAN modules.

jgm commented 9 years ago

+++ Julian Gilbey [Feb 13 15 07:58 ]:

The problem is actually not solved by $$\begin{equation}...\end{equation}$$ and similar: while this is parsed correctly by mathjax when it is converted to html5, when the markdown is converted to latex, the resulting LaTeX is again $$\begin{equation}...\end{equation}$$, on which LaTeX barfs, as \begin{equation} starts by trying to enter math mode, so LaTeX throws up an error.

So it looks like the only solution is to allow pandoc to pass LaTeX blocks and \eqref{...} etc. to HTML raw when --mathjax is specified. (Perhaps with some other command line parameter to control this behaviour, eg -f markdown+raw_tex?)

There is already an extension for raw tex in the markdown reader (it's enabled by default). So all that would be required would be passing through raw tex when output is HTML and --mathjax is used.

This would be a very easy thing to add.

In the meantime, you could write a filter that finds RawInline (Format "latex") and RawBlock (Format "latex") elements and converts them to raw HTML, properly escaped. This too would be easy, and it wouldn't require any changes in pandoc itself.
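
A filter along these lines might look roughly like the following Python sketch. It is an illustration under assumptions, not code from pandoc: it assumes the JSON AST layout of newer pandoc versions (elements of the form {"t": ..., "c": [format, text]} for raw elements), it accepts both "latex" and "tex" as format labels because the label may differ between versions, and `latex_raw_to_html` and `convert` are hypothetical names.

```python
import html
import json


def latex_raw_to_html(obj):
    """object_hook: turn raw LaTeX elements into escaped raw HTML elements."""
    if obj.get("t") in ("RawBlock", "RawInline"):
        fmt, text = obj["c"]
        if fmt in ("latex", "tex"):
            # Escape &, < and > so the TeX source survives as HTML text
            # that MathJax can later pick up in the browser.
            return {"t": obj["t"], "c": ["html", html.escape(text, quote=False)]}
    return obj


def convert(ast_json):
    """Parse a JSON AST string, rewriting raw LaTeX elements while decoding."""
    return json.loads(ast_json, object_hook=latex_raw_to_html)


# Small hand-written document standing in for pandoc's JSON output.
doc = json.dumps({
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [
        {"t": "RawBlock",
         "c": ["latex", "\\begin{align}\nx&=1\\label{eq:1}\n\\end{align}"]},
        {"t": "Para",
         "c": [{"t": "Str", "c": "See"}, {"t": "Space"},
               {"t": "RawInline", "c": ["tex", "\\eqref{eq:1}"]}]},
        {"t": "RawBlock", "c": ["html", "<hr>"]},
    ],
})

print(json.dumps(convert(doc), indent=1))
```

Using `object_hook` keeps the sketch short because the JSON decoder visits every object for us. Raw elements that are already HTML are left alone.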

bpj commented 9 years ago

On 2015-02-13 17:09, BPJ wrote:

One could easily write a filter which recognises the .math class on codeblocks and codespans and converts the code text to a RawBlock or RawInline with the right format label and wraps it in the right <span class="math"> w/o <p> for HTML output.

I did a colossal blooper!

Since I don't do math myself I omitted the LaTeX math delimiters in the first version of my filter! Corrected now:

https://gist.github.com/bpj/baf84ac52dd47205e5cb#file-pandoc-wrap-raw-pl

@jgm wrote:

In the meantime, you could write a filter that finds RawInline (Format "latex") and RawBlock (Format "latex") elements and converts them to raw HTML, properly escaped. This too would be easy, and it wouldn't require any changes in pandoc itself.

I think my approach with tagged 'code' may have its use. For one thing it lets you be selective about which LaTeX Raw* elements you want to include in HTML.

/bpj

benstevens48 commented 9 years ago

Hi,

I've had a look at the code for this fix and I don't think it's quite right. In order for MathJax to interpret the raw latex you are outputting to HTML, the latex needs to be inside math delimiters. So where you have written

blockToHtml opts (RawBlock f str)
  | f == Format "html" = return $ preEscapedString str
  | f == Format "latex" =
      case writerHTMLMathMethod opts of
           MathJax _  -> do modify (\st -> st{ stMath = True })
                            return $ toHtml str

I think it should say

...
return $ toHtml $ "\\[" ++ str ++ "\\]"

and correspondingly for the inline case. Ideally I think they should also be put inside the appropriate html span as when you write a math block.

Sorry if I have misinterpreted your code but I hope what I've said is correct.

Ben
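
The wrapping proposed in this comment can be sketched outside of pandoc as a plain string function. This is a hypothetical Python illustration of the proposal, not the writer's actual code; `wrap_for_mathjax` is an invented name, and the `math` class is taken from the HTML output shown earlier in the thread.

```python
import html


def wrap_for_mathjax(tex, display=True, css_class="math"):
    """Escape raw TeX, add MathJax delimiters and wrap it in a span.

    display=True uses \\[...\\]; display=False uses \\(...\\).
    """
    opening, closing = (r"\[", r"\]") if display else (r"\(", r"\)")
    escaped = html.escape(tex, quote=False)
    return f'<span class="{css_class}">{opening}{escaped}{closing}</span>'


print(wrap_for_mathjax("\\begin{align}\nx&=1\n\\end{align}"))
print(wrap_for_mathjax(r"\eqref{eq:1}", display=False))
```

The escaping step matters because alignment markers such as `&` would otherwise be invalid in HTML text.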

jgm commented 9 years ago

Thanks, this may be correct. This change was mostly intended for things like

\begin{equation}
e = mc^2
\end{equation}

which, in LaTeX, would NOT be placed inside math delimiters ($$..$$ or \[..\]), and for things like \ref{eqn:3}. If you use these in MathJax, do you write the following instead?

\[
\begin{equation}
e = mc^2
\end{equation}
\]

$\ref{eqn:3}$

benstevens48 commented 9 years ago

Yes, I'm pretty sure that MathJax scans the page looking for math delimiters and then processes the stuff inside them, so if it's not inside math delimiters then it will just ignore it. The fact that stuff like the equation environment should not be inside math delimiters in LaTeX is why we were struggling to get both to work, and hence the use of the filter for the workaround!

So, yes, as you said, for mathjax to work, in the html, you write

    \[
    \begin{equation}
    e = mc^2
    \end{equation}
    \]

and

    $\ref{eqn:3}$

or, to be consistent with pandoc's delimiters for mathjax elsewhere,

    \(\ref{eqn:3}\)

I hope this is correct.

Ben

jgm commented 9 years ago

Well, let's confirm that this is correct before continuing, since in LaTeX it wouldn't be correct to do things this way...

benstevens48 commented 9 years ago

Hi,

Sorry, but it seems that actually both methods work with MathJax. So you can mostly ignore everything I said! I do find this sentence in the MathJax getting started guide a bit misleading though: 'Mathematics that is written in TeX or LaTeX format is indicated using math delimiters that surround the mathematics, telling MathJax what part of your page represents mathematics and what is normal text.'

There is a potential issue in that any LaTeX commands such as \newpage that MathJax doesn't recognise will just be left as plain text on the page, whereas if they are inside math delimiters then it is possible to define a \newcommand in the MathJax configuration to deal with this. So it might be better to put the delimiters in as it gives more flexibility, but I'm not sure what the MathJax official best practice is.

Sorry for not fully checking this earlier!

Ben

Thell commented 9 years ago

@juliangilbey This issue has come up before, multiple times. Early last year I ran into it and got pretty much the same response regarding a filter and such. The result was similar to yours... :disappointed: So to scratch my itch a patch was submitted that didn't alter the behavior of any of the targets except latex (since the latex target is where the problem exists) without any side-effects.

If you don't mind patching yourself, I've been using it over a year now with great results.

Essentially all it does is strip the $$ or \[ tokens from latex math environments when the target is latex; so just surrounding your latex math environment with the mathjax tokens makes all the targets happy. :smile:

[update] The tex-math-consume-escapes branch has been rebased onto the latest pandoc master. If desired a pull request can be submitted.
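
The stripping behaviour described here can be approximated by a small string function. This Python sketch is only an illustration of the idea, not the patch itself (which lives in pandoc's Haskell code); `display_math_to_latex` is a hypothetical name, and the list of environments is an assumption about which ones enter math mode on their own.

```python
import re

# Assumed list of environments that start math mode themselves.
# "aligned" is deliberately absent: it only works inside math mode.
ENV_START = re.compile(
    r"\\begin\{(equation|align|gather|multline|flalign|eqnarray)\*?\}"
)


def display_math_to_latex(content):
    """Render the body of a $$...$$ block for LaTeX output.

    If the body is itself a math environment, emit it bare; otherwise
    keep it inside \\[...\\] as usual.
    """
    body = content.strip()
    if ENV_START.match(body):
        return body
    return "\\[" + body + "\\]"


print(display_math_to_latex("\n\\begin{align}\nx&=1\\label{eq:1}\n\\end{align}\n"))
print(display_math_to_latex("\\begin{aligned}x&=1\\end{aligned}"))
print(display_math_to_latex("e = mc^2"))
```

With a rule like this, the same Markdown source can keep its `$$` delimiters for HTML and MathJax while the LaTeX target receives a bare environment.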

mseri commented 9 years ago

+1 for PR

diazona commented 9 years ago

@Thell I'd also like to see this pulled into pandoc proper

Thell commented 9 years ago

@mseri and @diazona we'll need to see what @jgm wants. There are currently quite a few outstanding issues and pull requests and the latest release does at least allow passage of raw blocks (which helps with html/latex targets) so I'm guessing it will be a while unless we can come up with a non edge-case usage example.

rreece commented 6 years ago

With Pandoc 2.2, I'm still having this issue. Naked math latex environments do not make it to the html from pandoc. Note that in order to be processed by mathjax properly, the equation, align, ... environments would need to be wrapped in

<p><span class="math display">
...
</span></p>

Any further advice on how to produce proper html and latex from the same markdown? How should one mark up equations to support both outputs?

jgm commented 6 years ago

@rreece - please give a specific example of a math environment that isn't properly handled (full instructions for how to reproduce the issue). And probably better to open a new issue, referring to this one, since this one is closed.

rreece commented 6 years ago

Thanks for the reply @jgm! I've submitted a new issue: #4640.