jgm / pandoc

Universal markup converter
https://pandoc.org

Support for mhchem package for LaTeX to docx #6668

Open TonyWu20 opened 4 years ago

TonyWu20 commented 4 years ago

Currently pandoc (2.10.1) still does not correctly parse the \ce{} command when converting LaTeX to docx. For example, if I type \ce{CO2} in my tex file and then convert it to docx using pandoc, the resulting docx file omits the CO2 entirely, leaving a blank space there.

I found that no one has mentioned mhchem support here before. As someone who relies heavily on LaTeX to write about chemistry, I hope pandoc can support it soon. Thanks!

tarleb commented 4 years ago

Pandoc cannot support every LaTeX package out there, and I'm afraid that this will be one of the cases where we have to draw the line. But you could use a filter to parse and replace the mhchem statements. See also https://stackoverflow.com/q/56387990/.

tarleb commented 4 years ago

Additional resource: related thread on the mailing list.

mhchem commented 4 years ago

I am the inventor and author of mhchem. Let's discuss how I could be of help here. I wrote the LaTeX package and a JavaScript/TypeScript parser. But I don't know any Haskell, yet. My understanding is that transpilation of JavaScript to Haskell is not possible. Correct? It looks like pandoc should stay 100% Haskell. Correct? So, what would be needed here?

I could try to learn Haskell and create a Haskell function that takes an mhchem input string and returns the equivalent LaTeX. (This would be a long-term project on my side. I cannot do this during a normal work week. And this is subject to me liking Haskell. I don't want to ruin my vacation by doing something I don't like.)

The pandoc team would need to create all the wrapping: detection of \ce, etc. I assume pandoc's parsing is left-to-right, so my function would be called with the remainder of the document after \ce? (If that's not the case, we would need to figure out how pandoc could determine the proper length of the argument, because it does not follow TeX rules 100%. It can contain $ pairs, for instance.) \ce can be used recursively, so the wrapping code should expect my function to return further \ce commands.

What do you think? Are my assumptions correct? Could this work this way?

tarleb commented 4 years ago

Hello Martin, thanks for chiming in! I believe that you are correct, and that there is no transpilation from JS to Haskell; and yes, this would have to be coded in Haskell to be shipped with pandoc (but see below).

So, what would be needed here?

The most useful would be a module like SIunitx. It parses LaTeX commands into pandoc's internal document representation.

And this is subject to me liking Haskell. I don't want to ruin my vacation by doing something I don't like.

I very much like your way of thinking :relaxed:

If Haskell turns out not to be your thing, there are two alternatives. One would be to use Lua. The language shares a lot of concepts with JavaScript, so you'd probably be productive in no time. The idea there would be that pandoc can be instructed to keep those LaTeX commands which it doesn't know how to handle, so we can parse them later. The Lua script could then do the parsing and translating, passing the result back to pandoc. Pandoc includes a Lua interpreter, so running the extension would be possible for everyone with a working pandoc installation.

The second alternative is probably the easiest: do the same as described above, but use JavaScript to do the processing. The disadvantages are only that node would be required, plus the performance impact of having to pass the document to and from JS by serializing to JSON. But at least the latter point shouldn't matter too much. The approach would only leave the challenge of having to translate the parsed state into pandoc's internal format (or directly into specific output formats).

Possible problems with these approaches could stem from pandoc mis-parsing the mhchem commands as a whole, e.g., creating multiple chunks out of something that's really just one command. Not entirely sure how likely that would be.

mhchem commented 4 years ago

Hi Albert. Thanks for explaining the three approaches. I guess they would work for the easy chemical formulas.

But when I think about the more complex ones, I don't see these approaches fitting. We have \ce inside (LaTeX) math ($V_{\ce{H2O}}$ = volume of water), we have math inside \ce ($\ce{A ->[$x$][$x_i$] B}$), and the two can be nested quite deeply (e.g. a current question on Chemistry StackExchange).

So, parsing mhchem as a last step wouldn't work, because it could contain further math.

Also, parsing directly into pandoc's internal document representation can only work if it has a "This is LaTeX which needs another parser run" object. And if the internal representation could be nested (e.g. as a subscript inside a LaTeX expression).

Would it work if the mhchem parser returns a string with LaTeX syntax? I might be biased here, because my other mhchem implementations work this way. But I don't see the other approaches working nicely for things like $m_{\ce{NO_$x$}}$ (=$m_{\ce{NO_x}}$) where NO_ is translated by mhchem while the outer $m_$ and the inner $x$ are handled by the LaTeX parser.

jgm commented 4 years ago

If you can put in your own macro definition for \ce (and it's a straightforward LaTeX macro, not one with tex primitives), then pandoc should be able to interpret it while parsing LaTeX. That could avoid the whole issue.

mhchem commented 4 years ago

The inner workings of \ce are not what you would call straightforward LaTeX, I guess. It is LaTeX3 syntax, but it does some fundamental things like working with catcodes. (A lot of this complicated stuff is needed to switch off the default LaTeX behavior, so that another syntax can be implemented; see mhchem.sty.) So, it would be much easier if you could extend your LaTeX parser: Whenever it finds a \ce it would call mhchem's parser function (which returns a LaTeX string). Think of it as a simple string replacement. After that, you continue parsing the LaTeX (including the replaced part) as before.
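
The "simple string replacement" idea can be sketched in a few lines of plain Lua. This is only an illustration: `expand_ce` and its `translate` callback are hypothetical names, `translate` stands in for a real mhchem parser, and the `%b{}` pattern assumes plain brace balancing, which (as noted earlier in the thread) is not exactly how mhchem delimits its argument.

```lua
-- Repeatedly replace \ce{...} by the output of `translate`.
-- `translate` is a caller-supplied stand-in for a real mhchem parser; its
-- output may contain further \ce commands, so we loop until none are left.
function expand_ce(source, translate, max_passes)
  max_passes = max_passes or 10 -- guard against endless re-expansion
  for _ = 1, max_passes do
    local expanded, count = source:gsub("\\ce(%b{})", function(braced)
      -- strip the outer braces before handing the formula to the parser
      return translate(braced:sub(2, -2))
    end)
    if count == 0 then
      return expanded
    end
    source = expanded
  end
  return source
end
```

With a toy `translate` that wraps its input in `\mathrm{}`, `expand_ce("m_{\\ce{NO}}", translate)` yields `m_{\mathrm{NO}}`; the surrounding math is left for the regular LaTeX parser.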

jgm commented 4 years ago

Whenever it finds a \ce it would call mhchem's parser function (which returns a LaTeX string)

No, we're not going to modify pandoc so that it shells out to a JavaScript/TypeScript executable.

You should be able to use a Lua filter, though, to do this. The Lua filter would match on Math elements (or RawInline (Format "latex"), if these things occur outside of math mode). It could then pipe the content through your program and reinsert the result.
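
A minimal sketch of such a filter, under stated assumptions: `mhchem2latex` is a hypothetical executable (no such tool is named in this thread) that reads LaTeX containing \ce{...} on stdin and writes plain LaTeX to stdout. `has_ce` and `convert` are illustrative helper names; `pandoc.pipe` and `pandoc.Math` are part of pandoc's Lua API.

```lua
-- Sketch only: CONVERTER is a hypothetical executable, not a real tool.
local CONVERTER = "mhchem2latex"

-- True if the LaTeX snippet contains a \ce{...} command.
function has_ce(text)
  return text:find("\\ce%s*%b{}") ~= nil
end

-- Pipe text through the external converter (needs pandoc's Lua API).
function convert(text)
  return pandoc.pipe(CONVERTER, {}, text)
end

-- \ce inside math: rewrite the math string and keep the element.
function Math(el)
  if has_ce(el.text) then
    el.text = convert(el.text)
    return el
  end
end

-- \ce outside math arrives as raw LaTeX (with -f latex+raw_tex).
function RawInline(el)
  if (el.format == "latex" or el.format == "tex") and has_ce(el.text) then
    -- Assumption: text-mode formulas are rendered as inline math.
    return pandoc.Math("InlineMath", convert(el.text))
  end
end
```

Returning nothing from a filter function leaves the element unchanged, so anything without a \ce command passes through untouched.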

mhchem commented 4 years ago

No, we're not going to modify pandoc so that it shells out to a JavaScript/TypeScript executable.

Nobody suggested that. I am discussing how an mhchem parser in Haskell could be integrated in the pandoc's scanning process of LaTeX code.

mhchem commented 4 years ago

You should be able to use a Lua filter, though, to do this. The Lua filter would match on Math elements (or RawInline (Format "latex"), if these things occur outside of math mode). It could then pipe the content through your program and reinsert the result.

That sounds good. I just learned (by reading https://pandoc.org/lua-filters.html) that pandoc does not parse (La)TeX in the first run, but retains the whole string as pandoc.Math (or similar). Yes, my filter could modify this string and hand it back. I'll follow up on this.

(So, pandoc parses this math string later on? I guess so, otherwise you would not be able to convert this to docx etc, would you?)

jgm commented 4 years ago

I am discussing how an mhchem parser in Haskell could be integrated in the pandoc's scanning process of LaTeX code.

Sorry, I missed that.

In principle, we could do that, but I'm a little hesitant. We try to handle commonly used LaTeX packages, but we can't make our ambition that of supporting everything, or it will balloon out of all proportion.

If the needed code is fairly compact, I might consider it. If it's a lot, I'd be more inclined to say: people who need this can use a simple Lua filter and shell out to your existing program.

jgm commented 4 years ago

(So, pandoc parses this math string later on? I guess so, otherwise you would not be able to convert this to docx etc, would you?)

Exactly, that is done by the jgm/texmath library, which converts between tex math and several other formats (Word equations, MathML, roff eqn). The library has some limitations, so you'd need to make sure that the output of your script can be processed by it. This may not be the case if it uses a lot of lower-level tex.

mhchem commented 4 years ago

We try to handle commonly used LaTeX packages, but we can't make our ambition that of supporting everything

I don't want to brag, but mhchem might qualify as a commonly used LaTeX package. I just looked up some numbers: Chemistry StackExchange has 23724 posts using \ce. TeX StackExchange has 529 users with posts containing {mhchem} (for comparison: 1899 users with {siunitx}).

I'll take a closer look at the Lua approach. Thanks.

hubgit commented 3 years ago

@mhchem You may find this pandoc filter helpful as an example - it's used for converting equations from LaTeX to SVG using MathJax, all in JavaScript/TypeScript.

In this case the input would be mhchem syntax and the output would be LaTeX, but the process should be similar.

tarleb commented 3 years ago

If I understand the discussion correctly, then full support would require changes which are forbiddingly high effort. There would have to be an mhchem equivalent to texmath. However, a filter seems to be an adequate if imperfect solution.

It seems that there is currently not much we can do, thus closing. Let me know if my analysis was incorrect.

jdpipe commented 2 years ago

Let me just say that mhchem is very widely used in a large number of scientific disciplines, including engineering, chemistry, physics, geology, biology, ecology and environmental science. It is really not correct to think of it as a marginal, rarely used LaTeX package.

I wanted to mention that KaTeX has implemented filters for mhchem which perhaps might be useful in the efforts here. https://katex.org/docs/node.html#using-mhchem-extension

jgm commented 2 years ago

Point taken about its wide usage. However, as tarleb notes above, it would be a lot of work to implement the macros. If some mhchem user who knows Haskell wants to do it, we can talk! (Rudimentary support would not be too hard, e.g. for things like \ce{O2}, but complete support is quite involved.)

jgm commented 2 years ago

The most basic thing to support would be

\ce{H2O}
\ce{Sb2O3}

Note that these can be used both in text mode and in math mode. That's already a bit tricky. Implementing it in math mode would require either (a) adding support for mhchem to texmath, or (b) expanding \ce macros to something texmath can handle. The second option is preferable. But we can't use the same expansion for both math mode and text mode, since the LaTeX for subscripts is different in these modes.

One practical approach would be to generate math from every \ce macro, even if it's not in math mode; this would simplify pandoc's task (since we'd always be expanding to math mode), at the price of generating slightly different output: on this approach, $\ce{H2O}$ and \ce{H2O} would not render differently.
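
For the two basic examples above, the "always expand to math" idea might look like the following plain-Lua sketch. `simple_formula_to_math` is a hypothetical helper name, and the use of \mathrm for element symbols is an assumption; only element symbols followed by optional numeric subscripts are handled.

```lua
-- Expand a simple formula such as "H2O" or "Sb2O3" into math-mode LaTeX.
-- Returns nil for anything beyond element symbols and numeric subscripts.
function simple_formula_to_math(formula)
  local parts = {}
  local pos = 1
  while pos <= #formula do
    -- an element is one uppercase letter plus optional lowercase letters,
    -- followed by an optional run of digits; () captures the next position
    local element, count, next_pos = formula:match("^(%u%l*)(%d*)()", pos)
    if not element then
      return nil
    end
    local piece = "\\mathrm{" .. element .. "}"
    if count ~= "" then
      piece = piece .. "_{" .. count .. "}"
    end
    parts[#parts + 1] = piece
    pos = next_pos
  end
  if #parts == 0 then
    return nil
  end
  return table.concat(parts)
end
```

The same expansion would then be used whether the \ce command appeared in text or inside `$...$`, which is exactly why the two cases stop rendering differently.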

jdpipe commented 2 years ago

At the moment, there is a terrible lack of functional tex-to-word or tex-to-libreoffice filters. Anything moderately functional, even if hacky, is so much better than nothing at all. Just pass the \ce{...} content through as plain text for a start! The syntax is basically human readable. It is understood that hand-editing is unavoidable. Do you have a systematic way of highlighting/flagging 'imperfectly translated' content, perhaps? That would make the hand-editing process much easier.

FYI my best alternative at the moment is to create a PDF, then upload it to an Adobe website, download a Word document, then load it into LibreOffice. The results are terrible. Nearly anything would be an improvement. Perfection is not necessary.

jgm commented 2 years ago

Just pass the \ce{...} content through as plain text for a start!

That's very easy to achieve with a small Lua filter.

% cat mhchem.lua
function RawInline(el)
  if (el.format == 'latex' or el.format == 'tex') and
      el.text:match("\\ce{") then
    -- strip off \\ce{ and }
    local inner = el.text:sub(5, -2)
    return pandoc.Str(inner)
  end
end

% pandoc -L mhchem.lua -f latex+raw_tex
I used 2 grams of \ce{CO2}.
^D
<p>I used 2 grams of CO2.</p>

(Note: this won't cover \ce commands that occur in math mode. To cover those as well, you could add function Math(el) to the filter and replace the contents of the \ce command with something like the above, surrounded by \text{}.)
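
A rough sketch of that math-mode addition, assuming plain formulas only (characters such as `_` or `^` inside the formula would not survive \text{} in real LaTeX). `ce_to_text` is an illustrative helper name.

```lua
-- Replace every \ce{...} inside a math string by \text{...},
-- keeping the formula's source text verbatim.
function ce_to_text(tex)
  local replaced = tex:gsub("\\ce(%b{})", function(braced)
    -- `braced` still includes its outer braces, e.g. "{CO2}"
    return "\\text" .. braced
  end)
  return replaced
end

function Math(el)
  el.text = ce_to_text(el.text)
  return el
end
```

For example, `$\ce{CO2} + \ce{H2O}$` becomes `$\text{CO2} + \text{H2O}$`, which drops the subscripts but keeps the formula readable.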

jgm commented 2 years ago

Actually, since you can use lpeg in writing Lua filters, it might be fun to write a little lpeg grammar for mhchem; then the filter could be fairly fully featured, including subscripts and superscripts and bonds and the like. Occurrences in math mode could be handled in the way suggested above.

jgm commented 2 years ago

Here's a start on a more sophisticated filter that uses a grammar:

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local number = R"09"^1 * (P"." * R"09"^1)^-1
local fraction = number * "/" * number
local symbol = C(S"()[]") + (P"\\" * C(S"{}"))

local thinspace = utf8.char(0x2009)

Mhchem = P{ "Formula",
  Formula = Ct(( V"Molecule"
               + V"Math"
               + V"Sup"
               + V"Sub"
               + V"Number"
               + V"Letter"
               + V"Symbol"
               + whitespacechar^1
               )^0) * P(-1);
  Molecule = V"MoleculePart"^1 ;
  MoleculePart = V"Element" * V"ElementSub"^-1 ;
  Element = C(R"AZ" * R"az"^0) / pandoc.Str ;
  ElementSub = C(R"09"^1) / pandoc.Str / pandoc.Subscript ;
  Letter = R"az" / pandoc.Str ;
  Number = fraction + C(number) /
    function(s) return pandoc.Str(s .. thinspace) end ;
  Sup = ((P"^" * (V"InBraces" + C(R"09"^0 * S"+-"^-1)))
         + C(S"+-"))
  / pandoc.Str / pandoc.Superscript ;
  Sub = (P"_" * (V"InBraces" + C(R"09"^0 * S"+-"^-1))) /
     pandoc.Str / pandoc.Subscript ;
  Math = P"$" * C((P(1) - P"$")^1) * P"$" /
    function(s) return pandoc.Math("InlineMath", s) end ;
  Symbol = symbol / pandoc.Str;
  InBraces = P"{" * C(((P(1) - P"}") + V"InBraces")^0) * P"}"
  }

function handleCe(s)
  local inner = s:sub(5,-2) -- strip off \ce{ and }
  local result = lpeg.match(Mhchem, inner)
  if not result then
    io.stderr:write("Could not parse mhchem formula " .. inner .. "\n")
  end
  return result
end

function RawInline(el)
  if (el.format == "latex" or el.format == "tex") and
      el.text:match("\\ce{") then
    return handleCe(el.text)
  end
end

function RawBlock(el)
  if (el.format == "latex" or el.format == "tex") and
      el.text:match("\\ce{") then
    local ils = handleCe(el.text)
    if ils then
      return pandoc.Para(ils)
    end
  end
end

Example of use:

% pandoc -L mhchem.lua -f latex+raw_tex -t html
\ce{2H2O2} and \ce{H+} and  \ce{^{227}_{90}Th+}
^D
<p>2 H<sub>2</sub>O<sub>2</sub> and H<sup>+</sup> and <sup>227</sup><sub>90</sub>Th<sup>+</sup></p>

mhchem commented 2 years ago

The approach above looks interesting. Could you help me understand what it does and when? I understand that this filter is called after the document has been parsed a first time, to distinguish text (function RawInline(el), function RawBlock(el)) and math mode (function Math(el)).

I'm not sure LPeg would be the way to go. I don't know it and just had a quick glance at the documentation. I am impressed by its compactness, but I feel that 1500 lines of TypeScript do not easily fit that grammar in a way that fits into a (rather: my) brain.

jgm commented 2 years ago

This is a transformation of the AST generated by the LaTeX parser. That AST contains some RawInline and RawBlock elements -- basically for bits of LaTeX that pandoc didn't understand how to convert into other elements, like \ce{..}. So what this filter does is replace each RawInline containing a \ce{} command with a list of Inlines that will render it appropriately in any output format (Str, Subscript, etc. ... but also, sometimes, Math ... there is no need to parse this when its contents are math that pandoc can handle, as in $x$ in the mhchem manual).

Currently this filter doesn't do anything to handle \ce inside math mode, e.g. $\ce{CO2} + \ce{H2O}$. When you convert to docx, you'll get this literal string because pandoc won't be able to convert the equation. In principle, this could be fixed too, but it's harder because currently the Lua API doesn't expose the writers. If it did, we could do the same transformation we do in text mode, then render the result as LaTeX, enclose this in a \text{..} command, and insert it back into the math context.

jgm commented 2 years ago

I suppose that we could do the following to handle math mode. Instead of hard-coding the actions in the grammar

  ElementSub = C(R"09"^1) / pandoc.Str / pandoc.Subscript ;

we could use a table,

  ElementSub = C(R"09"^1) / render.str / render.subscript ;

We could make render a parameter of Mhchem. When we're in text mode, we'd pass in a version of the table with instructions that make sense for text mode, e.g.

renderText = {
  str = pandoc.Str,
  subscript = pandoc.Subscript
}

And in math mode, we'd pass in a different table:

renderMath = {
  str = escapeTeX,
  subscript = function(x) return ("_{" .. x .. "}") end
}
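
To make the idea concrete without pulling in lpeg, here is a plain-Lua sketch in which a tiny loop stands in for the grammar. `render_formula` is a hypothetical helper, and the escaping inside `renderMath.str` is an assumption about what escapeTeX would do; `renderText` is only defined when the code runs inside pandoc.

```lua
-- Math-mode renderer: every action returns a LaTeX string.
renderMath = {
  str = function(s) return (s:gsub("([{}%%])", "\\%1")) end,
  subscript = function(s) return "_{" .. s .. "}" end,
}

-- Text-mode renderer: only available inside pandoc, where the global
-- `pandoc` module exists; nil in a standalone Lua interpreter.
renderText = pandoc and {
  str = pandoc.Str,
  subscript = pandoc.Subscript,
} or nil

-- Hypothetical stand-in for the grammar: walks "H2O"-style input and
-- calls the renderer for each element symbol and each run of digits.
function render_formula(formula, render)
  local out = {}
  for element, digits in formula:gmatch("(%u%l*)(%d*)") do
    out[#out + 1] = render.str(element)
    if digits ~= "" then
      out[#out + 1] = render.subscript(render.str(digits))
    end
  end
  return out
end
```

With `renderMath` the pieces are strings and can be joined with `table.concat`; with `renderText` the same walk would produce a list of pandoc inlines instead.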

EDIT: another option would be to use math mode for all the \ce commands. Not perfect but perhaps good enough. We're going to need to use Math anyway for things like the stacked numbers in isotopes or the arrows with text over them; there's no other way to represent this in the pandoc AST. I now think that's the best approach.

jgm commented 2 years ago

Another way to go would be to write a JSON filter in JavaScript/TypeScript. This filter could use your parser to convert the \ce commands into regular LaTeX, then use pandoc.read to parse this into a native pandoc AST (in text mode) or just splice in the LaTeX (in math mode). There is a JavaScript (node) library for writing pandoc JSON filters here: https://github.com/mathematic-inc/node-pandoc-filter

[EDIT: scrubbed this idea because pandoc.read is available to Lua filters, but not JSON filters. You could shell out to pandoc for the conversion, but this would be inefficient.]

jgm commented 2 years ago

I think I see how to do this now. I'll try to produce a version of this filter that gives decent results on most of your test cases, and then I'll link to it.

jdpipe commented 2 years ago

So pleased to see the progress on this! :-)

mhchem commented 2 years ago

I just re-read @hubgit 's post above. Hmm, if we already have a MathJax filter, why don't we use that? MathJax has perfect mhchem support.

jgm commented 2 years ago

You could indeed use a filter that uses MathJax to produce SVGs and then includes the SVGs in the document. But that means all your math and chemical formulas turn into images. Wouldn't you rather have the math and chemical formulas be native Word equations (in docx) or MathML (in DocBook) or eqn (in ms)?

jgm commented 2 years ago

@mhchem the manual says:

mhchem tries to differentiate whether \ce{-} should be a bond, a charge or a hyphen.

Under what conditions is it each of these things?

jgm commented 2 years ago

Here is the latest version of the filter. This handles around 70% of the examples in pp. 4-12 of the mhchem manual. To use it, save this as mhchem.lua and do pandoc -L mhchem.lua.

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local number = (R"09"^1 * (P"." * R"09"^1)^-1)
local symbol = C(S"()[],") + (P"\\" * C(S"{}"))

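local function escapeTeX(x)
  -- Escape backslashes first so the escapes added below are not doubled.
  -- In a gsub replacement string a literal % must be written as %%.
  local escaped = x:gsub("\\", "\\\\")
                   :gsub("%%", "\\%%")
                   :gsub("([{}])", "\\%1")
  return escaped
end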

local arrows = {
  ["->"] = "\\longrightarrow",
  ["<-"] = "\\longleftarrow",
  ["<->"] = "\\longleftrightarrow",
  ["<-->"] = "\\longleftarrow\\longrightarrow",
  ["<=>"] = "\\rightleftharpoons",
  ["<=>>"] = "\\longRightleftharpoons",
  ["<<=>"] = "\\longLeftrightharpoons"
}

local bonds = {
  ["-"] = "{-}",
  ["="] = "{=}",
  ["#"] = "{\\equiv}",
  ["1"] = "{-}",
  ["2"] = "{=}",
  ["3"] = "{\\equiv}",
  ["..."] = "{\\cdot}{\\cdot}{\\cdot}",
  ["->"] = "{\\rightarrow}",
  ["<-"] = "{\\leftarrow}"
}

-- math mode renderer
local render =
  { str = function(x)
      if #x > 0 then
        return "\\text{" .. escapeTeX(x) .. "}"
      else
        return ""
      end
    end,
    element = function(x) return "\\mathrm{" .. escapeTeX(x) .. "}" end,
    superscript = function(x) return "^{" .. x .. "}" end,
    subscript = function(x) return "_{" .. x .. "}" end,
    number = function(x) return x end,
    math = function(x) return x end,
    fraction = function(n,d) return "\\frac{" .. n .. "}{" .. d .. "}" end,
    fractionparens = function(n,d) return "(" .. n .. "/" .. d .. ")" end,
    greek = function(x) return "\\mathrm{" .. x .. "}" end,
    arrow = function(arr, above, below)
      local result = arrows[arr]
      if above then
        result = "\\overset{" .. above .. "}{" .. result .. "}"
      end
      if below then
        result = "\\underset{" .. below .. "}{" .. result .. "}"
      end
      return result
    end,
    precipitate = function() return "\\downarrow " end,
    gas = function() return "\\uparrow " end,
    bond = function(s) return bonds[s] or s end,
    circa = function() return "{\\sim}" end
  }

Mhchem = P{ "Formula",
  Formula = Ct( V"FormulaPart"^0 ) * P(-1) / table.concat;
  FormulaPart =  V"Molecule"
               + V"ReactionArrow"
               + V"Bond"
               + V"Sup"
               + V"Sub"
               + V"Charge"
               + V"Fraction"
               + V"Number"
               + V"Math"
               + V"Precipitate"
               + V"Gas"
               + V"Letters"
               + V"GreekLetter"
               + V"Text"
               + V"EquationOp"
               + V"Space"
               + V"Circa"
               + V"Symbol" ;

  Molecule = V"StoichiometricNumber"^-1 * V"MoleculePart"^1 ;
  MoleculePart = V"Element" * V"ElementSub"^-1 ;
  StoichiometricNumber = (V"Number" + C(R"az") + V"Math" + V"Fraction") *
                          Cc("\\;") * whitespacechar^0 ;
  Element = C(R"AZ" * R"az"^0) / render.element ;
  Charge = B(R"AZ" + R"az" + S")]}") * C(S"+-") * #-R"AZ" /
    render.str / render.superscript ;
  ElementSub = C(R"09"^1) / render.str / render.subscript ;
  Precipitate = whitespacechar^0 * (P"(v)" + P"v") * whitespacechar^0 /
    render.precipitate ;
  Gas = whitespacechar^0 * (P"(^)" + P"^") * whitespacechar^0 /
    render.gas ;
  Bond = (C(S"#=-") * #R"AZ" / render.bond) +
         (P"\\bond{" * C((P(1) - P"}")^0) * P"}" / render.bond) ;
  Letters = R"az"^1 / render.str ;
  Number = C(number) / render.number;
  NumberOrLetter = V"Number" + V"Letters" ;
  Fraction = (P"(" * V"NumberOrLetter"^1 * P"/" * V"NumberOrLetter"^1 * P")"
              / render.fractionparens) +
             (V"NumberOrLetter" * P"/" * V"NumberOrLetter" / render.fraction);
  Sup = P"^" * (V"InBraces" + (C(S"+-"^-1 * R"09"^0 * S"+-"^-1) / render.str)) /
    render.superscript ;
  Sub = P"_" * (V"InBraces" + (C(S"+-"^-1 * R"09"^0 * S"+-"^-1) / render.str)) /
    render.subscript ;
  Math = P"$" * Cs((V"MathPart" + V"CEPart")^1) * P"$" / render.math ;
  MathPart = C((P(1) - (P"$" + V"CEPart"))^1) ;
  CEPart = P"\\ce{" * Ct((V"FormulaPart" - P"}")^0) * P"}" / table.concat ;
  GreekLetter = C(P"\\" *
    (( P"alpha" + P"beta" + P"gamma" + P"delta" + P"epsilon" +
      P"zeta" + P"eta" + P"theta" + P"iota" + P"kappa" + P"lambda" +
      P"mu" + P"nu" + P"xi" + P"omicron" + P"pi" + P"rho" + P"sigma" +
      P"tau" + P"upsilon" + P"phi" + P"chi" + P"psi" + P"omega"
     ) +
    (( P"Alpha" + P"Beta" + P"Gamma" + P"Delta" + P"Epsilon" +
      P"Zeta" + P"Eta" + P"Theta" + P"Iota" + P"Kappa" + P"Lambda" +
      P"Mu" + P"Nu" + P"Xi" + P"Omicron" + P"Pi" + P"Rho" + P"Sigma" +
      P"Tau" + P"Upsilon" + P"Phi" + P"Chi" + P"Psi" + P"Omega" )))) *
      whitespacechar^0 / render.greek ;
  EquationOp = whitespacechar^0 *
      C(P"+" + P"-" + P"=" + (P"\\pm")) *
      whitespacechar^0 /
      render.math;
  ReactionArrow =
    whitespacechar^0 *
    C(P"->" +
      P"<-->" +
      P"<->" +
      P"<-" +
      P"<=>>" +
      P"<=>" +
      P"<<=>") *
      (P"[" * Cs((V"FormulaPart" - P"]")^0) * P"]")^-2 *
      whitespacechar^0 / render.arrow ;
  Text = V"InBraces" ;
  Circa = P"\\ca" * whitespacechar^0 / render.circa ;
  Space = C(whitespacechar^1) / "~" ;
  Symbol = symbol / render.str;
  InBraces = P"{" * Ct((((V"FormulaPart" - S"{}")^1) + V"InBraces")^0) * P"}" /
    table.concat
  }

function handleCe(s)
  local inner = s:sub(5,-2) -- strip off \ce{ and }
  local result = lpeg.match(Mhchem, inner)
  if not result then
    io.stderr:write("Could not parse mhchem formula " .. inner .. "\n")
    return s
  end
  return result
end

function RawInline(el)
  if (el.format == "latex" or el.format == "tex") and
      el.text:match("\\ce{") then
    -- handleCe returns its input unchanged if the formula cannot be parsed
    local result = handleCe(el.text)
    if result then
      return pandoc.Math("InlineMath", result)
    end
  end
end

function RawBlock(el)
  local il = RawInline(el)
  if il then
    return pandoc.Para(il)
   end
end

function Math(el)
  -- expand every \ce{...} (with balanced braces) inside the math string
  el.text = string.gsub(el.text, "(\\ce%b{})", handleCe)
  return el
end

jgm commented 2 years ago

I've improved this more and added it to the pandoc/lua-filters repository: https://github.com/pandoc/lua-filters/tree/master/mhchem

There's a sample there that shows how the manual's examples render in docx: test.docx.

As you can see, there are a few that don't convert well (due to lack of support in texmath for the symbols used), and there are some minor infelicities, but it's much better than no support!

tarleb commented 2 years ago

From a discussion I had on this topic: maybe we could use the mhchem MathJax plugin to convert to MathML, then read that back into pandoc. This should (theoretically) result in a fairly good conversion.

jiucenglou commented 1 year ago

I've improved this more and added it to the pandoc/lua-filters repository: https://github.com/pandoc/lua-filters/tree/master/mhchem

There's a sample there that shows how the manual's examples render in docx: test.docx.

As you can see, there are a few that don't convert well (due to lack of support in texmath for the symbols used), and there are some minor infelicities, but it's much better than no support!

diff -r mhchem.lua D:\Downloads\lua-filters-master\mhchem\mhchem.lua
131c131
<   Sup = P"^" * (V"InBracesSuper" + V"SuperscriptedRadical" +    -- so that we can write \ce{O2^.-} instead of \ce{O2^{.-}} for superoxide anion 1/2
---
>   Sup = P"^" * (V"InBracesSuper" +
175,176c175
<                 * P"}" / table.concat ;
<   SuperscriptedRadical = Ct((P"." / "\\bullet ") * C(S"+-"^-1)) / table.concat ;    -- so that we can write \ce{O2^.-} instead of \ce{O2^{.-}} for superoxide anion 2/2
---
>                 * P"}" / table.concat

@jgm Could you add the change above so that we can write \ce{O2^.-} instead of \ce{O2^{.-}} for superoxide anion ? :D

jgm commented 1 year ago

@jiucenglou why don't you submit a pull request at https://github.com/pandoc/lua-filters

jiucenglou commented 1 year ago

@jiucenglou why don't you submit a pull request at https://github.com/pandoc/lua-filters

I will try a pull request :D Many thanks for your efforts in developing this filter !

TonyWu20 commented 1 year ago

I've improved this more and added it to the pandoc/lua-filters repository: https://github.com/pandoc/lua-filters/tree/master/mhchem

There's a sample there that shows how the manual's examples render in docx: test.docx.

As you can see, there are a few that don't convert well (due to lack of support in texmath for the symbols used), and there are some minor infelicities, but it's much better than no support!

Thank you very much for the filter! I ran the test.txt but it shows:

[WARNING] Could not convert TeX math {\mathrm{A}}\longRightleftharpoons{\mathrm{B}}, rendering as TeX:
  ngRightleftharpoons{\mathrm{B}}
                     ^
  unexpected control sequence \longRightleftharpoons
  expecting "%", "\\label", "\\tag", "\\nonumber" or whitespace
[WARNING] Could not convert TeX math {\mathrm{A}}\longLeftrightharpoons{\mathrm{B}}, rendering as TeX:
  ngLeftrightharpoons{\mathrm{B}}
                     ^
  unexpected control sequence \longLeftrightharpoons
  expecting "%", "\\label", "\\tag", "\\nonumber" or whitespace
[WARNING] Could not convert TeX math {\mathrm{A}}{\tripledash}{\mathrm{B}}{\rlap{\lower.1em{-}}\raise.1em{\tripledash}}{\mathrm{C}}, rendering as TeX:
  hrm{A}}{\tripledash}{\mathrm{B}}{\rlap{\
                     ^
  unexpected control sequence \tripledash
  expecting "%", "\\label", "\\tag", "\\nonumber" or whitespace
[WARNING] Could not convert TeX math {\mathrm{A}}{\rlap{\lower.2em{-}}\rlap{\raise.2em{\tripledash}}-}{\mathrm{B}}{\rlap{\lower.2em{-}}\rlap{\raise.2em{\tripledash}}-}{\mathrm{C}}{\rlap{\lower.2em{-}}\rlap{\raise.2em{-}}\tripledash}{\mathrm{D}}, rendering as TeX:
  {\mathrm{A}}{\rlap{\lower.2em{-}}\rlap{\
                    ^
  unexpected control sequence \rlap
  expecting "%", "\\label", "\\tag", "\\nonumber" or whitespace
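
The first two warnings come from control sequences that texmath does not recognize. One possible workaround, sketched here as an assumption rather than as part of the published filter, is to map them to a simpler arrow before the math reaches texmath. The replacement `\rightleftharpoons` is the command the filter already emits for `<=>`; whether it converts cleanly in a given pandoc version should be checked, and the substitution loses the indication of which side the equilibrium favors. `fallbacks` and `apply_fallbacks` are hypothetical names.

```lua
-- Assumed fallbacks for control sequences reported as unsupported above.
fallbacks = {
  ["\\longRightleftharpoons"] = "\\rightleftharpoons",
  ["\\longLeftrightharpoons"] = "\\rightleftharpoons",
}

-- Replace each known-unsupported control sequence; leave all others alone.
function apply_fallbacks(tex)
  local replaced = tex:gsub("\\%a+", function(cs)
    return fallbacks[cs] -- nil keeps the original control sequence
  end)
  return replaced
end
```

Such a function could be applied to the math string after the \ce expansion; the \tripledash and \rlap cases would need a different treatment.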

TonyWu20 commented 1 year ago

It works pretty well on my own .tex file, though it seems that \ce commands inside \table are not parsed.