mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Support LaTeX style equations #86

Closed dyedgreen closed 5 years ago

dyedgreen commented 5 years ago

I have a use-case where I need to support LaTeX style equations (e.g. $a^2+b^2=c^2$, $$\int_a^b x dx$$ for display).

I managed to cobble something together to have an extension that emits these as spans. The extension can be activated using a flag, much like the ~strike through~ extension. I haven't written any tests or added support for it in the included html renderer (in part because I need a custom renderer and in part because there is no standard way of displaying LaTeX equations in html).

If you want, I can create a pull request, and with a few pointers I could add support in the renderer and write some tests.

mity commented 5 years ago

I understand there is quite a large demand for this feature within the Markdown user community, as https://talk.commonmark.org/search?q=math clearly shows. So yes, supporting it would be a great thing.

That said, I have several points:

  1. It seems there is no real consensus about the syntax yet (if ever there will be), see e.g. https://github.com/cben/mathdown/wiki/math-in-markdown.

    Given the large need, I am still fine with adding support for it, but ideally we would add the syntax that seems to be the most common one. Do you know the current state of this support in other Markdown implementations?

  2. Also, even though you mentioned just spans, there is clearly demand for math blocks as well. So the syntax should be friendly to an eventual expansion in that direction. I.e. the math syntax inside the block and the span should ideally be the same once blocks are supported too, so that only the outer decoration differs. Is it fine from this POV?

  3. What about false positives? Can the extension deal reasonably with $ when it isn't meant to mark a LaTeX equation? Consider e.g. this:

    I thought the ticket was $20, but instead it was $25.

    Pandoc uses some heuristics for this, so perhaps we can follow them?

  4. Your 2nd example includes a backslash. Backslashes have special meaning in Markdown, so we have to be careful that the two do not interact badly.

  5. Any new syntax feature must have its counterpart in the HTML renderer because all our testing is based on it. However, it would be completely fine if it generates something super-simple such as <span class="math inline">\(1+1=2\)</span>, similar to what Pandoc does.
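
To make point 5 concrete, here is a minimal sketch of what such a super-simple renderer step could look like. It is not md4c code: `render_math_html` and `append` are hypothetical names, the caller-supplied buffer is an assumption made to keep the example self-contained, and the display variant (`math display` with `\[...\]`) is modelled on Pandoc's output rather than on anything decided in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Append len bytes of src to out (capacity cap, current length *n).
 * Keeps out NUL-terminated; returns 0 if the data would not fit. */
static int append(char *out, size_t cap, size_t *n, const char *src, size_t len)
{
    if (*n + len + 1 > cap)
        return 0;
    memcpy(out + *n, src, len);
    *n += len;
    out[*n] = '\0';
    return 1;
}

/* Hypothetical helper: wrap raw equation text in a Pandoc-like span.
 * Backslashes are copied verbatim; only <, > and & are HTML-escaped.
 * Returns 1 on success, 0 if the output buffer is too small. */
static int render_math_html(const char *tex, size_t len, int display,
                            char *out, size_t cap)
{
    const char *open = display ? "<span class=\"math display\">\\["
                               : "<span class=\"math inline\">\\(";
    const char *close = display ? "\\]</span>" : "\\)</span>";
    size_t n = 0, i;

    if (cap == 0)
        return 0;
    out[0] = '\0';
    if (!append(out, cap, &n, open, strlen(open)))
        return 0;
    for (i = 0; i < len; i++) {
        const char *rep = NULL;
        switch (tex[i]) {
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        case '&': rep = "&amp;"; break;
        default: break;
        }
        if (rep ? !append(out, cap, &n, rep, strlen(rep))
                : !append(out, cap, &n, &tex[i], 1))
            return 0;
    }
    return append(out, cap, &n, close, strlen(close));
}
```

Calling `render_math_html("1+1=2", 5, 0, buf, sizeof buf)` fills `buf` with the span quoted in point 5, which would be enough for an HTML-based test file to compare against.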

(Sorry for possibly banal questions. I am not LaTeX math user myself.)

dyedgreen commented 5 years ago

Regarding your points:

  1. My extension supports $ for inline and $$ for display equations (like LaTeX), and is very lenient in e.g. accepting $...$$ and $$...$ where it only cares about the opening. (This is possibly something that should be changed, but doing it this way was the easiest and fastest way for now). I don't really know what other implementations do.

  2. I think the way to go here is just supporting spans. The difference in math typesetting is between inline equations that flow with the text:

    bla bla bla $a+b=c$ bla bla
    
    vs.
    
    bla bla bla:
    $$
    a+b=c,
    $$
    bla bla.

    So in that sense the equations should behave like images, that can be part of a larger paragraph. (It is common to regard equations as part of the sentence, both when they are inline and when they are display i.e. take up their own line etc.)

  3. What I have right now requires escaping $ and $$ as \$ and \$\$. However, the heuristic for the single dollar looks good. I'd probably need some pointers on how best to implement such a heuristic, though.

  4. LaTeX uses backslashes all over the place, so the contents of equation spans are treated like code, which preserves the backslashes. (The drawback is that you can't include $ in an equation, but backslashes are so common in LaTeX equations that I think the trade-off is worth it.)

  5. The renderer in my project simply emits $($)...$($) as <equation (type="display")>..., so I could add that and the latex flag to the renderer and also write a test file.
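
As a starting point for the single-dollar heuristic from point 3, here is a rough sketch of the rule Pandoc documents for its `tex_math_dollars` extension: the opening $ must be followed directly by a non-space character, and the closing $ must be preceded directly by a non-space character and must not be followed by a digit. The names `MathSpan` and `match_inline_math` are made up for illustration, the body is treated as raw text as described in point 4, and `$$` openers are deliberately left out. None of this is the code from the pull request.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type: byte offsets of the equation body. */
typedef struct {
    size_t begin;   /* first byte after the opening $ */
    size_t end;     /* offset of the closing $ */
} MathSpan;

static int is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Returns 1 and fills *span if an inline equation opens at s[pos].
 * Simplifications (assumptions of this sketch): a $ preceded by a
 * backslash is always literal, and "$$" is not handled here. */
static int match_inline_math(const char *s, size_t len, size_t pos,
                             MathSpan *span)
{
    size_t i;

    if (pos >= len || s[pos] != '$')
        return 0;
    if (pos > 0 && s[pos - 1] == '\\')
        return 0;                       /* \$ is a literal dollar */
    if (pos + 1 >= len || is_space(s[pos + 1]) || s[pos + 1] == '$')
        return 0;                       /* opener needs a non-space after it */

    for (i = pos + 2; i < len; i++) {
        if (s[i] != '$')
            continue;
        if (is_space(s[i - 1]))
            continue;                   /* closer needs a non-space before it */
        if (i + 1 < len && isdigit((unsigned char)s[i + 1]))
            continue;                   /* "$20"-style amount, not a closer */
        span->begin = pos + 1;
        span->end = i;
        return 1;
    }
    return 0;                           /* no closer: keep the $ as text */
}
```

With this rule neither $ in "I thought the ticket was $20, but instead it was $25." starts an equation, because the only candidate closer is preceded by a space, while "$a^2+b^2=c^2$" still matches.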

I think I'll submit a pull request later today with what I have right now and the extension to the html renderer plus the test cases. From there you could point out how I can run the tests etc. and what you might like to change about the implementation.

mity commented 5 years ago

> My extension supports $ for inline and $$ for display equations (like LaTeX).

Is "display equation" something that makes it stand out in the text, as e.g. here?

If yes, then imho it corresponds more or less to a Markdown block even though it is part of a sentence.

  1. Markdown's main philosophy is that its source (to a reasonable degree) looks the same as the rendered document.
  2. Quite commonly code blocks are used the same way. See e.g. here or here.

Or would it cause some trouble when rendering the LaTeX output?

> I think I'll submit a pull request later today with what I have right now and the extension to the html renderer plus the test cases.

Sounds fine to me.

dyedgreen commented 5 years ago

Yes, that is exactly what is meant by a display equation (the ones that are on their own line and centred). But I think that they should still be treated as spans for the following reason: in LaTeX, there is a difference between:

bla bla
$$
equation
$$

bla bla 

and

bla bla
$$
equation
$$
bla bla

The first example opens a new paragraph, while the second does not. Much like with images in Markdown:

bla bla
![my image](https://...)

bla bla

vs

bla bla
![my image](https://...)
bla bla

So I think handling it as a span in both cases is the simplest way.
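
A small sketch of the consequence, under the assumption that equations are plain span content and only a blank line ends a paragraph: counting paragraphs then needs no knowledge of $$ at all. `count_paragraphs` is a hypothetical helper written for this illustration, not part of md4c, and it ignores every other Markdown block construct.

```c
#include <stddef.h>

/* Count blank-line-separated paragraphs. Lines holding "$$" or an
 * equation are ordinary paragraph content here, which is exactly the
 * "treat it as a span" behaviour argued for above. */
static size_t count_paragraphs(const char *text)
{
    size_t count = 0;
    int in_para = 0;        /* currently inside a paragraph? */
    int line_blank = 1;     /* has the current line only whitespace so far? */
    const char *p;

    for (p = text; ; p++) {
        if (*p == '\n' || *p == '\0') {
            if (line_blank) {
                in_para = 0;            /* blank line closes the paragraph */
            } else if (!in_para) {
                count++;                /* first non-blank line opens one */
                in_para = 1;
            }
            line_blank = 1;
            if (*p == '\0')
                break;
        } else if (*p != ' ' && *p != '\t' && *p != '\r') {
            line_blank = 0;
        }
    }
    return count;
}
```

For the two LaTeX examples above, the version with the blank line after the closing $$ counts as two paragraphs and the version without it counts as one, matching the image analogy.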