mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License
759 stars 140 forks source link

Added support for LaTeX equations #87

Closed dyedgreen closed 5 years ago

dyedgreen commented 5 years ago

Supports parsing for $ and $$. This addresses #86.

codecov[bot] commented 5 years ago

Codecov Report

Merging #87 into master will decrease coverage by 0.07%. The diff coverage is 89.74%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #87      +/-   ##
==========================================
- Coverage   93.99%   93.91%   -0.08%     
==========================================
  Files           1        1              
  Lines        2629     2662      +33     
==========================================
+ Hits         2471     2500      +29     
- Misses        158      162       +4
Impacted Files Coverage Ξ”
md4c/md4c.c 93.91% <89.74%> (-0.08%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Ξ” = absolute <relative> (impact), ΓΈ = not affected, ? = missing data Powered by Codecov. Last update f0de199...5a8e260. Read the comment docs.

mity commented 5 years ago

Generally, the approach seems in the right direction to me.

dyedgreen commented 5 years ago

Please see committed changes. Hope that resolves the issues around the naming.

dyedgreen commented 5 years ago

Reviewed mainly the API so far, not the implementation itself.

Regarding the implementation: It basically recreates the way ~strikethrough~ is implemented, with the only difference being that text inside $/$$ is considered code and the delimiters are limited to be between 1 and 2 characters long.

mity commented 5 years ago

Yep. The interface now looks good to me.

Regarding the implementation: It basically recreates the way strikethrough is implemented, with the only difference being that text inside $/$$ is considered code and the delimiters are limited to be between 1 and 2 characters long.

Noticed that. I'm just wondering whether the code span approach which takes marker length into account wouldn't be better. That would fix the strange mixing of $...$$, but maybe could cause some other unwanted side effects.

But I need fresh mind for that. It's 0:48 AM here and I am getting too sleepy...

dyedgreen commented 5 years ago

Another comment: Should we change the name from MD_SPAN_LATEXMATH to MD_SPAN_MATH? I think that $/$$ to open equations, while it does come from LaTeX, is a very common convention now, so it may make sense to not have LaTeX in the name. Especially since this could very well be used to handle other math markup when given a $ / $$, not just LaTeX (since we treat it as text to be passed to the renderer verbatim anyway).

dyedgreen commented 5 years ago

Another thing: did I add the tests correctly? Codecov does not seem to recognise the additional tests.

mity commented 5 years ago

Should we change the name from MD_SPAN_LATEXMATH to MD_SPAN_MATH?

I don't know. If we go that way then the docs/comments can only explain the actual syntax inside the span is sort of a contract between authors of the documents and the particular renderer; and not that it follows some particular syntax. And also in such case, why to use "math" name at all? It could use even other semantics (e.g. chemistry) if it is out of scope of our specification.

But in some ways it looks to me as giving up and contributing to the Markdown babel: Every renderer could expect completely different syntax there and that's imho way to the hell.

So I would rather stick to the "LaTeX math" and specification would refer to LaTeX docs and what it says can be inside of it. Simple renderers would be recommended to output it verbatim, complex one could parse it as LaTeX does.

If someone (mis)uses it for a different purpose (or different math syntax) he will at least know he does somethign non-standard.

mity commented 5 years ago

Another thing: did I add the tests correctly? Codecov does not seem to recognise the additional tests.

It likely does not work in PRs correctly because there are two CI systems setup for the project: AppVeyor (for Windows MSVC builds) and Travis (for Linux builds). Only the Linux build runs the test suite and feeds CodeCov with the code coverity data, but unfortunately the Travis CI seems to ignore PRs and does not make build for them.

IDK how to make MSVC build to collect the coverity data in a way CodeCov would udnerstand (and even then it is likely bad idea to feed CodeCov from multiple data sources). Also, IDK whether the build triggering in Travis can somehow be enabeld so that it does the Linux builkd when a PR is created/updated.

So it likely "fixes" itself when we eventually merge the PR.

mity commented 5 years ago

The latest commit is not enough. See https://github.com/mity/md4c/pull/87/commits/cc861265b5bc68b87b223c223d7ecefddda34a38#diff-7a78427587ec282a7531fcfa87d134bcR4155 and https://github.com/mity/md4c/pull/87/commits/cc861265b5bc68b87b223c223d7ecefddda34a38#diff-7a78427587ec282a7531fcfa87d134bcR4155

The both line now need to fire text_type, not the hard-coded MD_TEXT_CODE.

dyedgreen commented 5 years ago

The latest commit is not enough. See cc86126#diff-7a78427587ec282a7531fcfa87d134bcR4155 and cc86126#diff-7a78427587ec282a7531fcfa87d134bcR4155

The both line now need to fire text_type, not the hard-coded MD_TEXT_CODE.

Just saw that, should be fixed now πŸ˜…

dyedgreen commented 5 years ago

Just saw the unicode whitespace -.- sorry for that...

mity commented 5 years ago

Just saw the unicode whitespace -.- sorry for that...

Sorry. Not sure what you are talking about. Maybe because you force-pushed again. But then the previous version is more or less inaccessible to me.

You can freely add new (fixing) commits instead of force-pushes: I will likely "Squash and merge" the PR into a single commit as it all is not that big change anyway and it all is closely related.

dyedgreen commented 5 years ago

Just saw the unicode whitespace -.- sorry for that...

Sorry. Not sure what you are talking about. Maybe because you force-pushed again. But then the previous version is more ore less inaccessible to me.

You can freely make new commits instead of force-pushes: I will likely "Squash and merge" the PR into a single commit as it all is not that big change anyway and it all is closely related.

No worries, there was an accidental unicode whitespace instead of a normal one and it didn't compile under windows.

mity commented 5 years ago

Looks good to me now.

I'll run some fuzz testing against it during the night, and if it doesn't hit anything I would merge tomorrow or on Monday.

dyedgreen commented 5 years ago

Sounds good πŸ‘

mity commented 5 years ago

Just noticed scripts/run-tests.sh passes outdated option --flatex instead of --flatex-math. Also perhaps keep the pathological tests as the last ones. Those are kind of different then all the other ones.

dyedgreen commented 5 years ago

Should be fixed now. Sorry for that, there was a lot of back and forth πŸ˜…

mity commented 5 years ago

No problem.

mity commented 5 years ago

Playing with it right now a little bit and noticed this:

$ printf '`a\`' | ./md2html/md2html --flatex-math
<p><code>a\</code></p>
$ printf '$a\$' | ./md2html/md2html --flatex-math
<p>$a$</p>

I.e. backslash in a code span and in the math span behaves differently. Is it intended?

(If not, parsing $ would have to be done earlier before escape sequences; at the same time as code span. But it would require quite a big overhaul because those are detected during text scan in md_collect_marks()).

Also

$ printf '$*foo*$' | ./md2html/md2html --flatex-math
<p><equation><em>foo</em></equation></p>

Is that intended? (If not, then using md_rollback() should use MD_ROLLBACK_ALL to disable all Markdown stuff inside the span; not just ranges crossing the opener or closer marks.)

dyedgreen commented 5 years ago

Playing with it right now a little bit and noticed this:

$ printf '`a\`' | ./md2html/md2html --flatex-math
<p><code>a\</code></p>
$ printf '$a\$' | ./md2html/md2html --flatex-math
<p>$a$</p>

I.e. backslash in a code span and in the math span behaves differently. Is it intended?

This is correct, as it is an escaped $, which should not be interpreted as an equation delimiter. (There are also no valid LaTeX equations that end in \.)

Also

$ printf '$*foo*$' | ./md2html/md2html --flatex-math
<p><equation><em>foo</em></equation></p>

Is that intended? (If not, then using md_rollback() should use MD_ROLLBACK_ALL to disable all Markdown stuff inside the span; not just ranges crossing the opener or closer marks.)

That's wrong (although * is also unusual, hence why it slipped me). I'll update the code to do this correctly.

mity commented 5 years ago

Also I wonder whether $ should not be parsed with higher priority then most other Markdown stuff. Consider that math can contain quite a lot of chars which is otherwise meaningful to the Markdown parser.

E.g. * as multiply mark and not Markdown emphasis. Or [] as some matrix/vector indices or math parenthesis of larger importance then () and not Markdown link. Or <> as a comparison sign instead of an autolink. Etc.

This would mean to move $ from the line 3894 and add it e.g. to 3861. It would still not fix it completely. The comment at 3859 is not true anymore: Some of the stuff (code spans, autolinks, raw HTML) has been recently moved directly to the md_collect_marks() as mentioned in my previous comment.

mity commented 5 years ago

Do you have some set of more complex LaTeX math stuff which we could use to test by embedding in some Markdown paragraph which use extensively some links/emphasis etc. to see how to behaves?

dyedgreen commented 5 years ago

Also I wonder whether $ should not be parsed with higher prioroty then most other Markdown stuff. Consider that math can contain quite a lot of chars which is otherwise meaningful to the Markdown parser.

I think there are much fewer chars than you think (have you used LaTeX much before?). Most things are done via commands e.g.:

\times -> *
\frac{a}{b} -> a/b
\left \( -> (
\right \) -> )
...

(for more examples see here and here)

That being said, it still might be useful, since some characters are very frequent like _^\. You are obviously much more familiar with the internal workings of the parser, so moving it to higher priority may be a sensible thing to do. (Although I'm not sure how that would have to be done)

dyedgreen commented 5 years ago

Do you have some set of more complex LaTeX math stuff which we could use to test by embedding on some Markdown paragraph which use extensively some links/emphasis etc. to see how to behaves?

These guys have some tests, although they are embedded into JavaScript documents. Another set of tests is here, this time embedded into html. I'm not sure how easy it would be to extract these.

mity commented 5 years ago

have you used LaTeX much before?

Many years ago and forgot it completely...

... moving it to higher priority may be a sensible thing to do. (Although I'm not sure how that would have to be done)

In short, by removing $ from the line 3894 and add it e.g. to 3861.

Longer answer is that earlier the given mark is handled, it has higher priority. Because in subsequent steps, the md_analyze_marks() skips contents of already resolved ranges. I.e. when it reaches resolved opener mark, the function skips to the corresponding closer mark and continues from there. (Only contents of links/images is later done again from scratch, sort of as a standalone paragraph, with few exceptions.)

If $ would have lower priority then e.g. *, then $blah blah*$* would become $blah blah<em>$</em> because md_analyze_marks() would not see the 2nd $ at all because the emph. span resolved before it would hide it.

Similarly, if $ would have higher priority then *, then $blah blah*$* would become <math>blah blah*</math>*.

And if $ and * have both the same priority as it is now, then it depends on what's resolved earlier by scanning the marks from left to right, as they appear in the text, i.e. *$*$ would be <em>$</em>$ but $*$* would be <math>*</math>*.

dyedgreen commented 5 years ago

In short, by moving $ from the line 3894 and add it e.g. to 3861.

Longer answer is that earlier the given mark is handled, it has higher priority. Because in subsequent steps, the md_analyze_marks() skips contents of the resolved ranges. I.e. when it reaches resolved opener mark, the function skips to the corresponding closer mark and continues from there. (Only contents of links/images is later done again from scratch, sort of as a standalone paragraph, with few exceptions.)

If $ would have lower priority, then e.g. *, then $blah blah*$* would become $blah blah<em>$</em>.

And if $ and * have both the same priority as it is now, then it depends on what's resolved earlier by scanning the marks from left to right, as they appear in the text.

Right. So a very simple change. I still don't know if it is necessary though, from what I can tell, the kind of cases where the symbols mix (which is mostly _<>) are fine:

$ printf '$a_b+c$ bla bla  _ bla bla' | ./md2html/md2html --flatex-math
<p><equation>a_b+c</equation> bla bla  _ bla bla</p>
$ printf '$a*b+c$ bla bla  * bla bla' | ./md2html/md2html --flatex-math
<p><equation>a*b+c</equation> bla bla  * bla bla</p>
$ printf '_ some text $a_b+c$ bla bla' | ./md2html/md2html --flatex-math
<p>_ some text <equation>a_b+c</equation> bla bla</p>
$ printf '< some text $a_b>c$ bla bla' | ./md2html/md2html --flatex-math
<p>&lt; some text <equation>a_b&gt;c</equation> bla bla</p>

(The last case should not escape the inner >, but that is done at the renderer level, and since the example render does not really support LaTeX, I don't see a problem there.)

mity commented 5 years ago

Ok. We can leave that as it is now, and return to it if some problem is encountered.

dyedgreen commented 5 years ago

Yes, that's my thinking too. Especially since this is an internal detail, it can be patched later if anyone discovers a nasty edge-case.

mity commented 5 years ago

(The last case should not escape the inner >, but that is done at the renderer level, and since the example render does not really support LaTeX, I don't see a problem there.)

HTML renderer has to escape it or it would be invalid HTML. An alternative would be instead to wrap the math in CDATA section without any escaping:

<equation><![CDATA[

...here we can now use anything, without the need of escaping...

]]></equation>
dyedgreen commented 5 years ago

Yes. I probably wasn't very clear here. I meant that the md4c parser should not do any escaping here, so that a consumer of the api can correctly render the LaTeX. But since the escaping is done at the level of the HTML renderer it is correct.

mity commented 5 years ago

Merged. Thanks for all the work.

dyedgreen commented 5 years ago

Thank you for bearing with me through all the changes 😊