First markdown parser tests

fcollonval commented 2 years ago

Markdown parsing is a long-standing issue. This PR lays out a structure for testing the nbconvert and web frontend parsers against the GitHub Flavored CommonMark tests.

Most of those tests are currently failing.

Xref: https://github.com/jupyterlab/jupyterlab/issues/272

fcollonval commented 2 years ago

Some of the failures are due to the additional id and anchor link that nbconvert and JupyterLab add to headings compared to GFM CommonMark, e.g. '<h3 id="foo">foo<a class="anchor-link" href="#foo">¶</a></h3>' vs. '<h3>foo</h3>'.
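To make the heading difference concrete, here is a minimal Python sketch that strips those heading extras before comparing against the reference output. The helper name `strip_heading_extras` and the regular expressions are illustrative assumptions, not the normalization code used in this PR, and it only handles the two anchor classes quoted in this thread.

```python
import re

# Hypothetical helper for illustration only; not the code used in this PR.
# Matches the anchor links that nbconvert ("anchor-link") and JupyterLab
# ("jp-InternalAnchorLink") append to headings.
_ANCHOR_RE = re.compile(
    r'<a\s+class="(?:anchor-link|jp-InternalAnchorLink)"[^>]*>.*?</a>',
    re.DOTALL,
)
# Matches the id attribute added to <h1>..<h6> opening tags.
_HEADING_ID_RE = re.compile(r'(<h[1-6])\s+id="[^"]*"')


def strip_heading_extras(html: str) -> str:
    """Remove heading ids and anchor links so headings compare like GFM output."""
    html = _ANCHOR_RE.sub("", html)
    return _HEADING_ID_RE.sub(r"\1", html)
```

With this, both the nbconvert and the JupyterLab heading forms reduce to the bare `<hN>` element that the reference renderer produces.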

fcollonval commented 2 years ago

After applying the GFM normalization to the output HTML, the results for JupyterLab are 293 failed, 378 passed.

The artifact to compare the outputs is available at: https://github.com/jupyterlab/benchmarks/actions/runs/2200787289

fcollonval commented 2 years ago

A more in-depth analysis shows the following groups of discrepancies:

Desired discrepancies:
- Heading id and link
- Code block styling
- HTML sanitization

| id | section | markdown | commonmark-gfm | JupyterLab |
| --- | --- | --- | --- | --- |
| 10 | Tabs | `#\tFoo\n` | `<h1>Foo</h1>` | `<h1 id="Foo">Foo<a class="jp-InternalAnchorLink" href="#Foo" target="_self">¶</a></h1>` |
| 112 | Fenced code blocks | `ruby\ndef foo(x)\n return 3\nend\n\n` | `<pre><code class="language-ruby">def foo(x)\n return 3\nend\n</code></pre>` | `<pre><code class="cm-s-jupyter language-ruby"><span class="cm-keyword">def</span> <span class="cm-def">foo</span>(<span class="cm-variable">x</span>)\n <span class="cm-keyword">return</span> <span class="cm-number">3</span>\n<span class="cm-keyword">end</span>\n</code></pre>` |
| 133 | HTML blocks | `<Warning>\nbar\n</Warning>\n` | `<warning> bar </warning>` | `bar` |
| 140 | HTML blocks | `<script type="text/javascript">\n// JavaScript example\n\ndocument.getElementById("demo").innerHTML = "Hello JavaScript!";\n</script>\nokay\n` | `<script type="text/javascript">// JavaScript example document.getElementById("demo").innerHTML = "Hello JavaScript!";</script><p>okay</p>` | `<p>okay</p>` |
| 141 | HTML blocks | `<style\n type="text/css">\nh1 {color:red;}\n\np {color:blue;}\n</style>\nokay\n` | `<style type="text/css">h1 {color:red;} p {color:blue;}</style><p>okay</p>` | `<p>okay</p>` |

Discrepancies to fix:
- Handling of tab and newline characters
- [TBC] Unpaired HTML tags

| id | section | markdown | commonmark-gfm | JupyterLab |
| --- | --- | --- | --- | --- |
| 1 | Tabs | `\tfoo\tbaz\t\tbim\n` | `<pre><code>foo\tbaz\t\tbim\n</code></pre>` | `<pre><code>foo baz bim\n</code></pre>` |
| 120 | HTML blocks | `<div>\n hello\n <foo><a>\n` | `<div>hello <foo><a>` | `<div>hello <a rel="nofollow" target="_self"> </a></div>` |
| 121 | HTML blocks | `</div>\nfoo\n`Unique | `</div>foo` | `foo` |
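
For the tab case (example 1), the expected output keeps the interior tabs in the code content. In CommonMark, tabs are not expanded to spaces; they only behave as 4-column tab stops where whitespace defines block structure. Below is a minimal Python sketch of that rule for a line that is already known to belong to an indented code block. The function name `strip_code_indent` is hypothetical, and this is not how marked or JupyterLab implement it.

```python
def strip_code_indent(line: str, width: int = 4) -> str:
    """Drop one level of code-block indentation, keeping interior tabs.

    Assumes the caller already decided the line is indented code
    (illustrative sketch only).
    """
    col = 0  # current column, counting a tab as a jump to the next tab stop
    i = 0    # index of the first character that is kept
    while i < len(line) and col < width:
        ch = line[i]
        if ch == "\t":
            col += width - (col % width)  # advance to the next multiple of width
        elif ch == " ":
            col += 1
        else:
            break  # non-whitespace reached before the indent was consumed
        i += 1
    return line[i:]
```

Only the leading indentation is consumed; the tabs between `foo`, `baz` and `bim` are returned untouched, which is what the reference output in the table shows.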

fcollonval commented 2 years ago

Actually, marked runs the CommonMark and GFM tests in its CI. The results as of Jan 4th, 2022 are reported here: https://github.com/markedjs/marked/discussions/1202#discussioncomment-1907552

jasongrout commented 2 years ago

Interestingly, GitHub just added math rendering, so now there is another opinion about exactly what syntax is used to create math: https://github.blog/changelog/2022-05-19-render-mathematical-expressions-in-markdown/

Math also renders in GitHub's notebook preview (see https://github.com/jupyter-widgets/ipywidgets/blob/master/docs/source/examples/Lorenz%20Differential%20Equations.ipynb, for example). It appears to use MathJax 3.2.0.

fcollonval commented 2 years ago

@williamstein pointed me to some concerns about the GitHub implementation: https://nschloe.github.io/2022/05/20/math-on-github.html. They raise interesting points to keep in mind for our parser.

fcollonval commented 2 years ago

Let's list the desired features of an ideal Markdown parser for JupyterLab:

- Parse GitHub-flavored CommonMark syntax
  - This does not cover the late additions of mermaid-js or their way of supporting math equations.
- Support math syntax (to be specified)
- Cell attachments
- [TBC] Extensible by JupyterLab extensions
- [TBC] MyST support (see JupyterLab survey analysis)

Reference: Notebook documentation

fcollonval commented 2 years ago

WIP Candidates / features matrix

| | marked.js | markdown-it | MyST-parser |
| --- | --- | --- | --- |
| Support gfm | x[^1] | x[^4] | x |
| Math syntax | x[^2] | ? | x |
| Attachment | x[^3] | ? | ? |
| Extensible | | x | ? |
| MyST | | | x |

[^1]: Partly true - see test results
[^2]: Using some pre-processing
[^3]: Using a customized link handler
[^4]: CommonMark tests run as part of the CI - GFM features available as plugins

Some comments:

- MyST-parser provides opinionated markdown-it plugins

What are others using?

- VS Code: markdown-it
- CoCalc: markdown-it

williamstein commented 2 years ago

@fcollonval a few days ago I rewrote the upstream markdown-it plugin I'm using for parsing out math, so in CoCalc we fully parse math via a plugin, rather than through some sort of hack involving parsing before or after markdown is used (like GitHub and Jupyter both do, I think). Here's the code:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

It's MIT licensed. My goal with that code is to match upstream Jupyter as faithfully as possible in what is parsed as math. In cases where there is a reasonable difference, I would lobby for Jupyter to change. As an example, my plugin parses this properly as inline math:

consider \begin{math}x^3\end{math} and ...

JupyterLab doesn't detect it as math at all. I think it's reasonable to detect.
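
As a rough illustration of what detecting that form could look like, here is a small Python sketch. It is not the logic of math-plugin.ts; the regular expression and the name `find_inline_math` are assumptions made for this example, and it ignores escaping, code spans, and the `$...$` delimiters.

```python
import re

# Illustrative pattern only (an assumption, not taken from math-plugin.ts):
# captures the body of \begin{math} ... \end{math}, non-greedily.
INLINE_MATH_RE = re.compile(r"\\begin\{math\}(.+?)\\end\{math\}", re.DOTALL)


def find_inline_math(text: str) -> list[str]:
    """Return the bodies of all \\begin{math}...\\end{math} spans in text."""
    return [match.group(1) for match in INLINE_MATH_RE.finditer(text)]
```

Calling `find_inline_math` on the sentence above yields the single span `x^3`; a real parser plugin would additionally need to skip matches inside code spans and fenced blocks.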

jasongrout commented 2 years ago

It seems that https://github.github.com/gfm/ has not been updated for math support. Edit: which is what @fcollonval was saying above

jasongrout commented 2 years ago

> In cases where there is a reasonable difference, I would lobby for Jupyter to change.

@williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

JasonWeill commented 2 years ago

New Markdown parsers should also address existing Markdown bugs and feature requests, such as:

https://github.com/jupyterlab/jupyterlab/issues/12524 (LaTeX overlapping)
https://github.com/jupyterlab/jupyterlab/issues/12561 / https://github.com/jupyterlab/jupyterlab/issues/272 (URLs with spaces)
https://github.com/jupyterlab/jupyterlab/issues/12432 (ordered list with letter or Roman numeral ordinals)

williamstein commented 2 years ago

> > In cases where there is a reasonable difference, I would lobby for Jupyter to change.
>
> @williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

It's by definition exactly what this file parses:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

when run as the first plugin in markdown-it. It would be a lot of (very valuable) work to convert that into an official spec. My goal in writing and iterating on math-plugin.ts has been to get fidelity with what I think JupyterLab does or should do, and I've incorporated significant feedback from my users. I would not be at all surprised if there are significant bad surprises in the above-linked code, though. In fact, I can't wait to test it on the bugs @jweill-aws just listed and see whether or not my code is broken on those...

williamstein commented 2 years ago

I wrote up some thoughts in a README here along with a notebook testing the issues mentioned above:

https://cocalc.com/wstein/support/markdown-math

williamstein commented 2 years ago

There's a related discussion about math + markdown here: https://chat.zulip.org/#narrow/stream/2-general/topic/LaTeX.20math/near/1382932

jupyterlab / benchmarks

First markdown parser tests #97