jupyterlab / benchmarks

Benchmarking tools for JupyterLab
https://jupyterlab-benchmarks.readthedocs.io
BSD 3-Clause "New" or "Revised" License
12 stars 16 forks source link

First markdown parser tests #97

Open fcollonval opened 2 years ago

fcollonval commented 2 years ago

Discussion about Markdown parsing is a long standing issue. This PR lays out a structure to test nbconvert and web frontend parser based on the GitHub flavored Commonmark tests.

Most of those tests are failing

Xref: https://github.com/jupyterlab/jupyterlab/issues/272

fcollonval commented 2 years ago

Part of the errors are due to the additional id and link nbconvert and JupyterLab are adding on headings compared to gfm commonmark; e.g. '<h3 id="foo">foo<a class="anchor-link" href="#foo">&#182;</a></h3>' VS '<h3>foo</h3>'

fcollonval commented 2 years ago

Adding the normalization of GFM on the output HTML the results for JupyterLab is 293 failed, 378 passed

The artifact to compare the outputs is available at: https://github.com/jupyterlab/benchmarks/actions/runs/2200787289

fcollonval commented 2 years ago

So a more in-depth analysis shows the following group of discrepancies:

id section markdown commonmark-gfm JupyterLab
10 Tabs #\tFoo\n <h1>Foo</h1> <h1 id="Foo">Foo<a class="jp-InternalAnchorLink" href="#Foo" target="_self">¶</a></h1>
112 Fenced code blocks `ruby\ndef foo(x)\n return 3\nend\n\n` <pre><code class="language-ruby">def foo(x)\n return 3\nend\n</code></pre> <pre><code class="cm-s-jupyter language-ruby"><span class="cm-keyword">def</span> <span class="cm-def">foo</span>(<span class="cm-variable">x</span>)\n <span class="cm-keyword">return</span> <span class="cm-number">3</span>\n<span class="cm-keyword">end</span>\n</code></pre>
133 HTML blocks <Warning>\n*bar*\n</Warning>\n <warning> *bar* </warning> *bar*
140 HTML blocks <script type="text/javascript">\n// JavaScript example\n\ndocument.getElementById("demo").innerHTML = "Hello JavaScript!";\n</script>\nokay\n <script type="text/javascript">// JavaScript example document.getElementById("demo").innerHTML = "Hello JavaScript!";</script><p>okay</p> <p>okay</p>
141 HTML blocks <style\n type="text/css">\nh1 {color:red;}\n\np {color:blue;}\n</style>\nokay\n <style type="text/css">h1 {color:red;} p {color:blue;}</style><p>okay</p> <p>okay</p>
id section markdown commonmark-gfm JupyterLab
1 Tabs \tfoo\tbaz\t\tbim\n <pre><code>foo\tbaz\t\tbim\n</code></pre> <pre><code>foo baz bim\n</code></pre>
120 HTML blocks <div>\n *hello*\n <foo><a>\n <div>*hello* <foo><a> <div>*hello* <a rel="nofollow" target="_self"> </a></div>
121 HTML blocks </div>\n*foo*\n </div>*foo* *foo*
fcollonval commented 2 years ago

Actually marked is running the commonmark and gfm tests in its CI. The results as of Jan 4th 2022 are reported there: https://github.com/markedjs/marked/discussions/1202#discussioncomment-1907552

jasongrout commented 2 years ago

Interestingly, Github just added math rendering, so now there is another opinion about exactly what syntax is used to create math: https://github.blog/changelog/2022-05-19-render-mathematical-expressions-in-markdown/

Math is also rendering in github's notebook preview (see https://github.com/jupyter-widgets/ipywidgets/blob/master/docs/source/examples/Lorenz%20Differential%20Equations.ipynb, for example). It appears to use MathJax 3.2.0.

fcollonval commented 2 years ago

@williamstein pointing me out to some concerns about the GitHub implementation: https://nschloe.github.io/2022/05/20/math-on-github.html; they bring interesting points to have in mind for our parser.

fcollonval commented 2 years ago

Let's list the wanted feature for an ideal markdown parser for JupyterLab:

Reference: Notebook documentation

fcollonval commented 2 years ago

WIP Candidates / features matrix

marked.js markdown-it MyST-parser
Support gfm x[^1] x[^4] x
Math syntax x[^2] ? x
Attachment x[^3] ? ?
Extensible x ?
MyST x

[^1]: Partly true - see test results [^2]: Using some pre processing [^3]: Using customized link handler [^4]: CommonMark run as part of the CI - GFM features available as plugins

Some comments:

What other are using?

williamstein commented 2 years ago

@fcollonval a few days ago I rewrote the upstream markdown-it plugin I'm using for parsing out math, so in cocalc we fully parse math via a plugin, rather than some sort of hack involving parsing before or after markdown is used (like github and jupyter both do, I think). Here's the code:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

It's MIT licensed. My goal with that code is to align with upstream Jupyter in fidelity in terms of what is parsed as math. *In cases where there is a reasonable difference, I would lobby for Jupyter to change. As an example, my plugin parses this properly as inline math:

consider \begin{math}x^3\end{math} and ...

JupyterLab doesn't detect it as math at all. I think it's reasonable to detect.

jasongrout commented 2 years ago

It seems that https://github.github.com/gfm/ has not been updated for math support. Edit: which is what @fcollonval was saying above

jasongrout commented 2 years ago

In cases where there is a reasonable difference, I would lobby for Jupyter to change.

@williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

JasonWeill commented 2 years ago

New Markdown parsers should also address existing Markdown bugs and feature requests, such as:

williamstein commented 2 years ago

In cases where there is a reasonable difference, I would lobby for Jupyter to change.

@williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

It's by definition exactly what this file parses:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

when run as the first plugin in markdown-it. It would be a lot of (very valuable) work for that to get converted to an official spec. My goal with writing and iterating on math-plugin.ts has been to get fidelity with what I think JupyterLab does or should do, and I've incorporated significant feedback from my users. I would not be at all surprised if there are significant bad surprises related to the above linked code though. In fact, I can't wait to test it on the bugs @jweill-aws just listed, and see if my code isn't all broken or not on those...

williamstein commented 2 years ago

I wrote up some thoughts in a README here along with a notebook testing the issues mentioned above:

https://cocalc.com/wstein/support/markdown-math

williamstein commented 2 years ago

There's a related discussion about math + markdown here: https://chat.zulip.org/#narrow/stream/2-general/topic/LaTeX.20math/near/1382932