Talk about MyST Markdown and Jupyter Markdown Notebooks with the Notebook Format meeting

choldgraf commented 1 year ago

I wasn't sure where the best place to ping you all was, so I figured I'd just put it here since this issue is probably semi-relevant to the discussions. But ping @chrisjsewell @rowanc1 @stevejpurves @agoose77 @sylvaincorlay and @nthiery

There are a bunch of people meeting in Paris right now to discuss potential foundations, constraints, etc for a markdown-based version of Jupyter Notebooks. I had a quick chat with @sylvaincorlay about this and he said that they'd looked at myst and thought it was very close to what would be needed, with a few differences. We discussed a few potential outcomes, but I think our goal could be to find a compromise in MyST syntax that would be acceptable to serve as a "Canonical Jupyter Notebook Markdown Format". It might be a subset of all the syntax MyST supports, but figuring that out is something that is probably best done via live conversation.

I'm pinging you all just because I know people are thinking and discussing this right now at the Jupyter Formats workshop, so wanted to signal-boost it in case you all wanted to organize a chat. I won't be able to attend because I am still super sick and I have a 6 day old infant 🙃 . But consider yourselves pinged!

agoose77 commented 1 year ago

Thanks @choldgraf! We're definitely thinking about how all of this fits together, and it's a thorny problem! Will keep you posted.

Congratulations!!! Hope things are as easy as they can be, and that you get well soon!

chrisjsewell commented 1 year ago

Thanks @choldgraf, ermm I'd be happy to listen in remotely. Is there any provision for this?

(Was thinking to attend myself, but it was trumped by the fact that I'm off to a conference in Las Vegas next week 😝)

agoose77 commented 1 year ago

I'm attending remotely, let me ping @fcollonval to see about late admission!

SylvainCorlay commented 1 year ago

Thanks for launching this discussion @choldgraf!

We looked into the myst-notebook format documented on the Jupyter Book website, as well as some other ideas that came up in the Jupyter Community Workshop on text-based notebook formats.

If we want to make such a format an "officially supported format" by Jupyter, key requirements would be:

(1) We should be able to convert notebooks in both directions losslessly, meaning that anything that can be done in the ipynb format (metadata, output types) should be expressible in the new format.
(2) ipynb will probably never die as there are tens of millions of such files on GitHub.com and probably many many more in other places. If we enable a new feature in the new text-based format, there should be a way to include it in future iterations on the ipynb format.
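
Requirement (1) can be read as a round-trip property: converting to text and back must give exactly the original notebook. Below is a minimal sketch of that check. The two converter pairs are toy stand-ins invented for illustration (they are not the proposed format), and the notebook dict is a simplified ipynb-like structure.

```python
import json

def round_trips(notebook, to_text, from_text):
    """Return True if text conversion followed by parsing restores the notebook."""
    return from_text(to_text(notebook)) == notebook

# Toy lossy converter (hypothetical): keeps only the cell sources.
def lossy_to_text(notebook):
    return "\n\n".join(cell["source"] for cell in notebook["cells"])

def lossy_from_text(text):
    return {
        "cells": [
            {"cell_type": "code", "source": source, "metadata": {}, "outputs": []}
            for source in text.split("\n\n")
        ]
    }

# Toy lossless converter (hypothetical): plain JSON, used only as a baseline.
def json_to_text(notebook):
    return json.dumps(notebook, sort_keys=True)

def json_from_text(text):
    return json.loads(text)

notebook = {
    "cells": [
        {
            "cell_type": "code",
            "source": "1 + 1",
            "metadata": {"tags": ["tag1"]},
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "data": {"text/plain": "2"},
                    "metadata": {},
                }
            ],
        }
    ]
}

lossless_ok = round_trips(notebook, json_to_text, json_from_text)
lossy_ok = round_trips(notebook, lossy_to_text, lossy_from_text)
```

The lossy pair fails the check because cell metadata and outputs are dropped, which is the kind of loss the requirement rules out.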

**Namespacing admonitions**

Now, another looser requirement would be that it should be reasonable for third-parties to support this format (e.g. native rendering of such notebook files by GitHub). On this front, the code-cell admonition seems to be too "top-level" to be adopted. One way we could get away with this would be to namespace admonitions. It would be a much easier ask for e.g. GitHub to implement our spec in jupyter-namespaced admonitions than to add a large number of top-level admonitions to their supported spec.

````markdown
```{jupyter:code}
:execution_count: 1
1 + 1
```
````


☝️ Example input cell. Execution count is not mandatory, but a renderer of `jupyter:code` would know what to do with it.

````markdown
```{jupyter:output}
:execution_count: 1
:output_type: execute_result
{ "text/plain" : "2" }
```
````


☝️ Corresponding output cell, the result of the execution. `output_type` is required. Mime bundle is raw JSON so that it can be used as-is by renderers without any processing.

````markdown
```{jupyter:code}
:execution_count: 2
print(1 + 1)
```
````


☝️ Similar input cell, but using a print statement instead of a mime bundle.

````markdown
```{jupyter:output}
:output_type: stream
2
```
````



☝️ Corresponding output cell, the result of the previous one. `output_type` is required, but this is a stream. Execution count is never included in stream output.
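
As a rough illustration of the examples above, here is a sketch of how an ipynb-style code cell could be written out as `jupyter:code` and `jupyter:output` directives. The function name `render_code_cell` is hypothetical, the sketch assumes `source` and stream `text` are plain strings, and it only handles `stream`, `execute_result` and `display_data` outputs. The fence string is built at runtime purely so the example does not contain a literal fence line.

```python
import json

FENCE = "`" * 3  # three backticks, built at runtime to avoid a literal fence here

def render_code_cell(cell):
    """Render one code cell dict as directive blocks (illustrative only)."""
    lines = [FENCE + "{jupyter:code}"]
    if cell.get("execution_count") is not None:
        # Execution count is optional on the input cell.
        lines.append(f":execution_count: {cell['execution_count']}")
    lines.append(cell["source"])
    lines.append(FENCE)
    blocks = ["\n".join(lines)]

    for output in cell.get("outputs", []):
        out = [FENCE + "{jupyter:output}"]
        # Stream and display_data outputs carry no execution count.
        if output.get("execution_count") is not None:
            out.append(f":execution_count: {output['execution_count']}")
        out.append(f":output_type: {output['output_type']}")
        if output["output_type"] == "stream":
            out.append(output["text"].rstrip("\n"))
        else:
            # The mime bundle is emitted as raw JSON on a single line.
            out.append(json.dumps(output.get("data", {})))
        out.append(FENCE)
        blocks.append("\n".join(out))

    return "\n\n".join(blocks)

result_cell = {
    "cell_type": "code",
    "execution_count": 1,
    "source": "1 + 1",
    "outputs": [
        {
            "output_type": "execute_result",
            "execution_count": 1,
            "data": {"text/plain": "2"},
            "metadata": {},
        }
    ],
}
stream_cell = {
    "cell_type": "code",
    "execution_count": 2,
    "source": "print(1 + 1)",
    "outputs": [{"output_type": "stream", "name": "stdout", "text": "2\n"}],
}

result_text = render_code_cell(result_cell)
stream_text = render_code_cell(stream_cell)
```

Note that this sketch drops the stream `name` and the output `metadata`, so it is not lossless as written; a real serialization would need a place for both.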

**GitHub Flavored Markdown**

Another worry is that GFM seems to be moving in the direction of allowing admonitions, but with a slightly different syntax. Have there been any discussions with the folks over at GitHub about possible convergences?

**Highlighting**

One thing that would make the raw textual content of markdown-based notebooks more readable would be to have a nice CodeMirror (6) syntax highlighting mode for Myst that dims the color of yaml frontmatter and shorthand options, so that readers can see the important content more easily.

This is important as notebooks generated by Jupyter user interfaces will have more metadata attached to them (execution count, cell metadata) than what a person would manually type in a markdown document. Proper highlighting in JupyterLab would mitigate this issue.

chrisjsewell commented 1 year ago

Thanks for the update @SylvainCorlay

For what it's worth, I would mirror this feeling on GitHub's beta feature: https://github.com/community/community/discussions/16925#discussioncomment-4748880

I would also note, if you want "rich markdown", then https://github.com/jgm/djot (a recent endeavour by the creator of pandoc and member of the commonmark committee) is, I feel, really the best shot at having a truly "rigorous" and standardised syntax. To that end, for admonitions they use https://htmlpreview.github.io/?https://github.com/jgm/djot/blob/master/doc/syntax.html#div, which is essentially what myst has started to adopt:

rowanc1 commented 1 year ago

Hi @SylvainCorlay, this is awesome. From your code suggestions the only immediate questions I have are (1) how multiple outputs to a single cell are represented; and (2) how you split markdown cells.

Example: I have put together a sketch here, which almost parses as-is in MyST, so it might give you some other ideas.

For (1): all of the examples that you posted only have a single output rather than an outputs list. I think there is a bit of a mismatch with the current spec, e.g. execution_count on each output, which really exists at the cell level (even if it is stored on the output part of the cell as you have suggested). I am not sure of the solution for this, but calling the directive {jupyter:outputs} (with an s) and having each output on its own line could help?

```{jupyter:outputs}
:execution_count: 2
{ "output_type" : "display_data", "data": ... }
{ "output_type": "error", ...}
```

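To sketch how a `{jupyter:outputs}` body with one JSON output per line might be read back, the function below splits the leading `:key: value` options from the JSON lines. The name `parse_outputs_body` and the sample output objects are made up for illustration; the elided `...` parts in the suggestion above are not filled in, so the sample uses its own complete objects.

```python
import json

def parse_outputs_body(body):
    """Split a directive body into string options and a list of output dicts."""
    options = {}
    outputs = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":") and not outputs:
            # Option lines come first and look like ":key: value".
            key, _, value = line[1:].partition(":")
            options[key.strip()] = value.strip()
        else:
            # Every remaining line is assumed to hold one complete JSON object.
            outputs.append(json.loads(line))
    return options, outputs

sample_body = "\n".join(
    [
        ":execution_count: 2",
        '{"output_type": "display_data", "data": {"text/plain": "<Figure>"}, "metadata": {}}',
        '{"output_type": "error", "ename": "ValueError", "evalue": "bad", "traceback": []}',
    ]
)

options, outputs = parse_outputs_body(sample_body)
```

One JSON object per line keeps the parser trivial, at the cost of long lines for large mime bundles.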

For (2): splitting markdown cells, I think this is important to have in the base spec especially if we are going for full reproduction as a serialization format. That needs to encode metadata as well.

We have done this implicitly in the myst notebooks with splitting on code-cells, however it needs to be explicit for markdown-markdown split -- we did that with a "block-break" ([spec](https://myst-tools.org/docs/spec/blocks#specification)) with json metadata. I think this is in-family with your other suggestions.

+++ {"tags": ["tag1"]}

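A minimal sketch of splitting Markdown text into cells on `+++` block breaks, with optional JSON metadata after the marker. The function name `split_markdown_cells` is hypothetical, and the sketch ignores the case of a `+++` line inside a fenced code block, which a real parser would have to handle.

```python
import json

def split_markdown_cells(text):
    """Split text on lines starting with '+++'; JSON after the marker is cell metadata."""
    cells = []
    metadata = {}
    buffer = []
    # A trailing sentinel break flushes the final cell.
    for line in text.splitlines() + ["+++"]:
        if line.startswith("+++"):
            source = "\n".join(buffer).strip("\n")
            if source or metadata:
                cells.append(
                    {"cell_type": "markdown", "metadata": metadata, "source": source}
                )
            rest = line[3:].strip()
            # Metadata on the break applies to the cell that follows it.
            metadata = json.loads(rest) if rest else {}
            buffer = []
        else:
            buffer.append(line)
    return cells

sample_text = 'First cell.\n\n+++ {"tags": ["tag1"]}\n\nSecond cell.'
cells = split_markdown_cells(sample_text)
```

Empty cells without metadata are dropped here, so this particular sketch would not round-trip a notebook that contains them.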


Having a way to store the outputs as well as making cell IDs visible would be a big step up. I think that both of those could be optional of course for serialization, and that opens up a lot of workflows and can integrate with existing tools without too much work!

Really enjoyed the workshop this week, and had fun working with @agoose77 @stevejpurves and others! Looking forward to the next steps!

chrisjsewell commented 1 year ago

```{jupyter:code}
:execution_count: 1
1 + 1
```


Another discrepancy I would note here, is that I assume this is proposing to store code cell metadata as:

:execution_count: 1
:metadata: {"tags": ["tag1"], "other": "value"}
1 + 1


which is different to the current way:

:tags: ["tag1"]
:other: value
1 + 1


or even:

---
tags:
  - tag1
other: value
---
1 + 1



This is better from a "programmatic"/spec sense, since directive options are really intended to be `str` -> `str` mappings and `code-cell` is the outlier here (being basically `str` -> `str`/`list`/`dict`), which would be nice to fix.

However it is possibly less "user-friendly"
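
To make the `str` -> `str` point concrete, here is a small sketch in which every option value stays a string and only the `:metadata:` option is decoded as JSON afterwards. Both function names are hypothetical and this is not how any existing MyST parser is implemented.

```python
import json

def parse_options(lines):
    """Split leading ':key: value' lines from the body; all values stay strings."""
    options = {}
    body_start = len(lines)
    for index, line in enumerate(lines):
        if not line.startswith(":"):
            body_start = index
            break
        key, _, value = line[1:].partition(":")
        options[key.strip()] = value.strip()
    return options, "\n".join(lines[body_start:])

def cell_metadata(options):
    """Decode the single JSON-valued 'metadata' option, if present."""
    return json.loads(options.get("metadata", "{}"))

directive_lines = [
    ":execution_count: 1",
    ':metadata: {"tags": ["tag1"], "other": "value"}',
    "1 + 1",
]

options, body = parse_options(directive_lines)
metadata = cell_metadata(options)
```

With this layout the directive parser never has to guess whether a value is a string, list or dict; the cost is that users type JSON by hand for tags.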

stevejpurves commented 1 year ago

@SylvainCorlay Thanks for bringing some more context into the convo here!

I've been at the workshop the last three days and participated in a bunch of the discussions around the text-based format, both with @SylvainCorlay and, more so today, the wider group. There is some really positive momentum there, and the state of the initial pre-draft proposal at the end of today is really nice.

An important point, though, is that the proposed syntax, whilst broadly the same (and well aligned with MyST), has moved on from that outlined by @SylvainCorlay above... the latest is different at a detail level, so there is probably limited use in scrutinizing what is on this thread in detail, syntax-wise.

Before I speak to some of the points @SylvainCorlay raised above, I want to communicate what was decided by the group at the end of today's session: probably in the next day an issue will be opened on https://github.com/jupyter/enhancement-proposals communicating the work done and posting a link to the working document, after it's received the final bit of cleanup the group wanted to apply. After that, it's the group's plan to have a draft JEP PR open by the end of next week to formally start the process.

So I'd watch out for those events in order to be able to review the whole proposal and the discussion around it.

To give my opinion on this and speak to @SylvainCorlay's points:

- The group has been pretty pragmatic in trying to define a format that losslessly allows conversion between md and ipynb, sacrificing readability to some degree but maintaining portability and the self-contained nature of the notebook, while still satisfying a number of use cases and requirements. Educational usage, better version control, flexible loading, and streaming are all better served by this format than by ipynb.
- To my eyes, the proposed format is completely myst compatible, with some additional custom directives.
- The notion of namespace on directives is interesting and useful. It reminds me of the standard/convention of vnd._________ in mimetypes, so vendors could use vendor.core-directive to borrow the semantic intent of the core directive but still have a completely separate custom directive implementation. So :+1: on that one!
- GFM Markdown: I am busy testing all the GFM syntax. GFM admonitions are already supported (parsed by mystjs and rendered by myst-to-react), so they already work beautifully in jupyterlab/myst. There are other gaps, though, that are going to be easy to resolve; watch for an issue on that very soon.
- Syntax highlighting could be added to jupyterlab/myst for notebooks and for the md file types too!
- Beyond this, this is a really good use case for us to be looking at how to open up the core to enable additional directives to be added to mystjs via some extension/plugin system too.

Overall it's been a great few days and I think we should aim to contribute to the JEP around this as much as we can.

chrisjsewell commented 1 year ago

Thanks for the update @stevejpurves all sounds fun 😄

> The notion of namespace on directives is interesting and useful.

Just to note this is already part of myst, they are known as domains

> GFM admonitions are already supported

Indeed. Parsing them isn't so difficult. It's just that I don't feel they should be "core" myst syntax, given that (a) we already have a defined admonition syntax, and (b) the GFM syntax is disputable in that it changes the semantic meaning of blockquote syntax (something I was literally just talking about with @rowanc1 regarding attributes on paragraphs 😅)

SylvainCorlay commented 1 year ago

Responding to @rowanc1

> For (1): all of the examples that you posted only have a single output rather than an outputs list

Indeed, we addressed this in discussion on whether to have a single output list, or multiple output directives in sequence. (On the proposal you wrote, note that stream and display_data outputs don't have an execution count.)

Responding to @chrisjsewell

> Another discrepancy I would note here, is that I assume this is proposing to store code cell metadata [...]

Indeed, cell tags are just one type of metadata at the moment.

We could move them outside of the main metadata field, but this should be a separate JEP from the textual notebook format, and be done in both the current ipynb format and the new textual format.

Really, my comments were more about the directives for notebooks entirely in markdown: https://jupyterbook.org/en/stable/file-types/myst-notebooks.html. I think we should absolutely namespace them - and have a discussion on output admonitions. (Maybe better define our future common admonitions before using the general Jupyter namespace in case we converge on a slightly different format).

westurner commented 1 year ago

> The notion of namespace on directives is interesting and useful.

> Just to note this is already part of myst, they are known as domains

westurner commented 1 year ago

There may be multiple calls to display() in an input cell, and that's why there are multiple distinct outputs in the output cell ipynb nbformat json.
Each object displayed by display() MAY return multiple output representations.
- For example, if obj._repr_mimebundle_() returns text/plain, text/markdown, text/html, and application/ld+json, which output format(s) should the .myst notebook contain?
- Linked Data embedded in Markdown is possible if raw HTML containing RDFa is allowed, or e.g. application/ld+json are added as <script type="application/json"> HTML to the markdown
  - Note about CDATA in XML formats like XHTML but not HTML5:
IPython.display.display does not have a _repr_markdown_, but there is an IPython.display.Markdown with a text/markdown MIME type.

      - `_repr_html_`: return raw HTML as a string, or a tuple (see below).
      - `_repr_json_`: return a JSONable dict, or a tuple (see below).
      - `_repr_jpeg_`: return raw JPEG data, or a tuple (see below).
      - `_repr_png_`: return raw PNG data, or a tuple (see below).
      - `_repr_svg_`: return raw SVG data as a string, or a tuple (see below).
      - `_repr_latex_`: return LaTeX commands in a string surrounded by "$",
                        or a tuple (see below).
      - `_repr_mimebundle_`: return a full mimebundle containing the mapping
                             from all mimetypes to data.
                             Use this for any mime-type not listed above.
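
To illustrate the question of which representation a text-based notebook should keep, here is a sketch of an object exposing several representations through `_repr_mimebundle_`, plus one possible selection rule. The `Measurement` class, the `pick_representation` helper and the preference order are assumptions made for this example only; nothing in the thread settles which MIME types a .myst notebook would store.

```python
class Measurement:
    """Toy object offering several representations of the same value."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def _repr_mimebundle_(self, include=None, exclude=None):
        # IPython calls this hook, if defined, to collect all representations at once.
        return {
            "text/plain": f"{self.value} {self.unit}",
            "text/markdown": f"**{self.value}** {self.unit}",
            "text/html": f"<b>{self.value}</b> {self.unit}",
        }

def pick_representation(bundle, preferred=("text/markdown", "text/plain")):
    """Return the first (mime type, data) pair found in the preference order."""
    for mime_type in preferred:
        if mime_type in bundle:
            return mime_type, bundle[mime_type]
    raise KeyError("no preferred MIME type found in bundle")

bundle = Measurement(2, "m")._repr_mimebundle_()
chosen = pick_representation(bundle)
```

Keeping only one representation makes the text file smaller but is lossy with respect to ipynb, which stores the whole bundle.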

Really two practical use cases; from https://github.com/chmp/ipytest/issues/89 :

ipynb > markdown w/ output > mailing list
ipynb > markdown w/ output > README.md

Could the usage examples from Example.ipynb be inlined into the README.md?
https://github.com/chmp/ipytest/blob/main/Example.ipynb
The last time I tried to email a nb with output to a mailing list, IIRC it was easiest to pandoc --from=html --to=gfm than to try and save the input and output cells to Markdown with Jupyter nbconvert or jupytext. (... Why {base64 output etc} is not included in most non-.ipynb notebook representations:
HTML5/RDFa is not XHTML, and inlined HTML should have CDATA and/or must be escaped, which is what nbconvert does when generating HTML from .ipynb JSON. From https://stackoverflow.com/questions/3302648/should-i-use-cdata-in-html5 :
<![CDATA[//><!]]>
)

jupyter-book / myst-enhancement-proposals

Talk about MyST Markdown and Jupyter Markdown Notebooks with the Notebook Format meeting #15