A book publishing project interested in extending jupytext code

AakashGfude commented 4 years ago

Hey there @mwouts - I wanted to introduce myself and a team that I am working with.

We are a group of academic researchers who are working on a tech stack to build open, reproducible documents with Python. We've set up a GitHub organization to host the projects that we're working on as a part of this project: https://github.com/ExecutableBookProject.

To better equip ourselves and the community for writing complex documents, we are also building a new markup format called myst (https://github.com/ExecutableBookProject/myst), which tries to combine the extensibility and strong semantic markup of reStructuredText with some features of Markdown.

Now, to do a seamless conversion between myst and ipynb, we thought of extending your amazing tool to include the myst format. Before we write any code, it would be great if you could give us a heads-up or any ideas/suggestions on doing this properly.

Thanks again for this great project!

mwouts commented 4 years ago

Hello @AakashGfude , thanks for reaching out! I'd be happy to help.

Tell me a bit more about the myst format, and how you see the mapping to a Jupyter notebook: How do you represent a code cell in myst? Is there a cell marker in myst to separate two consecutive text cells? How do you plan to encode the cell metadata?

mmcky commented 4 years ago

thanks @mwouts. We are working on putting together a more detailed spec that documents the mappings between myst and ipynb format. We have added cell delimiters in myst such as +++ (all subject to change at this stage). Once the spec is put together we will certainly share it with you for thoughts and comments.

Our aim is to get exact two-way representations with the vision we can swap between human readable (text based format) and the machine readable (notebook format) as a mirror format.

mwouts commented 4 years ago

Sounds great! I am looking forward to reading that.

mmcky commented 4 years ago

@mwouts while we are working on the myst <-> ipynb spec, I had a question regarding the workflow / architecture of jupytext.

We are interested in setting up real-time updating between ipynb and text-based formats (for those that have lossless conversions). On larger projects that maintain source files in text format, we have found that users typically build the notebooks to run the code blocks, inevitably edit the notebooks, and then forget to transfer those edits back to the text files. Do you think a two-way mirroring between formats would be achievable? We are happy to work on this, but wanted to check with you first.

phaustin commented 4 years ago

@mmcky -- how would this differ from the current jupytext.TextFileContentsManager?

mmcky commented 4 years ago

thanks @phaustin -- that's neat! I hadn't realised you can open the md file directly through the Jupyter interface and that it is represented as a notebook on the fly by a contents manager. I had thought jupytext was mainly built around companion ipynb and md files, kept in sync through save actions.

I had assumed the workflow would be to open the text-based file in an editor and the ipynb file in Jupyter, and use a file watcher to keep both in sync in real time, so you could edit in either location and the other format would update. But opening the md file directly through Jupyter is a neat way to handle this issue, as a save in Jupyter will alter the md.

The only confusing part to me is if you open an md file -- it seems to create an ipynb file by default. I would have thought if you open an md file you just want the translation to ipynb on the fly and keep md as the source of truth. {Update: Oh I see -- that is just default behaviour in jupytext with notebook autosave and pair with notebook enabled}

mwouts commented 4 years ago

Hello! Yes, I agree with @phaustin: a proxy for real-time sync is implemented in the ContentsManager. You've seen how it works, right? When you save the document, all its representations are written to disk (e.g. ipynb and md when you use a paired notebook, or md only if you opened an md document with no pairing information). When you reload the notebook, the input cells are taken from the most recent text file and joined with the outputs of the ipynb file, if any, using the function combine_inputs_with_outputs from combine.py.
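To illustrate the idea of joining inputs with outputs, here is a minimal sketch. It is not the code from combine.py: the function name `combine_inputs_with_outputs_sketch`, the plain-dict cell layout, and the "match code cells by identical source" rule are all assumptions made for this example.

```python
from copy import deepcopy


def combine_inputs_with_outputs_sketch(text_cells, ipynb_cells):
    """Toy illustration: inputs come from the text file, outputs from the ipynb.

    Cells are plain dicts here (an assumption of this sketch), and a code cell
    only gets its outputs back when its source is identical in both files.
    """
    # Index the saved code cells by source; the same source may occur more than once
    saved = {}
    for cell in ipynb_cells:
        if cell["cell_type"] == "code":
            saved.setdefault(cell["source"], []).append(cell)

    combined = []
    for cell in text_cells:
        new_cell = deepcopy(cell)
        if cell["cell_type"] == "code":
            matches = saved.get(cell["source"])
            if matches:
                # Reuse the outputs of the first unused cell with the same source
                old = matches.pop(0)
                new_cell["outputs"] = deepcopy(old.get("outputs", []))
                new_cell["execution_count"] = old.get("execution_count")
            else:
                # New or edited code cell: nothing to restore
                new_cell["outputs"] = []
                new_cell["execution_count"] = None
        combined.append(new_cell)
    return combined
```

With this rule, an edited code cell simply comes back without outputs, while untouched cells keep theirs.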

> The only confusing part to me is if you open an md file -- it seems to create an ipynb file by default.

When you open an md file with no pairing information in Jupyter, the contents manager does return a document of type notebook (using jupytext.read). However, no ipynb file is created.

mwouts commented 4 years ago

Note that actually, I would be interested in going one step closer to real realtime updating. For me, the difference with the current behavior would be the following:

- When the user edits the md file on disk, the notebook is updated automatically, without having to reload it.
- The outputs are preserved in the browser, not in an (optional) ipynb file. Thanks to this, a) sync is faster, since we never read the ipynb file on disk for this realtime sync, only the text files, which are much lighter; and b) if the notebook is md only, we don't lose the outputs when the notebook is updated (currently, when you reload an md-only notebook, outputs are lost).

This real realtime sync is being discussed at #406, and will require a good understanding of the JS/TS part of Jupyter, together with a port of the combine_inputs_with_outputs function to these languages.
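As a rough illustration of the first point (noticing that the text file changed on disk), a polling check could look like the sketch below. This is only an assumption about one possible building block, written in Python for readability; the class name `FileChangeDetector` is hypothetical, and the actual work discussed at #406 would live in the JS/TS part of Jupyter.

```python
import os


class FileChangeDetector:
    """Report whether a file changed since the last check (polling sketch)."""

    def __init__(self, path):
        self.path = path
        self._last_stamp = self._stamp()

    def _stamp(self):
        # Modification time plus size is a cheap fingerprint of the file
        info = os.stat(self.path)
        return (info.st_mtime_ns, info.st_size)

    def has_changed(self):
        """Return True once after each change seen on disk."""
        stamp = self._stamp()
        changed = stamp != self._last_stamp
        self._last_stamp = stamp
        return changed
```

A caller would poll `has_changed()` periodically and, on `True`, re-read only the text file and merge the new inputs with the outputs already held in the browser.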

mwouts commented 4 years ago

Now, coming back to your initial question of how to extend Jupytext to another format: I suggest that you have a look at how the .Rmd format is implemented, starting with formats.py. That format derives from .md, with a few changes in how the text files are parsed. Note that I am not particularly proud of the implementation (you may find exceptions based on the file extension here and there), but at least it works! I can also contribute a POC for the implementation of your format when you're done with the specs.

Also, the test framework is very important in Jupytext, since it can help you make sure that the roundtrip really works. So, for your new myst format, a) you could add a series of tests on simple notebooks (seek inspiration in e.g. test_read_simple_markdown.py), and b) you could duplicate these lines in test_mirror.py and replace md with myst. This will test your new myst format for the roundtrip on a series of challenging notebooks that we have collected over time.
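To make the roundtrip idea concrete, here is a minimal, self-contained sketch of such a test. It does not use Jupytext's test suite or the real myst syntax: the `+++` delimiter comes from this thread (and is subject to change), while the `%%code` marker and the functions `toy_reads` and `toy_writes` are invented purely for illustration.

```python
DELIMITER = "+++"       # cell delimiter mentioned in this thread (may change)
CODE_MARKER = "%%code"  # invented for this sketch; not part of any real spec


def toy_writes(cells):
    """Serialize a list of cell dicts to the toy text format."""
    chunks = []
    for cell in cells:
        if cell["cell_type"] == "code":
            chunks.append(CODE_MARKER + "\n" + cell["source"])
        else:
            chunks.append(cell["source"])
    return ("\n" + DELIMITER + "\n").join(chunks)


def toy_reads(text):
    """Parse the toy text format back into a list of cell dicts."""
    cells = []
    for chunk in text.split("\n" + DELIMITER + "\n"):
        if chunk.startswith(CODE_MARKER + "\n"):
            source = chunk[len(CODE_MARKER) + 1:]
            cells.append({"cell_type": "code", "source": source})
        else:
            cells.append({"cell_type": "markdown", "source": chunk})
    return cells


def test_roundtrip():
    cells = [
        {"cell_type": "markdown", "source": "# Title"},
        {"cell_type": "markdown", "source": "A second, consecutive text cell"},
        {"cell_type": "code", "source": "x = 1\nprint(x)"},
    ]
    text = toy_writes(cells)
    # notebook -> text -> notebook gives back the same cells
    assert toy_reads(text) == cells
    # text -> notebook -> text gives back the same text
    assert toy_writes(toy_reads(text)) == text
```

The toy format breaks if a cell itself contains a `+++` line, which is exactly the kind of corner case a collection of challenging notebooks is meant to catch.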

mmcky commented 4 years ago

thanks @mwouts, your comments about the realtime sync sound great. I will follow #406.

> I suggest that you have a look at how the .Rmd format is implemented, starting with formats.py.

Thanks for the guidance. That sounds like a good entry point and we would want to do something very similar. I suggest we work in a fork of jupytext to get the myst syntax working, and once the spec has settled down we can upstream the new format.

Roundtrip is really important to us, so thanks for the guidance on testing too. Super helpful.

cc: @AakashGfude

choldgraf commented 4 years ago

hey @mwouts 👋 didn't realize this thread was going on! Just FYI this is the project that we were emailing about a few weeks back! Let me know if I can help move the conversation forward!

Per your earlier questions about how the notebook structure would be represented in MyST - this is our latest thinking:

https://github.com/ExecutableBookProject/MyST-NB/issues/12#issue-567866971

We'd love to hear your thoughts on the proposal there!

mwouts commented 4 years ago

By the way, I realise that, if you already have a two way converter myst <-> ipynb, you can plug it directly into the jupytext.reads/writes functions. An example of this is the pandoc format, for which we simply call pandoc:

https://github.com/mwouts/jupytext/blob/dbc6012d35a93d74ca915cc17365ebda139f3022/jupytext/jupytext.py#L54-L55

If you decide to take that route, you may:

- add your converter as an optional dependency to Jupytext,
- add an if self.fmt.get('format_name') == 'myst': branch in both jupytext.reads and jupytext.writes,
- declare the myst format in formats.py (and import the myst format version number from your package),
- and activate the round trip tests for the new myst format, as discussed above.

This will provide the same functionality (i.e. support of myst on the command line and in the contents manager), and may be easier to develop or maintain.
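The dispatch described above could be sketched as follows. This is a standalone illustration, not Jupytext's code: `myst_to_notebook` and `notebook_to_myst` stand in for a hypothetical external converter, and `default_reads` and `default_writes` stand in for the existing parsing path.

```python
def myst_to_notebook(text):
    """Hypothetical external converter: text -> notebook dict."""
    cells = [{"cell_type": "markdown", "source": s} for s in text.split("\n+++\n")]
    return {"cells": cells, "metadata": {}}


def notebook_to_myst(notebook):
    """Hypothetical external converter: notebook dict -> text."""
    return "\n+++\n".join(cell["source"] for cell in notebook["cells"])


def default_reads(text, fmt):
    """Placeholder for the existing parsing path (one cell per document)."""
    return {"cells": [{"cell_type": "markdown", "source": text}], "metadata": {}}


def default_writes(notebook, fmt):
    """Placeholder for the existing export path."""
    return "\n\n".join(cell["source"] for cell in notebook["cells"])


def reads(text, fmt):
    # Delegate to the external converter when the myst format is requested
    if fmt.get("format_name") == "myst":
        return myst_to_notebook(text)
    return default_reads(text, fmt)


def writes(notebook, fmt):
    if fmt.get("format_name") == "myst":
        return notebook_to_myst(notebook)
    return default_writes(notebook, fmt)
```

The point of this design is that the format's parsing logic stays in the external package, and the calling code only needs to recognise the format name.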

mwouts / jupytext

A book publishing project interested in extending jupytext code #447