mwouts / jupytext

Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
https://jupytext.readthedocs.io
MIT License

Losing code cell language from MyST markdown inputs with `--pipe black` #1267

Open davidorme opened 3 months ago

davidorme commented 3 months ago

We're using `jupytext --pipe black` to automatically format Python code in MyST markdown notebooks (as part of pre-commit, but I don't think that's relevant here). The problem we're having is that (IIUC) the round trip through the percent format to pass the code to black strips out the language specification on the code-cell directives. So given:

````md
---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Quantum yield efficiency of photosynthesis

```{code-cell} python
# I'm some code
x = 1
```
````

Running that through `jupytext --pipe black` results in:

````md
---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Quantum yield efficiency of photosynthesis

```{code-cell}
# I'm some code
x = 1
```
````

That does affect other tools that rely on the language specification for syntax highlighting of code cells - we're using VS Code. I wondered if this might be tackled by setting `cell_metadata_filter = "all"`, but I think the language specification is not part of the cell metadata? I don't think any of the other settings in [config.py](https://github.com/mwouts/jupytext/blob/main/src/jupytext/config.py) tackle this either?

mwouts commented 3 months ago

Hi @davidorme , thank you for reporting this! We would need to make sure that this language information is preserved when the notebook is converted to a Jupyter notebook (the py:percent format will then, in turn, preserve the cell metadata).

Let me check with @chrisjsewell who knows that part better than I do, what happens to that language specification when the conversion occurs.

chrisjsewell commented 3 months ago

Will put it on the todo list to have a look 😅 but feel free to ping me again if I don't reply

davidorme commented 1 month ago

@chrisjsewell Sorry to ping you on this.

I've got jupyter-lab and `jupytext --pipe black` playing ping-pong with each other. When I'm writing docs in Jupyter as MyST Markdown files, those language tags are automatically added when the file saves (I'm assuming that this is something that jupytext does?). But then when I commit the file, the pre-commit setup using `jupytext --pipe black` throws them all out again 😄.

It's not a huge deal - we're just only committing files stripped of code-cell language information - but it would be good to fix it.

mwouts commented 1 month ago

Oh actually I realize that this is an issue that has been going on for a very long time! See #759, #778, #789.

What happens is that the language specification on the code cell comes from the language_info notebook metadata.

That information is in the notebook when you save it from Jupyter, but it is lost when you read the MyST file.

I see one immediate workaround: add the language_info metadata to your MyST notebooks by adding this to your jupytext.toml config:

notebook_metadata_filter="language_info"

On the longer term, I see two possible fixes:

  1. Apply the metadata filter before passing the notebook to MyST (e.g. if Jupytext is not configured to preserve the language info, then no cell would get the ipython3 lexer)
  2. Or, reconstruct the language_info within Jupytext, e.g. figure out how Jupyter does that, and do the same

My preference goes to 1 but I am curious to hear yours @chrisjsewell @davidorme @parmentelat
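
For illustration only, option 2 could be sketched roughly as below. `guess_language_info` and the lookup table are hypothetical names, not part of Jupytext, and the Python entries are assumed defaults copied from what Jupyter commonly writes for a Python 3 kernel; a real fix would more likely ask the kernel itself.

```python
# Hypothetical helper, NOT part of Jupytext's API.
# Assumed minimal language_info defaults, keyed by the kernelspec "language" field.
_LANGUAGE_INFO_DEFAULTS = {
    "python": {
        "name": "python",
        "file_extension": ".py",
        "mimetype": "text/x-python",
        "pygments_lexer": "ipython3",
    },
}


def guess_language_info(metadata):
    """Return a copy of the notebook metadata with language_info filled in
    when it is missing and can be inferred from the kernelspec."""
    updated = dict(metadata)
    if "language_info" in updated:
        # Never overwrite what Jupyter (or the user) already stored.
        return updated
    language = updated.get("kernelspec", {}).get("language")
    defaults = _LANGUAGE_INFO_DEFAULTS.get(language)
    if defaults is not None:
        updated["language_info"] = dict(defaults)
    return updated
```

Unknown languages are left untouched, so the sketch only ever adds information and never guesses a lexer it has no entry for.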

davidorme commented 1 month ago

I may have got this wrong but I have pyproject.toml with:

[tool.jupytext]
# Stop jupytext from removing mystnb and other settings in MyST Notebook YAML headers
notebook_metadata_filter = """
settings,
mystnb,
language_info
"""

And then a markdown file with YAML headers:

---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

If I run jupytext --pipe black file.md on that then the output reports:

[jupytext] Reading docs/source/users/demography/canopy.md in format md
[jupytext] Executing black -
All done! ✨ 🍰 ✨
1 file left unchanged.
[jupytext] Writing docs/source/users/demography/canopy.md in format md:myst

But all of the code-cell language specifications have been stripped.
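
A quick way to confirm the stripping is to list the `{code-cell}` fences that carry no language. The following is a minimal sketch using only the standard library; `bare_code_cells` is a hypothetical helper, and it assumes backtick fences like the ones shown earlier in this thread.

```python
import re

# Opening fence of three or more backticks followed by {code-cell};
# group 1 captures whatever follows the directive name (the language, if any).
_CODE_CELL = re.compile(r"^\s*`{3,}\{code-cell\}[ \t]*(.*)$")


def bare_code_cells(text):
    """Return the 1-based line numbers of {code-cell} fences with no language."""
    bare = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = _CODE_CELL.match(line)
        if match and not match.group(1).strip():
            bare.append(number)
    return bare
```

Running it on the file contents before and after `jupytext --pipe black` shows which cells lost their language.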

mwouts commented 1 month ago

I see! You still don't have a language_info metadata in your MyST file, that's why the Pygments lexers go away. To add that metadata to your MyST file, you will have to open it in Jupyter, and save it using the new config file.

davidorme commented 1 month ago

Alright. That took longer than expected:

But. With the config above committed and jupyter started in the project root so it actually reads that config, opening and saving a notebook in jupyter does add the following to the notebook YAML:

language_info:
  codemirror_mode:
    name: ipython
    version: 3
  file_extension: .py
  mimetype: text/x-python
  name: python
  nbconvert_exporter: python
  pygments_lexer: ipython3
  version: 3.11.9

Saving in jupyter also restores the language info to the code cells and now piping the notebook through black does not strip the language info. So the workaround works.
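
To check whether a given MyST file already carries that metadata, something like the sketch below may be enough. `header_has_key` is a hypothetical helper; it deliberately scans lines rather than parsing YAML, so it assumes the header is laid out like the examples in this thread (top-level keys unindented, header delimited by `---`).

```python
def header_has_key(text, key):
    """Return True if the YAML front matter has `key` as a top-level entry.

    Crude by design: no YAML parsing, just a scan of the unindented lines
    between the opening and closing '---' markers.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return False  # no front matter at all
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the header
        if line.startswith(key + ":"):
            return True
    return False
```

Files for which `header_has_key(text, "language_info")` is false are the ones that still need a save from Jupyter with the new config.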

> 1. Apply the metadata filter before passing the notebook to MyST (e.g. if Jupytext is not configured to preserve the language info, then no cell would get the ipython3 lexer)
> 2. Or, reconstruct the language_info within Jupytext, e.g. figure out how Jupyter does that, and do the same

I don't understand the boundaries between the different packages at all well, but if I understand correctly:

I'm not sure what (1) adds beyond the workaround - does it mean that Jupyter stops adding the code-cell lexer info so the notebook content is more stable? It seems like this could just be a documentation update to say that the default behaviour is not to retain lexer information in notebooks, but that adding language_info back into the retained metadata will allow lexer information to be retained?

davidorme commented 1 month ago

I think I've run into a workflow that - if I understand correctly - argues for option (2). This usage might be out of scope for jupytext but it feels like a reasonably natural thing to want to do.

The workflow is in creating MyST markdown notebooks for rendering using Sphinx. Users can of course create notebook content in Jupyter but one of the advantages (joys?) of the MyST markdown format is that you don't have to because it is human readable. So:

  1. If I'm working in a code editor, I can create a new markdown file that I want to be a MyST notebook:
     `touch simple.md`
  2. I can then set up the header YAML:
     `jupytext --set-format md:myst --set-kernel python3 simple.md`
  3. I've now got a file that I can use with myst-nb in Sphinx to generate content.
  4. But - if I've got this right - at present, the language_info metadata will only be inserted if I open the file in Jupyter and then save it, having set the notebook metadata filter to preserve the language_info metadata.
  5. So in this use case, my simple.md file will only be able to preserve the language on code-cell blocks if I open and save it through Jupyter.

That feels clunky. I get that jupytext is intended to primarily act as an interface with jupyter but with this workflow jupyter isn't really needed. If I understand right, your proposal (2) would allow jupytext to set the language_info in the same way that it sets format and kernel?