innovationOUtside / ou-jupyter-book-tools

Tools for working with Jupyter Books
MIT License

Pre-processing documents in Sphinx Pipeline #5

Open psychemedia opened 2 years ago

psychemedia commented 2 years ago

I haven't found an obvious preprocessor pipeline for transforming documents before Sphinx gets to work on them, but we could create a package to act as a custom handler that performs the preprocessing steps.

For example, Jupyter Book docs suggest:

# The string should point to a Python function that will be loaded by importing it (here: nbformat.reads)
# (additional parameters can then also be passed in)
# The function should take a file’s contents (as a str)
# and return an nbformat.NotebookNode
# https://nbformat.readthedocs.io/en/stable/api.html
sphinx:
  config:
    nb_custom_formats:
        .ipynb:
            - nbformat.reads
            - as_version: 4

I think we can pass extra parameters in the same way as as_version.
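As a sketch of what such a loader with an extra parameter might look like (the function name `reads_filtered` and the `remove_tag` parameter are made up; this works on the raw notebook JSON to stay self-contained, whereas a real loader should return an `nbformat.NotebookNode`, e.g. via `nbformat.from_dict`):

```python
import json

def reads_filtered(text, remove_tag="del_cell"):
    """Hypothetical nb_custom_formats loader: parse the raw notebook JSON
    and drop any cells tagged with remove_tag.
    A real loader should wrap the result as an nbformat.NotebookNode
    (e.g. nbformat.from_dict(nb)) rather than returning a plain dict."""
    nb = json.loads(text)
    nb["cells"] = [
        cell for cell in nb.get("cells", [])
        # Keep only cells whose metadata tags don't include remove_tag
        if remove_tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return nb
```

The extra parameter would then presumably be passed from the config in the same position as `as_version: 4`, e.g. as `- remove_tag: del_cell`.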

The following fragment may also be handy for coding up a notebook preprocessor (the Sphinx loader should perhaps use nbformat.from_dict(json.loads(nb_body)) as the return value?)

# Via: https://stackoverflow.com/a/58574138/454773

import json
from traitlets.config import Config
from nbconvert import NotebookExporter
import nbformat

c = Config()
c.TagRemovePreprocessor.enabled = True  # Enable the preprocessor
c.TagRemovePreprocessor.remove_cell_tags = ("del_cell",)
# Register the preprocessor on the exporter; a bare c.preprocessors setting has no effect
c.NotebookExporter.preprocessors = ["nbconvert.preprocessors.TagRemovePreprocessor"]

nb_body, resources = NotebookExporter(config=c).from_filename('notebook.ipynb')
nbformat.write(nbformat.from_dict(json.loads(nb_body)), 'stripped_notebook.ipynb', 4)

Docs for creating a custom nbconvert preprocessor are in the nbconvert documentation.

The example given is to create the preprocessor class:

from traitlets import Integer
from nbconvert.preprocessors import Preprocessor

class PelicanSubCell(Preprocessor):
    """A Pelican specific preprocessor to remove some of the cells of a notebook"""

    start = Integer(0,  help="first cell of notebook to be converted").tag(config=True)
    end   = Integer(-1, help="last cell of notebook to be converted").tag(config=True)

    def preprocess(self, nb, resources):
        self.log.info("I'll keep only cells from %d to %d", self.start, self.end)
        nb.cells = nb.cells[self.start:self.end]
        return nb, resources

There also looks to be a preprocess_cell method that can be overridden; presumably the cells are iterated over, letting you process each cell in turn.

Then configure the pipeline:

# Create a new config object that configures both the new preprocessor and the exporter
from traitlets.config import Config
from nbconvert import RSTExporter

c = Config()
c.PelicanSubCell.start = 4
c.PelicanSubCell.end = 6
c.RSTExporter.preprocessors = [PelicanSubCell]

# Create our new, customized exporter that uses our custom preprocessor
pelican = RSTExporter(config=c)

# Process a notebook (jake_notebook is a previously loaded NotebookNode)
print(pelican.from_notebook_node(jake_notebook)[0])
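As a sketch of the per-cell route mentioned above (an illustrative class, not from the thread): preprocess_cell is called once per cell with the cell, the resources dict, and the cell index, so per-cell edits can go there instead of overriding preprocess:

```python
from nbconvert.preprocessors import Preprocessor

class StripMagicsPreprocessor(Preprocessor):
    """Illustrative example: remove IPython line-magic lines from code cells."""

    def preprocess_cell(self, cell, resources, index):
        # Called once for each cell in turn by the default preprocess()
        if cell.cell_type == "code":
            cell.source = "\n".join(
                line for line in cell.source.splitlines()
                if not line.lstrip().startswith("%")
            )
        return cell, resources
```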
psychemedia commented 2 years ago

For processing rich markdown, such as tagged md cells in a Jupytext-readable format, we could have a custom .md handler that reads the doc into notebook format using jupytext, then preprocesses it.

sphinx:
  config:
    nb_custom_formats:
        .Rmd:
            - jupytext.reads
            - fmt: Rmd
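A sketch of what that handler might look like (hypothetical function names; assumes jupytext is installed, and reuses the del_cell tag convention from above — the config would then point .Rmd at this function in place of jupytext.reads):

```python
def drop_tagged_cells(nb, remove_tag="del_cell"):
    """Remove cells tagged remove_tag; works on an nbformat.NotebookNode
    (or any dict-shaped notebook)."""
    nb["cells"] = [
        c for c in nb["cells"]
        if remove_tag not in c.get("metadata", {}).get("tags", [])
    ]
    return nb

def reads_rmd(text, fmt="Rmd"):
    """Hypothetical nb_custom_formats handler: parse with jupytext, then
    preprocess before Sphinx sees the notebook."""
    import jupytext  # imported lazily; assumes jupytext is available
    return drop_tagged_cells(jupytext.reads(text, fmt=fmt))
```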