CZI EOSS timeline, planning and report.

This is a trimmed down version of the timeline and goal we planned in the original grant for better public tracking.

We propose overhauling the Jupyter and IPython interactive documentation framework with many features (inline graphs, navigation, indexing) while providing access to content (tutorials, how-to’s, examples, gallery) currently only accessible through hosted documentation websites. Building a better understanding of the Python Ecosystem’s documentation conventions into IPython and Jupyter will also enable desirable features such as local search, indexing, and cross-references, for a best-in-class documentation experience.

Python has multiple stories for documentation: docstrings, narrative documentation built via Sphinx and hosted online, and dynamic tutorials distributed as downloadable notebooks. However, despite this diversity, none of these offers a complete documentation experience.

When interactively exploring data in IPython/Jupyter-powered tools, the question-mark operator is the typical entry point to documentation. This has many advantages: it shows the documentation without the user having to search for the right page or know the types of objects. However, it lacks the richness of hosted documentation: it is limited to text, with no images, navigation, links, or indexing. It is also limited to docstrings and cannot expose the tutorials and narrative sections that are critical for discovery and understanding. Users are exposed to raw source containing LaTeX equations and reStructuredText directives, which makes for a poor user experience and poor accessibility.
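
To make the last point concrete, here is a minimal sketch of the raw text a docstring-only help system has to work with. The `gaussian` function and its docstring are invented for illustration; `inspect.getdoc` is the standard-library call for retrieving a cleaned docstring, and it returns the markup unrendered.

```python
import inspect
import math


def gaussian(x, sigma=1.0):
    r"""Evaluate an unnormalised Gaussian (hypothetical example function).

    .. math::

        f(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right)

    Parameters
    ----------
    x : float
        Point at which to evaluate the function.
    sigma : float, optional
        Standard deviation.
    """
    return math.exp(-(x ** 2) / (2 * sigma ** 2))


# Retrieve the docstring as plain text; directives and LaTeX are not rendered.
raw_doc = inspect.getdoc(gaussian)
print(raw_doc)
```

The printed text still contains the `.. math::` directive and the LaTeX source, which is roughly what a reader sees in a terminal today.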

While hosted documentation is better in some of these aspects, it is often scattered across the web, does not reflect versions of libraries installed by users, and is often shadowed in search engine results by poorly maintained click-bait articles, leading to confusion and poor coding habits among practitioners.

Library authors are forced to choose in their technical writing between prioritizing documentation for interactive sessions and hosted versions, leading to long-standing debates (SymPy’s syntax for equations) or complex and costly workarounds (dynamic docstring generation at runtime in Matplotlib, napari, and pandas).
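
As an illustration of what dynamic docstring generation means, the sketch below fills placeholders in a docstring when the module is imported. It is a simplified pattern under stated assumptions, not the actual implementation used by Matplotlib, napari, or pandas; the names `_SHARED_PARAMS`, `interpolate_doc`, and `total` are hypothetical.

```python
# Shared parameter descriptions, written once and reused (hypothetical content).
_SHARED_PARAMS = {
    "axis": "axis : int, optional\n        Axis along which to operate.",
    "dtype": "dtype : type, optional\n        Type of the returned values.",
}


def interpolate_doc(func):
    """Fill ``%(name)s`` placeholders in ``func.__doc__`` at import time."""
    # Docstrings can be None, for example when Python runs with -OO.
    if func.__doc__ is not None:
        func.__doc__ = func.__doc__ % _SHARED_PARAMS
    return func


@interpolate_doc
def total(values, axis=None, dtype=None):
    """Sum the given values.

    Parameters
    ----------
    values : sequence of numbers
        Values to add up.
    %(axis)s
    %(dtype)s
    """
    return sum(values)


print(total.__doc__)
```

The string formatting runs every time the module is imported, and the docstring in the source file no longer matches what users see, which is the kind of cost and complexity referred to above.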

Via a reusable framework we call Papyri, included in IPython/Jupyter, we can offer a state-of-the-art documentation experience to end-users. Our current proof of concept allows library authors to publish a semantic Intermediate Representation Documentation format (IRD). On users’ machines, tools can leverage IRD to provide access to the Python Ecosystem documentation’s full richness. Our prototype shows that the following features are in reach:

- Documentation from within IPython/Jupyter with rich text, images, and rendered mathematics.
- Access to narrative sections, tutorials, examples, and image galleries.
- Seamless integration and navigation across libraries.
- Better built-in accessibility features, and the ability to customise users’ preferences.
- Documentation that matches the versions of the libraries installed by the user.
- No need for dynamic docstring generation, and none of its performance impact on libraries.
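
The version-matching feature can be sketched as follows. This is only an illustration: the `pick_bundle` helper, the mapping of versions to files, and the `.ird` file names are assumptions and do not describe how Papyri stores or selects documentation. Only `importlib.metadata` is a real standard-library API.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def pick_bundle(distribution, available_bundles, version=None):
    """Select the documentation bundle matching the installed library version.

    ``available_bundles`` maps version strings to bundle locations
    (a hypothetical layout). Returns None when no bundle matches.
    """
    if version is None:
        version = installed_version(distribution)
    return available_bundles.get(version)


# Hypothetical bundles available on the user's machine.
bundles = {"1.21.0": "numpy_1.21.0.ird", "1.22.0": "numpy_1.22.0.ird"}
print(pick_bundle("numpy", bundles, version="1.22.0"))
```

Passing `version` explicitly keeps the example independent of what is installed; leaving it out makes the helper query the local environment instead.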

We believe the above is a first step toward enhancing the documentation experience for both consumers and authors. This project is key to development, quality, ease of use, and discoverability in a growing Python ecosystem.

Additionally, this framework will open the door to several other valuable features, such as allowing docstrings to be written in the widely used Markdown format, better configuration of end-user appearance and preferences, translations and domain-specific alternatives, and indexing.

There are three technical components that need to be addressed: 1) providing the tools to generate IRD from library source code, 2) installing and rendering IRD on users’ machines, and 3) uploading and distributing IRD files. For this proposal we request funding for the first two. The last one can be achieved by reusing other infrastructure such as GitHub Pages, GitHub Actions, or a conda-forge-like model.

The key user-facing components of this project require either extensions or changes within IPython and JupyterLab. Developing these as extensions allows great flexibility in the release timeline and allows integration with already released versions, widening the pool of users who can access early prototypes. Once the extensions are well developed and stabilized, those features can be migrated to core IPython and JupyterLab. The monthly minor releases of IPython make it easy to deliver these improvements to users regularly. We expect one major release of IPython in mid-2022, which would be the opportunity to make large changes if necessary. Major versions of JupyterLab are published on a cycle of about 6 months, which gives us several opportunities to make the Papyri extension part of the default set of shipped extensions.

Building and publishing of IRD files by libraries can be done after release of the library, therefore roadmaps of other projects we would build documentation for do not affect this project’s schedule.

A significant community investment is also necessary to provide the right models and get adoption across the scientific community. A number of projects are already using Sphinx with various configuration options and specialized extensions for each library. It will be critical to engage with those libraries to make sure the features they currently use and their documentation build processes can be accommodated by Papyri. As this will rely on developing a standard for IRD files to publish and ship documentation to users, agreement across the core Scientific Python ecosystem will need to be reached for the format of IRD files.

Year One:

The first six months will be targeted toward publishing a usable prototype to quickly gather feedback and drive user contribution.

- Review the core supported features and critical needs of existing Python libraries for a usable prototype
- Implement parsing of numpydoc-formatted docstrings
- Implement prototype JupyterLab and IPython extensions to render IRD files
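
As a rough idea of what parsing numpydoc-formatted docstrings into a structured record involves, here is a minimal sketch using only the standard library. It is not Papyri's parser, and the `record` layout (the `qualified_name` and `sections` keys) is a hypothetical stand-in, not the IRD format.

```python
import inspect
import json
import re


def split_sections(docstring):
    """Split a numpydoc-style docstring into a {section title: text} mapping.

    A section starts with a title line followed by a line of dashes.
    Text before the first title is stored under "Summary".
    """
    lines = inspect.cleandoc(docstring).splitlines()
    sections = {"Summary": []}
    current = "Summary"
    i = 0
    while i < len(lines):
        is_title = bool(
            i + 1 < len(lines)
            and lines[i].strip()
            and re.fullmatch(r"-{3,}", lines[i + 1].strip())
        )
        if is_title:
            current = lines[i].strip()
            sections[current] = []
            i += 2  # skip the title and its underline
            continue
        sections[current].append(lines[i])
        i += 1
    # Join each section and drop surrounding blank lines.
    return {name: "\n".join(body).strip() for name, body in sections.items()}


EXAMPLE = """Add two numbers.

    Parameters
    ----------
    a : int
        First operand.
    b : int
        Second operand.

    Returns
    -------
    int
        The sum of ``a`` and ``b``.
    """

# Hypothetical record layout; a real IRD file would carry much more structure.
record = {"qualified_name": "example.add", "sections": split_sections(EXAMPLE)}
print(json.dumps(record, indent=2))
```

A renderer in IPython or JupyterLab could then work from such a structured record instead of re-parsing raw text each time help is requested.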

Months 6 to 12 will revolve around presenting progress at SciPy to expand adoption.

- Publish an initial draft of IRD files for the development versions of at least 5 core Scientific Python projects (e.g. SciPy, NumPy, scikit-image, Matplotlib, pandas)
- Provide an alpha release for early user feedback and adoption
- Parse and cross-link narrative documentation and examples
- Prepare in-person events during SciPy 2022
  - Presentation at SciPy (conditional on talk acceptance)
  - In-person workshop
  - In-person user study

Year Two:

The second year focuses on growth and on extending functionality, both of which are critical for a self-sustaining project and for seeking future sources of funding.

- Review of UX and design feedback collected during SciPy
- Beta release of the extensions and IRD, with most features, design, API, and configuration options considered stable enough for end users
- Publish a draft specification of the stabilized IRD format

The last six months will be marked by the second presentation at SciPy, stabilisation, and the release of a first stable version as part of IPython and Jupyter.

- Presentation at SciPy 2023
- Second in-person meeting
- Automatic building and publication of IRD for multiple libraries of the Scientific Python Ecosystem
- JupyterLab uses IRD when available, and may suggest installing or updating missing IRD files

Deliverables consist of both an implementation and a specification of the IRD format, in order to allow and encourage competing implementations and tooling. This includes:

- Specification of an intermediate representation documentation format (IRD)
- Extensions or core components for IPython and JupyterLab to render IRD files
- CLI and library to generate IRD, suitable for most library authors
- Registry to publish and install IRD files
- CLI and Python library to install from the above registry
- Automatic building of IRD files for core libraries of the Python Ecosystem

As with many open source projects, it can be difficult to obtain metrics of success, especially since download numbers can be heavily biased by Continuous Integration installations. While IRD download counts would be a better measure, collecting them requires infrastructure investment which is not included in this proposal. We will thus try to infer user and library adoption using the following proxy metrics.

- Number of libraries in the Python ecosystem that publish IRD as part of their release process, as a metric of adoption by libraries and maintainers.
- Number of third-party users who publish IRD for their preferred libraries (without library author involvement), as a metric of how much user interest there is in IRD.
- Qualitative user engagement on social media, blogs, tutorials, and talks about this project.
- Number of issues/PRs opened by unique users.

jupyter / papyri

CZI EOSS timeline, planning and report. #138