jupyter / papyri


CZI EOSS timeline, planning and report. #138

Open Carreau opened 2 years ago

Carreau commented 2 years ago

This is a trimmed-down version of the timeline and goals we planned in the original grant, posted here for better public tracking.


We propose overhauling the Jupyter and IPython interactive documentation framework with many features (inline graphs, navigation, indexing) while providing access to content (tutorials, how-to’s, examples, gallery) currently only accessible through hosted documentation websites. Building a better understanding of the Python Ecosystem’s documentation convention into IPython and Jupyter will also augment those capabilities with many desirable features, like local search, indexing, cross-references, and many others for a best-in-class documentation experience.

Python has multiple stories for documentation: docstrings, narrative documentation built via Sphinx and hosted online, and dynamic tutorials distributed as downloadable notebooks. However, despite this diversity, none of these offers a complete documentation experience.

When interactively exploring data in IPython/Jupyter powered tools, the question-mark operator is the typical entry point to access documentation. This has many advantages: it shows the documentation without the user having to search for the right page or know the types of objects. However, it lacks the richness of hosted documentation: it is limited to text, with no images, navigation, links or indexing. It is limited to docstrings and cannot expose the tutorials and narrative sections critical for discovery and understanding. Users are exposed to raw source containing LaTeX equations and reStructuredText directives, making for a poor user experience and a lack of accessibility.
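The raw-markup problem can be reproduced with the standard library alone. The following is a minimal sketch, assuming that what the question-mark operator displays is close to what `inspect.getdoc` returns; the function `gaussian` is a made-up example and not part of any library.

```python
import inspect
import math


def gaussian(x, sigma=1.0):
    r"""Evaluate an unnormalised Gaussian.

    .. math::

        f(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right)

    See Also
    --------
    numpy.exp : Element-wise exponential.
    """
    return math.exp(-(x ** 2) / (2 * sigma ** 2))


# Text-only help shows the markup verbatim: the reStructuredText directive
# and the LaTeX source reach the reader unrendered.
raw_doc = inspect.getdoc(gaussian)
print(raw_doc)
```

The printed text still contains `.. math::` and the LaTeX commands, which is the experience described above.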

While hosted documentation is better in some of these aspects, it is often scattered across the web, does not reflect versions of libraries installed by users, and is often shadowed in search engine results by poorly maintained click-bait articles, leading to confusion and poor coding habits among practitioners.

Library authors are forced in their technical writing to choose between prioritizing interactive-session documentation and hosted versions, leading to long-standing debates (SymPy’s syntax for equations), or complex and costly workarounds (Matplotlib, Napari and Pandas dynamic docstring generation at runtime).
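To make "dynamic docstring generation at runtime" concrete, here is a generic sketch of the pattern. The names `SHARED`, `fill_doc` and `total` are hypothetical and chosen for illustration; this is not the code any of the libraries named above actually uses.

```python
# Hypothetical shared snippet reused by several functions.
SHARED = {"axis": "axis : int, optional\n        Axis along which to operate."}


def fill_doc(func):
    """Substitute %(name)s placeholders in the docstring at import time."""
    if func.__doc__ is not None:  # docstrings are stripped under `python -OO`
        func.__doc__ = func.__doc__ % SHARED
    return func


@fill_doc
def total(values, axis=0):
    """Sum values.

    Parameters
    ----------
    values : sequence
        Numbers to add.
    %(axis)s
    """
    return sum(values)


print(total.__doc__)
```

The substitution runs every time the module is imported, whether or not anyone reads the docstring, which is one source of the cost mentioned above.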

Via a reusable framework we call Papyri, included in IPython/Jupyter, we can offer a state-of-the-art documentation experience to end-users. Our current proof of concept allows library authors to publish a semantic Intermediate Representation Documentation format (IRD). On users’ machines, tools can leverage IRD to provide access to the Python Ecosystem documentation’s full richness. Our prototype shows that the following features are in reach:

We believe the above is a first step to enhance the documentation experience for both consumers and authors. This project represents the key for the development, quality, ease of use, and discoverability in a growing Python ecosystem.

Additionally, this framework will open the door to several other valuable features, such as allowing docstrings to be written in the widely-used markdown format, better configuration of end-user appearance and preferences, translations and domain-specific alternatives, indexing, and others.


There are three technical components that need to be addressed: 1) providing the tools to generate IRD from library source code, 2) installing and rendering IRD on users’ machines, and 3) uploading and distributing IRD files. For this proposal we request funding for the first two. The last one can be achieved by reusing other infrastructures like GitHub Pages, GitHub Actions, or a conda-forge-like model.
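The first two components can be illustrated with a toy round trip. This is only a sketch under stated assumptions: the JSON layout and the functions `build_toy_ird` and `render_entry` are invented for illustration and do not reflect the actual IRD format or the Papyri API.

```python
import inspect
import json


def build_toy_ird(module):
    """Step 1 (maintainer side): collect public callables' docstrings."""
    entries = {}
    for name, obj in inspect.getmembers(module, callable):
        if name.startswith("_"):
            continue  # skip private helpers
        # inspect.getdoc returns None when there is no docstring.
        entries[f"{module.__name__}.{name}"] = inspect.getdoc(obj) or ""
    return {"module": module.__name__, "entries": entries}


def render_entry(ird, qualified_name):
    """Step 2 (user side): turn one stored entry into displayable text."""
    underline = "=" * len(qualified_name)
    return f"{qualified_name}\n{underline}\n{ird['entries'][qualified_name]}"


# Generate, serialise as if shipping a file, then load and render.
serialised = json.dumps(build_toy_ird(json))
loaded = json.loads(serialised)
print(render_entry(loaded, "json.dumps"))
```

Separating the build step from the render step is what lets the expensive parsing happen once, on the maintainer's side, while the user's machine only loads and displays stored data.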

The key user-facing components of this project require either extensions or changes within IPython and JupyterLab. Developing these as extensions allows great flexibility in the release timeline and allows integration with already released versions, widening the pool of users who can access early prototypes. Once the extensions are well-developed and stabilized, those features can be migrated to the core of IPython and JupyterLab. The IPython monthly minor releases make it easy to regularly deliver these improvements to users. We expect one major release of IPython in mid 2022, which would be the opportunity to make large changes if necessary. Major versions of JupyterLab are published on a cycle of about 6 months, which gives us several opportunities to make the Papyri extension part of the default set of shipped extensions.

Building and publishing IRD files for a library can be done after the library is released; therefore, the roadmaps of other projects whose documentation we would build do not affect this project’s schedule.

A significant community investment is also necessary to provide the right models and get adoption across the scientific community. A number of projects are already using Sphinx with various configuration options and specialized extensions for each library. It will be critical to engage with those libraries to make sure the features they currently use and their documentation build processes can be accommodated by Papyri. As this will rely on developing a standard for IRD files to publish and ship documentation to users, agreement across the core Scientific Python ecosystem will need to be reached for the format of IRD files.


Year One:

The first six months will be targeted toward publishing a usable prototype to quickly gather feedback and drive user contribution.

Months 6 to 12 will revolve around presenting progress at SciPy to expand adoption.

The second year focuses on growth and extending functionality, which is critical for a self-sustaining project and for seeking future sources of funding.

The last six months will be marked by the second presentation at SciPy, stabilisation, and the release of a first stable version as part of IPython and Jupyter.

Deliverables consist of both an implementation and a specification of the IRD format, in order to allow and encourage competing implementations and tooling. This includes:

As with many open source projects, it can be relatively difficult to obtain success metrics, especially since download numbers can be heavily biased by Continuous Integration installations. While IRD download counts would be a better measure, collecting them requires infrastructure investment, which is not included in this proposal. We will thus try to infer user and library adoption using different proxy metrics.

Carreau commented 2 years ago

Let's try to summarise the first year a bit. I am using these comments as a draft; I still need to write an actual report as a PDF for NumFOCUS, but this helps with community and transparency.

Here are some of the high level points from the grant:

Documentation from within IPython/Jupyter with rich text, images, and rendered mathematics.

The JupyterLab extension can be found here; we do have rich text, images, and part of the navigation.

[Screenshot, 2022-04-14 10:13:13]

While math rendering is not done, it should be trivial; I just need to find someone with more experience in JupyterLab to avoid reinventing the wheel. Hooking into the ?/?? operators is not done either.

Access to narrative sections, tutorial, examples, and image gallery.

The gallery and examples are functional in the standalone rendering but have not been done in the JupyterLab application, though we know those are doable. Narrative sections are one of the places that have seen the least progress, due to the complexity of, and differences between, each library, though we do have static rendering.

Seamless integration and navigation across libraries.

Backward and forward links and navigation work in 90% of cases; there are still edge cases we are working on.

Better in-built accessibility features, and the ability to customise users’ preferences.

We haven't started looking into accessibility; it will come later, once the JupyterLab extension is further along.

Ensure documentation matches the user's installed libraries version. Avoid dynamic docstring generation and their performance impact on libraries.

Neither of these goals has been started, as the project is still too early in the alpha stage.


Year One:

The first six months will be targeted toward publishing a usable prototype to quickly gather feedback and drive user contribution.

While the prototype has been published, we had less engagement from users than planned. While many people expressed interest, few took the time to install it and report back on features and limitations. We did receive expressions of interest from the tech industry in the San Francisco Bay Area, as well as from other documentation-oriented projects.

Review the core supported features and critical needs from existing Python libraries for a usable prototype

Implement Parsing of Numpydoc formatted Docstrings

We manage to parse and render most of the documentation for multiple projects; a static version can be seen here. Most of NumPy and SciPy is covered, and we are pushing compatibility with networkx, skimage, dask, and astropy.
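For readers unfamiliar with the Numpydoc layout, the first step of such parsing can be sketched with the standard library. This is an illustrative toy, not the parser papyri uses; `split_numpydoc_sections` and `EXAMPLE` are hypothetical names, and the sketch assumes a section is a title line followed by a line made only of dashes.

```python
import inspect


def split_numpydoc_sections(docstring):
    """Split a Numpydoc-style docstring into {section title: body lines}.

    Text before the first section header is stored under "Summary".
    """
    lines = inspect.cleandoc(docstring).splitlines()
    sections = {"Summary": []}
    current = "Summary"
    i = 0
    while i < len(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        # A header is a non-empty line whose next line is only dashes.
        is_header = bool(lines[i].strip()) and set(nxt.strip()) == {"-"}
        if is_header:
            current = lines[i].strip()
            sections[current] = []
            i += 2  # skip the title and its underline
            continue
        sections[current].append(lines[i])
        i += 1
    return sections


EXAMPLE = """
Add two numbers.

Parameters
----------
a : int
    First operand.
b : int
    Second operand.

Returns
-------
int
    The sum.
"""

sections = split_numpydoc_sections(EXAMPLE)
print(list(sections))  # ['Summary', 'Parameters', 'Returns']
```

Real docstrings deviate from this layout in many small ways (inconsistent underlines, stray indentation, unknown section names), which is the kind of thing that surfaces as documentation errors when parsing at scale.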

More than two dozen documentation errors have been found in NumPy and SciPy while working on papyri.

https://github.com/numpy/numpy/pulls?q=is%3Apr+author%3ACarreau+is%3Aclosed

https://github.com/scipy/scipy/pulls?q=is%3Apr+author%3ACarreau+is%3Aclosed

Implement prototype JupyterLab and IPython extensions to render IRD files

The JupyterLab prototype extension works but is not yet easy for users to install, as it is still tightly coupled with the main papyri codebase.