Strategy for documentation

jpivarski commented 4 years ago

High-quality documentation is a serious need. Uproot has one big page of tutorial (which doubles as a Binder-based interactive notebook), readthedocs for references, and StackOverflow for example-driven questions. Awkward 0 only has a big page tutorial.

Some observations:

The big-page tutorial was motivated by early observations that users seemed to know what was on the GitHub front page but not what was in readthedocs. "Search in page" seems to be much more common than clicking through to find things. Therefore, I moved all the introductory/tutorial-like text out of readthedocs and into the GitHub front page.
However, this does not scale. It's hard to add new content to this monolith because it breaks the flow of text. When I have serious changes to make (e.g. Uproot 2 → 3), I take a few days and rewrite the whole thing.
But it's wrong to assume that anybody reads this from top to bottom—surely, they're searching for examples, and it must be broken up to fit that pattern.
Almost nobody uses the interactive tutorial. It was broken for months (the notebook JSON was badly formatted—it couldn't even be loaded as a notebook) before somebody complained. Creating interactive tutorials is not time well spent. Removing this requirement also relaxes the one-big-page problem, because the big page was designed to be the same as the notebook.
Although I like to write about concepts and the big-picture vision of columnar analysis, there's more of a demand for "how do I...?" documentation that focuses on specific problems. I can't know all of these problems in advance, so the set of little "how do I...?" pages has to be an easily expandable set.
I had hoped that StackOverflow would provide the "how do I...?" examples, but so far, there has been considerable confusion about what is a usage question (for StackOverflow) and what is a bug report (for GitHub Issues). I'd like to have community help in filling out usage examples, but a pattern needs to be established first to give a sense of what kinds of things should have recipes and what kinds of things are just broken and need to be fixed.
One thought I had was to write the how-to documentation on StackOverflow by phrasing each task as a question and answering my own questions. (A bit like Jeopardy.) I found StackOverflow Documentation, which sounded promising, until I found out that it was cancelled one year later. Yikes! I'm glad I didn't try that.
Another promising option is to write a Wikibook. This would satisfy the constraints of being mostly single author (me) while still being open to community contributions (same openness as Wikipedia—no user accounts required!), it is easily editable (though a different syntax than markdown), and it isn't going anywhere. Wikibooks is an old project (started 2003), with peak interest around 2008, which makes it old and boring, and that's great for long-term support. Having settled into its stride, it seems to be mostly used for manuals, particularly LaTeX, C/C++, and radiation oncology (see previous link: related searches). The Haskell Wikibook is an example of what a finished book can look like, with searching capabilities, PDF/e-book versions, beginner's track/advanced track, and good syntax coloring. I'm strongly considering Wikibooks.
GitBook also has the PDF/e-book features and the focus on writing software manuals, but it adds the possibility of interactivity. It's young and hip, yet it has been widely popular since 2016 and has reached about the same level (in Google searches) as Wikibooks. Whereas Wikibooks has edit links on every page, inviting minimum-barrier contributions, GitBook is git-backed and has a pull-request process. (Or, it would be because I'd link GitBook to GitHub and edit it there.) This puts two things in GitBook's favor: (1) editing doesn't have to be online and (2) we can easily back up all the content with git clone. I'd prefer not to put too many barriers in the way of users wanting to contribute, and if I didn't know better, I'd think that "get GitHub account, submit pull request" would be a lot more prohibitive than "click on the 'edit' link." However, we have plenty of experience with Twiki to know that our community doesn't like to edit "somebody else's" wiki and is surprisingly more willing to go through the pull request process. (And just about everybody in our field has a GitHub account.) Maybe people feel better about suggesting an edit than making an edit because they're humble enough to want review. Anyway, I'm even more strongly considering GitBooks.
I think the GitHub front-pages for Awkward 1.0 and Uproot 4.0 should be kept clean, with minimal content except links to the documentation. Awkward should probably get three big buttons: "Documentation for data analysts," "Documentation for framework developers," and "Documentation for Awkward developers." The first covers ak.Array and its operations; the middle covers layouts and writing software that depends on Awkward; the last covers the project as a whole, including "how to contribute."
The C++ should all get Doxygenized, the Python should all get Sphinxized, and if it's possible to put both of these on readthedocs, that's what I'll do. Or maybe I should try to centralize things using something like Doxybook2 to turn Doxygen XML into GitBook pages (that can then be integrated with the main GitBook documentation). For Python, perhaps doxypypy can be used to turn Python docstrings into Doxygen, which then goes to GitBook. This project seems to have plateaued, but why shouldn't it? It just has to turn human-readable docstrings into the appropriate Doxygen tags. I'm strongly considering a doxypypy → Doxygen → GitBook workflow for Python, and Doxygen → GitBook for C++.
I don't want to deal with Sphinx. Writing reST is no fun and even with autodoc, you have to remember to add reST to invoke the autodoc. The "Documentation for data analysts" (and possibly "Documentation for framework developers") should have hand-picked links to useful functions, but the "Documentation for Awkward developers" should always include all classes, functions, etc.
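To make the docstring-to-Doxygen idea in the observations above concrete, here is a minimal sketch of the kind of human-readable docstring such a converter would start from. The function count_items is hypothetical (it is not part of Awkward), and the Google-style Args:/Returns: layout is an assumption about docstring style. The tags named in the comment (@param, @return) are standard Doxygen commands, but the exact output of doxypypy is not shown here.

```python
def count_items(lists):
    """Count the number of items in each inner list.

    Args:
        lists: A list of lists (hypothetical stand-in for a jagged array).

    Returns:
        A list of integers, one count per inner list.
    """
    # A docstring-to-Doxygen filter would need to map the "Args:" entries to
    # @param tags and the "Returns:" section to an @return tag.
    return [len(inner) for inner in lists]


# The docstring stays readable in plain Python (help(), IDE tooltips)
# whether or not any converter is ever run on it.
counts = count_items([[1, 2, 3], [], [4, 5]])
```

The point of this design is that the source of truth stays an ordinary docstring, so switching toolchains later would not require rewriting it.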

Thoughts? Suggestions?

Let me know below. With documentation, it's hard to back up and use something different once some serious writing has begun because every toolchain has consequences on how the text is formatted. I want to get this right the first time.

jpivarski commented 4 years ago

@henryiii's class as a JupyterBook: https://henryiii.github.io/compclass-book/week1/0_IntroductionAndLogin

henryiii commented 4 years ago

See my comment here: https://github.com/scikit-hep/scikit-hep-tutorials/issues/1.

For a single package, the nbsphinx extension works well for examples (see boost-histogram's examples). For cross-package tutorials, I think JupyterBook would be the best option.

henryiii commented 4 years ago

Note: GitBook is actually the name of an abandoned Open Source software package that I used to make Modern CMake, the CLI11 tutorials, and a few more, and has been replaced by a service. We've been having to migrate away from GitBook; for example the LHCb starter kit was converted to a different technology a couple of months ago.

jpivarski commented 4 years ago

I didn't want to get into the details in the meeting, but I think cross-package tutorials and JupyterBook-based Awkward-only tutorials are a good idea. For the time being, I'm trying to figure out a strategy for the Awkward-only tutorials. I'll be doing a big documentation push pretty soon, and I want to go in the right direction.

Doxygen will be involved, definitely for C++ and probably for Python. (That doxypypy looks pretty good. I don't want to have to remember to add autodoc stubs in reST files; I've forgotten it too many times in Uproot docs.)

The main thing will be nugget-sized "how tos" and maybe one "getting started" tutorial. I'd kinda like the how-tos, tutorials, and reference docs (Doxygen output) to be together—maybe in the same format, same site, and/or same toolchain.

I'll take a look at JupyterBooks.

jpivarski commented 4 years ago

Well, JupyterBook isn't popular, but that's fine as long as it isn't in danger of going away the way StackOverflow Documentation did. That's the important point.

But supposing the community loses interest in JupyterBook, we'd be left with documentation in a bunch of Jupyter notebooks, which isn't a major problem. There will be other ways to automatically convert them into documentation. (Contrast this with all the Google Code wiki markup I wrote to document old projects. It's kinda readable as Markdown...)

I know for a fact, though, that nobody reads the Binder tutorial on Uproot. The fact that nobody complained when it was broken for months is a strong indication. Binder isn't the solution: it takes too long to load.

jpivarski commented 4 years ago

Thebelab is nice. It takes 20 seconds to start a kernel for interactivity, which is longer than casual readers will wait, but it doesn't take you away from the page the way a Binder link would.

Maybe the following would work:

Host all documentation in the same repo as Awkward itself, i.e. this scikit-hep/awkward-1.0 site (whose name will change to scikit-hep/awkward-array). This GitHub blog says that I can do it with a docs folder, which I have. The awkward-array.org domain is open, but maybe https://scikit-hep.org/awkward-1.0 would just work (once some content is there).
Generate the C++ and Python reference documentation (API) with Doxygen, sourced from Doxygen-style comments (C++) and docstrings (Python, via doxypypy). Doxygen generates static HTML that I'd host as-is. On the plus side, it would look like standard Doxygen, and that helps people who are familiar with that. (Reference documentation is for programmers.)
Use JupyterBook to generate a Jekyll site on github.io. Everything would be vanilla: no customizations, though Thebelab should be turned on. Jekyll must have an option for including static HTML (from Doxygen) in the same site that is generated from JupyterBook.
I would prefer writing the text in Jupytext with "percent" format, instead of in-browser notebooks. Then I can use Emacs and Atom/Hydrogen for writing (which I highly prefer). I'd generate the output with jupyter-book build --execute.
It would be nice to include automated testing—the Uproot docs have gotten out of date, from time to time. Since the source files for all of these documents are Jupyter notebooks, there must be an automated way to run them and at least verify that they raise no exceptions. (And I suppose I could put assertions in hidden cells.) Note: in Jupytext's "percent" format (*.pct.py), testing for exceptions is just a matter of running the script. All the non-Python metadata is in comments.
I'd build the HTML locally while writing, but in general, the Doxygen build and maybe Jupyter notebook check should be run automatically in an Azure pipeline when a PR is pushed to master. That would be a third pipeline, beside the build/test triggered by commits to all branches and the deploy triggered by a new tag.
The content would be divided into "Awkward Data Analysts," "Awkward Framework Builders," and "Awkward Developers." They should probably be separate books, with three links on the GitHub front-page. The "Awkward Data Analysts" book would be dominated by (exclusively?) example-driven how-tos, which don't fit a book format very well: it would be a long list in the left-column. The incremental search in JupyterBook is nice, if users notice it.
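The automated check described in the list above (run every "percent"-format page and verify that it raises no exceptions) can be sketched with the standard library alone. This is a minimal sketch, not the project's actual pipeline: the check_pages helper, the page contents, and the file names are all hypothetical. It relies only on the fact that a percent-format file is a plain Python script whose cell markers and Markdown text are comments.

```python
import runpy
import tempfile
import traceback
from pathlib import Path

# A hypothetical how-to page in Jupytext "percent" format: cell markers and
# Markdown text are ordinary Python comments, so the file is a plain script.
EXAMPLE_PAGE = '''\
# %% [markdown]
# # How do I sum the items of each list?

# %%
lists = [[1, 2, 3], [], [4, 5]]
sums = [sum(x) for x in lists]

# %%
# In the built book this cell could be hidden; here it acts as the test.
assert sums == [6, 0, 9]
'''


def check_pages(directory, pattern="*.pct.py"):
    """Run every matching script under `directory`.

    Returns a dict mapping the path of each failing page to its traceback.
    """
    failures = {}
    for path in sorted(Path(directory).rglob(pattern)):
        try:
            # Execute the page as if it were run as a script.
            runpy.run_path(str(path), run_name="__main__")
        except Exception:
            failures[str(path)] = traceback.format_exc()
    return failures


# Usage: one page that passes and one stale page that raises.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sum-lists.pct.py").write_text(EXAMPLE_PAGE)
    Path(tmp, "broken.pct.py").write_text(
        "# %%\nraise ValueError('stale example')\n"
    )
    failures = check_pages(tmp)

for page, tb in failures.items():
    print(f"FAILED: {page}\n{tb}")
```

A CI job could call something like check_pages on the docs folder and fail the build when the returned dict is non-empty. Running each page in-process keeps the sketch short; running each one in a separate interpreter would isolate pages from one another at the cost of speed.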

henryiii commented 4 years ago

JupyterBook is fairly new. And it is basically vanilla, except for optional cell customizations, such as the ability to hide a cell (like a setup cell) in the HTML output; I don't think you'd need those and they wouldn't have a strong effect anyway, and they are just normal cell metadata.

The good thing about JupyterBook, besides the fact that you can use the Jupyter notebooks directly, and even view them directly on GitHub, is that it is not the only notebook-to-static-site system. nbsphinx for Sphinx does the same thing, so if JupyterBook goes under, we could switch to Sphinx fairly quickly. Some other static site generators are gaining support for notebooks too. It's just currently the nicest and is now an official part of the Jupyter organization. (I think this happened recently!)

I think the outline you've described above is very much what I was thinking, though I'd use GitHub Actions, as it's a little simpler and you don't need the release mechanism. I also would avoid checking in the generated html into the source (which I currently am doing for the demo site, but I think I can work around that) - it would be force pushed to a gh-pages branch or similar. I'll be playing with this soon(ish) for a tutorial setup and will update you when I have it working.

jpivarski commented 4 years ago

I'd do it in GitHub Actions if I could move all CI/CD to GitHub Actions. I don't know that I need to do that migration first, though.

Reading about this has introduced me to Jupytext, which is great! The plain files that Jupytext uses are more version control-friendly and I wouldn't have to launch a JupyterLab every time I want to edit something (which is a bottleneck for me).

nbsphinx (which I hadn't known about before, either) starts from ipynb notebooks, but a Jupytext → nbsphinx → readthedocs pipeline can be a fallback plan if JupyterBook disappears.

I don't want to save HTML output in GitHub either. At least not the main scikit-hep/awkward-1.0 repository. Maybe the documentation build process can send whatever Jekyll/GitHub Pages needs to a repo that's not meant to be edited directly.

The JupyterBook documentation suggests Netlify, which could be a good repository for static HTML, if that's what the documentation build process produces. I used to use AWS buckets for that: static HTML hosting is free. But as much as possible, I'd like this to be unconnected from personal accounts (such as my personal Azure account) and on shared accounts (like the scikit-hep organization). Maybe Zeit. They look friendly.

henryiii commented 4 years ago

I've been using Azure for wheels and GHA for docs and tests, and that's been working well. They all run on pretty much the same Microsoft backends, and the configs are mostly the same except for a 1:1 term mapping. I don't think there's harm in selecting the one that's best for a particular job. It's only been in the last year or so that we could run on a single system; it used to be a different system per supported OS at least. Note there is no setup at all for GHA, since it's built in and always available. You already have an actions tab across the top of all your repositories. It just runs based on the presence of files .github/workflows/*.yml.

Jupytext sounds very promising, yes; I only heard about it recently. Launching JupyterLab is a pain.

IRIS-HEP has a dual-repo setup for hosting, and that does work.

I haven't looked into Netlify.

henryiii commented 4 years ago

Note for related projects (vector, boost-histogram, etc): This strategy is probably a bit different, since Awkward is using Doxygen instead of Sphinx. If Sphinx is already being used, then nbsphinx is probably the correct choice.

jpivarski commented 4 years ago

From @kratsg: look into "breathe" and "exhale". They turn Doxygen output into Sphinx documentation.

lukasheinrich commented 4 years ago

also see @cranmer's jupyterbook

https://cranmer.github.io/madminer-tutorial/intro

henryiii commented 4 years ago

Also mine: https://henryiii.github.io/compclass (repo: https://github.com/henryiii/compclass). I've set up GHA to run the notebooks and produce the JupyterBook, and no output is saved in the repo.

jpivarski commented 4 years ago

This is pretty much figured out:

Doxygen for C++
Sphinx for Python
JupyterBooks for tutorials

Of these, the first two are done.

scikit-hep / awkward

Strategy for documentation #158
