Open tovrstra opened 3 months ago
@leila-pujal @PaulWAyers @FarnazH P.S. I can take care of this if that would be helpful. (Adding nb-clean should be discussed first, and can be done in a later PR.)
Thanks @tovrstra for opening the issue. Just to give some background on this: @Ali-Tehrani worked on this in the past in PR #151. When working on the Libcint interface (#174), we decided to include the pytest part of #151, but that a new PR should be opened to address all the fixes. That has been on my to-do list but fell through the cracks. I can give it a first try, @tovrstra, so you don't have to deal with all the changes at first.
Sounds good. Feel free to ask for a review.
The only issue I see with using nb-clean is that we are, in some of the later packages, using notebooks (together with their output) as documentation/tutorials. (Essentially we are replacing what we used to call "scientific documentation" in packages like ChemTools with Jupyter notebooks.) In those contexts, not including cell outputs is problematic.
Note that some of the newer packages use the JupyterBook template for building the web site, which has a lot of perks but is probably incompatible with nb-clean.
I think nb-clean is more useful than it may seem at first sight. Last year, I contributed to it to make it configurable in what it does and does not clean. We've been using it for cleaning course notebooks in a Git repo, where some metadata was relevant to us, but we did want to keep the size of the repo and the diffs manageable. Earlier versions of nb-clean were a bit too tidy. This is the relevant issue, originally opened for compatibility with Jupyter Book: https://github.com/srstevenson/nb-clean/issues/98
For Jupyter Book, one does not have to keep output cells. It can recompute them when the notebooks get published as GitHub Pages, and it can even cache such results to avoid recomputing the same outputs. See https://jupyterbook.org/en/stable/content/execute.html. In general, this is far less error-prone than manually keeping the outputs up to date with code changes (in the notebooks or in libraries).
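For reference, the caching behavior mentioned above is enabled in Jupyter Book's `_config.yml` (see the linked execution docs for the other modes, such as `force` and `off`):

```yaml
# _config.yml
execute:
  # Re-execute notebooks at build time, caching results so that
  # unchanged notebooks are not recomputed on every build.
  execute_notebooks: cache
```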
There are some cases where one may want to keep the cell outputs in a Git repository, but I'd rather try to avoid being in such situations:
When one would like to have outputs in the notebooks while browsing the Git repo, as a form of documentation. This can be avoided by building the outputs with jupyterbook (or nbconvert) in a GitHub action and hosting the results outside the repo.
When the notebooks are computationally very demanding, making it impractical to compute them in a GitHub action. That said, demanding notebooks go against the typical interactive use case of notebooks, so I would also avoid this. (I've worked with such notebooks by submitting them as jobs on a cluster, but that was not very convenient.)
Another scenario is to use them as a form of regression testing. One can build an action that checks if the outputs in the cells have changed and require developers to approve such changes, or fix things if the changed outputs were not expected. Setting this up may be a bit tricky, but could be useful. Even in this case, I would look for an option to host the notebooks with outputs outside the Git repo. (I'm not aware of any practical ways of doing this at the moment.)
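The regression-testing idea above could be sketched as a small script that compares the stored cell outputs between two revisions of a notebook. This is a minimal sketch using only the standard library (a `.ipynb` file is just JSON); `outputs_changed` is a hypothetical helper name, not part of any existing tool:

```python
import json


def cell_outputs(path):
    """Extract the stored outputs of each code cell from a notebook file."""
    with open(path) as fh:
        nb = json.load(fh)
    return [
        cell.get("outputs", [])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    ]


def outputs_changed(old_path, new_path):
    """Return True if any code cell's stored outputs differ between two files."""
    return cell_outputs(old_path) != cell_outputs(new_path)
```

A CI job could re-execute the committed notebooks, run such a check against the originals, and require developer approval whenever it reports a change.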
Thanks for the detailed explanation. With this sort of workflow, I don't have a problem: it's OK to (re)compute output associated with building the web page. Certainly there are cases (more and more, I dare say, as there are a lot of people using things like PySCF and PyCI inside Jupyter notebooks, and I know our HPC supports submitting/queuing Jupyter notebooks) where the notebook itself is computationally demanding, but caching would circumvent this problem.
I imagine that for cases where we might have computationally intensive Jupyter notebooks, we could pull them directly from an external repository, similar to the way we host (large) data files outside the AtomDB repository. It's something @marco-2023 could comment on. For gbasis, this does not seem to be much of an issue. (For PyCI, it might be.)
The following may also be useful for authoring notebooks in Git repositories: https://jupytext.readthedocs.io/
It is a way of writing notebooks as Python or Markdown files, which can be easily converted into notebooks when needed. This way, you can avoid the problem of incomprehensible diffs when changing notebooks in a Git repository. (Notebooks are essentially JSON files.)
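To illustrate why the plain-text form diffs so much better, here is a toy sketch of the kind of conversion involved: it renders a notebook's cells in a `py:percent`-style script using only the standard library. This is not Jupytext's API, just a hypothetical `notebook_to_percent` helper showing the idea:

```python
import json


def notebook_to_percent(path):
    """Render a notebook's cells as a py:percent-style script (toy sketch)."""
    with open(path) as fh:
        nb = json.load(fh)
    chunks = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            # Markdown cells become commented blocks under a markdown marker.
            body = "\n".join("# " + line for line in source.splitlines())
            chunks.append("# %% [markdown]\n" + body)
        else:
            # Code cells appear verbatim under a plain cell marker.
            chunks.append("# %%\n" + source)
    return "\n\n".join(chunks) + "\n"
```

Because each cell boundary is a simple comment marker and the outputs are gone entirely, `git diff` on the script form shows only the lines you actually edited.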
Do you have any experience with Jupytext? I'm not sure yet how good it is in practice.
I looked at Jupytext but decided that Quarto was better. I'm using Quarto a good bit now, even for presentations. It actually provides a complete authoring system (with journal templates!), but I haven't used it for that yet, as I think it's a little inferior for collaboration (especially collaboration with less tech-savvy people) compared to Overleaf.
I'm thinking about transferring all my course materials from JupyterBook over to Quarto, but it's more work than I want to do right now.
Is your feature request related to a problem? Please describe.
There is currently a pre-commit configuration, but when running pre-commit run --all, one still gets a lot of errors, making it impractical at this stage. It would be good to create a PR in which all these errors are cleaned up, after which we can enable pre-commit.ci to run the checks on every PR.
There are probably more useful pre-commit linters to be included. I would recommend including nb-clean. It removes all cell outputs and other irrelevant details. These should normally never be included in a Git repository: they take up a lot of space, cause diff noise, and can be trivially recomputed.
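If nb-clean is adopted, the hook could be added to `.pre-commit-config.yaml` along these lines (a sketch; check the nb-clean README for the current hook id and pin `rev` to a real release tag, the one below is a placeholder):

```yaml
repos:
  - repo: https://github.com/srstevenson/nb-clean
    rev: 3.2.0  # placeholder; pin to an actual nb-clean release
    hooks:
      - id: nb-clean
```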
Is there a specific integral/formula that you would like implemented?
No, just CI stuff.
Is there a change to the code or algorithm you would like to see?
No, just CI stuff.
Additional context
See #187