holoviz-topics / examples

Visualization-focused examples of using HoloViz for specific topics
https://examples.holoviz.org
Creative Commons Attribution 4.0 International

Simplify the project build step to avoid errors #202

Closed: maximlt closed this issue 7 months ago

maximlt commented 2 years ago

Building an example for the site can prove difficult, even when the project environment is locked. That is because at the project build step some extra dependencies are installed in the project's default environment to run nbsite:

https://github.com/pyviz-topics/examples/blob/17a17be1a1b159095be55801202741e049a780e8/.github/workflows/build.yml#L49-L53

In a way, this defeats the purpose of locking the environment in the first place, since it changes the default environment locked by the project creator. However, I think there may be a way to avoid installing extra dependencies in the default environment. An evaluated project contains:

(screenshot: contents of an evaluated project)

As far as I understand it, the real purpose of running nbsite during the build step is mostly to generate the evaluated notebooks and the thumbnails; these are really the tricky bits (they are pushed to the evaluated branch during the project build step, and later fetched during the docs build step).

The obvious alternative for thumbnail generation is to generate the thumbnails manually, which I think would be an improvement anyway, since the automatically generated thumbnails don't always look nice, require extra dependencies (selenium) and can be a little heavy. If a thumbnail is not provided by the project creator, a default one could be added by a simple script, as sketched below.
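A minimal sketch of that idea, assuming thumbnails live in a per-project assets/thumbnails/ folder and a shared default image exists (both paths are hypothetical):

```python
# Hypothetical sketch: give every notebook a thumbnail, falling back to a
# shared default image when the project creator didn't provide one.
import pathlib
import shutil

DEFAULT_THUMBNAIL = pathlib.Path("doc/_static/default_thumbnail.png")  # assumed location


def ensure_thumbnails(project_dir: pathlib.Path) -> None:
    thumb_dir = project_dir / "assets" / "thumbnails"  # assumed layout
    thumb_dir.mkdir(parents=True, exist_ok=True)
    for nb in project_dir.glob("*.ipynb"):
        thumb = thumb_dir / f"{nb.stem}.png"
        if not thumb.exists():
            shutil.copy(DEFAULT_THUMBNAIL, thumb)
```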

As for evaluating the project notebooks, nbsite generates them using nbconvert, which happens to be a dependency of the notebook package.

The notebook package is a dependency of each project, so each project always has a version of nbconvert installed. nbconvert can execute and save a notebook with the command `jupyter nbconvert --execute --to notebook nb.ipynb`. nbsite uses nbconvert programmatically to execute a notebook; note however that it subclasses ExecutePreprocessor to customize a couple of things. If these changes are still required to execute the notebooks of examples, then examples should have a small evaluate_notebooks.py script that imports nbconvert to execute and export a notebook (it isn't that much code).
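A minimal sketch of what such a script could look like, assuming the stock ExecutePreprocessor is enough (i.e. nbsite's customizations turn out not to be needed); the file name and --directory flag match the build step sketched below:

```python
# Hypothetical scripts/evaluate_notebooks.py: execute every notebook of a
# project in place using plain nbconvert, no nbsite required.
import argparse
import pathlib

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor


def evaluate_notebook(path: pathlib.Path, timeout: int = 600) -> None:
    """Execute a notebook and save it back with its outputs."""
    nb = nbformat.read(str(path), as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name="python3")
    # Run from the notebook's own directory so relative data paths resolve.
    ep.preprocess(nb, {"metadata": {"path": str(path.parent)}})
    nbformat.write(nb, str(path))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--directory", type=pathlib.Path, required=True)
    args = parser.parse_args()
    for nb_path in sorted(args.directory.glob("*.ipynb")):
        evaluate_notebook(nb_path)
```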

As such, the build step could be turned into the following, and would no longer require installing any extra dependency to run the notebooks.

           anaconda-project prepare --directory $DIR 
           source activate $DIR/envs/default
           python scripts/evaluate_notebooks.py --directory $DIR
           source deactivate
           conda install nbsite
           doit build_project --name $DIR

Note the last doit build_project step that actually runs nbsite, except that this time the notebook has already been evaluated and saved in the /doc folder, so nbsite won't evaluate it again (reminder to self: I might need to deactivate the auto-execution of notebooks by myst-nb).
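For reference, deactivating that auto-execution could be a one-liner in the Sphinx conf.py (the option name depends on the myst-nb version):

```python
# Render already-evaluated notebooks as-is, without re-executing them.
nb_execution_mode = "off"  # older myst-nb releases use jupyter_execute_notebooks = "off"
```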

I wonder whether that last step could be simplified by not running nbsite at all in the project build step. I'm not sure to what extent the RST files need to be saved in the evaluated branch; they are generated automatically by nbsite when a gallery is defined, so they would also be created in the docs build step.

If this all works, the documentation should indicate that new projects must have nbconvert added to their dependencies (who knows whether it will always remain a dependency of notebook).

jlstevens commented 2 years ago

Thanks for outlining exactly the sort of (improved) approach I've had in mind for a while now! The question then is who will implement it :-)

jbednar commented 2 years ago

I love seeing a good plan like this, because then I can just imagine it's already been implemented and I can be happy. :-)

maximlt commented 2 years ago

> The question then is who will implement it :-)

The one who opened the issue! The first step in my mind, though, is to update nbsite to be able to use nbconvert >= 6; right now it's pinned to nbconvert < 6. In that process I will see whether subclassing ExecutePreprocessor is still required.

jlstevens commented 2 years ago

The main thing we needed there is the functionality of the SkipOutput and NotebookSlice preprocessors, though I think these can be applied to an evaluated notebook without needing an execution context.

I hope I am right in thinking that only thumbnailing needed to be run in the live environment...

maximlt commented 2 years ago

> I hope I am right in thinking that only thumbnailing needed to be run in the live environment...

I also hope so.

As for the preprocessors, they can be applied to the evaluated notebook.

jlstevens commented 2 years ago

If I remember right, there is also some preamble Python code to execute first (e.g. to run imports without displaying them). The same effect can be achieved by adding a fixed cell to all notebooks and slicing it out afterwards, along the lines of the sketch below.
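A rough sketch of that slicing step, assuming the preamble is a fixed cell marked by a known comment (the marker and the Preprocessor-based approach are illustrative, not nbsite's actual implementation); note that it operates on the saved notebook with no kernel involved:

```python
# Illustrative sketch: remove a fixed preamble cell from an
# already-evaluated notebook, without any execution context.
import nbformat
from nbconvert.preprocessors import Preprocessor

PREAMBLE_MARKER = "# examples-preamble"  # hypothetical marker comment


class StripPreamble(Preprocessor):
    """Drop cells whose source starts with the preamble marker."""

    def preprocess(self, nb, resources):
        nb.cells = [
            cell for cell in nb.cells
            if not cell.source.startswith(PREAMBLE_MARKER)
        ]
        return nb, resources


nb = nbformat.read("nb.ipynb", as_version=4)
nb, _ = StripPreamble().preprocess(nb, {})
nbformat.write(nb, "nb.ipynb")
```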

maximlt commented 2 years ago

Contributors can push evaluated notebooks directly to the evaluated branch if their notebook(s) require datasets so large that they would take too long to download during the build, or that there wouldn't be enough storage on the CI system to save them. This is well described in this section of the documentation.

There's currently no way for a project to declare that it has notebooks that should not be evaluated in the build process. There are two website building options offered to contributors, orphans and skip.

These two options are described in the documentation. In practice orphans doesn't appear to be used at all, and skip is used only once, in iex_trading:

https://github.com/pyviz-topics/examples/blob/10772f10bc5e8dfd7c3bd8a894d372135176b39a/iex_trading/anaconda-project.yml#L14-L15

(1) I'd suggest adding a new option called skip_evaluation_project_build (naming suggestions welcome!) that would list the notebooks that should not be evaluated as part of the project build. I specifically target the project build because the repo has a GH Action that runs the project tests on project changes, which apparently uses smaller datasets stored in the test_data folder instead of the full datasets. This is also described in the documentation.

(2) I'd also suggest that evaluated notebooks should be pushed by contributors to their PR branch rather than directly to the evaluated branch as explained in the documentation.

Enforcing (1) and (2) would make it a lot easier to update a notebook whose evaluation should be skipped. Currently the notebook either has to be re-executed locally, or changed (e.g. a typo fix) and then committed directly to evaluated. With (1) and (2), if an already-evaluated notebook is modified by a contributor, during a project build it would simply be copied as-is to the evaluated branch, e.g. along the lines of the sketch below.
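As a rough illustration, the hypothetical evaluate_notebooks.py sketched earlier could honor the proposed (name not final) skip_evaluation_project_build key like this:

```python
# Hypothetical continuation of the evaluate_notebooks.py sketch above:
# notebooks listed under the proposed skip_evaluation_project_build key are
# assumed to be already evaluated by the contributor and are copied as-is.
import pathlib
import shutil

import yaml


def notebooks_to_skip(project_dir: pathlib.Path) -> set:
    spec = yaml.safe_load((project_dir / "anaconda-project.yml").read_text()) or {}
    return set(spec.get("skip_evaluation_project_build", []))


def process_project(project_dir: pathlib.Path, out_dir: pathlib.Path) -> None:
    skipped = notebooks_to_skip(project_dir)
    for nb_path in sorted(project_dir.glob("*.ipynb")):
        if nb_path.name in skipped:
            shutil.copy(nb_path, out_dir / nb_path.name)  # copy unchanged
        else:
            evaluate_notebook(nb_path)  # from the earlier sketch
```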

maximlt commented 2 years ago

Another suggestion that came to me while having a look at all the anaconda projects: it's difficult to differentiate between the keys that are for anaconda-project itself and those that are used to build examples. I would rather have them all nested under a single key, e.g. examples_info or examples_config. I assume user_fields would still have to be set at the root level, even if it's not 100% clear to me how it's used and by which tool.

maximlt commented 2 years ago

I've gathered information about the data source of each project, to try to get a better picture of how that works and identify the notebooks whose evaluation should be skipped in the project build process.

The data source column contains one of (or a combination of): x (no external data needed), project (data downloaded through the project's download specification), notebook/intake (data fetched by the notebook through an intake catalog), notebook/repo (data read by the notebook from the repository), or notebook/download (data downloaded directly by the notebook).

has small test data indicates whether the project has an accompanying folder in test_data/. note is just additional information that I thought was worth recording, like the absence of lock files.

| project | data source | has small test data | note |
| --- | --- | --- | --- |
| attractors | x | | |
| bay_trimesh | project | yes | |
| boids | x | | |
| carbon_flux | notebook/intake | | |
| census | project | yes | |
| datashader_dashboard | notebook/intake | yes | very large data in there |
| euler | x | | |
| exoplanets | project | yes | no lock file |
| gapminders | notebook/download | | with pd.read_csv(external_url) |
| genetic_algorithm | x | | |
| gerrymandering | project | yes | |
| glaciers | notebook/repo | | |
| goldbach_comet | x | | |
| gull_tracking | project | yes | evaluated notebook committed on master |
| heat_and_trees | notebook/intake | yes | |
| hipster_dynamics | x | | |
| iex_trading | project | yes | |
| landsat | project | yes | |
| landsat_clustering | notebook/intake | | no lock file |
| landsat_classification | notebook/intake | | catalog not local, pulled from S3/earth-data |
| lsystems | x | | |
| ml_annotators | x | | |
| network_packets | project | yes | |
| nyc_buildings | project | yes | |
| nyc_taxi | project | yes | |
| opensky | project | yes | accompanying script to download the data (cron job that ran 4 days) |
| osm | indications | yes | Why not have the S3 download links in the project? |
| particle_swarms | x | | |
| penguin_crossfilter | notebook/repo | | |
| portfolio_optimizer | notebook/repo | | |
| seattle_lidar | notebook/intake | | no lock file |
| ship_traffic | project + notebook/repo | yes | |
| square_limit | x | | |
| sri_model | x | | |
| uk_researchers | project | yes | |
| voila_gpx_viewer | x | | |
| walker_lake | notebook/intake | yes | |

Now, this little detective work hasn't yet allowed me to determine exactly which notebooks should not be evaluated. The only obvious ones at this stage are those of the osm project, since the project doesn't provide any way to download the data automatically. I'm not sure it's the only one; I'll need to search a little more.

For information, among the files we store on S3/datashader-data, these are the ones larger than 1 GB: (screenshot: list of files larger than 1 GB)

jbednar commented 2 years ago

All of these proposals sound great to me; I like the direction this is going!

maximlt commented 2 years ago

A couple of notes after talking with Jean-Luc:

jlstevens commented 2 years ago

Yes, and I think storing evaluated notebooks on the master branch (only for the projects that need it) is not a bad idea either (using the skip_doc_evaluation key to have CI copy the notebook over to the evaluated branch).

maximlt commented 7 months ago

This is implemented.