holoviz-topics / examples

Visualization-focused examples of using HoloViz for specific topics
https://examples.holoviz.org
Creative Commons Attribution 4.0 International

Simplify the project build step to avoid errors #202

Closed: maximlt closed this issue 7 months ago

maximlt commented 2 years ago

Building an example for the site can prove difficult, even when the project environment is locked. That is because at the project build step some extra dependencies are installed in the project's default environment to run nbsite:

https://github.com/pyviz-topics/examples/blob/17a17be1a1b159095be55801202741e049a780e8/.github/workflows/build.yml#L49-L53

In a way, this defeats the purpose of locking the environment in the first place, since it changes the default environment locked by the project creator. However, I think there may be a way to avoid installing extra dependencies in the default environment. An evaluated project contains:

(screenshot: contents of an evaluated project)

As far as I understand it, the real purpose of running nbsite during the build step is mostly to generate the evaluated notebooks and the thumbnails; these are really the tricky bits (they are pushed to the evaluated branch during the project build step, and later fetched during the docs build step).

The obvious alternative for thumbnail generation is to generate the thumbnails manually, which I think would be an improvement anyway, since the automatically generated thumbnails don't always look nice, require extra dependencies (selenium) and can be a little heavy. If a thumbnail is not provided by the project creator, a default one could be added by a simple script, as sketched below.
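A minimal sketch of that idea, assuming thumbnails live in a per-project assets/thumbnails/ folder and a shared default image exists (both paths are hypothetical):

```python
# Hypothetical sketch: give every notebook a thumbnail, falling back to a
# shared default image when the project creator didn't provide one.
import pathlib
import shutil

DEFAULT_THUMBNAIL = pathlib.Path("doc/_static/default_thumbnail.png")  # assumed location


def ensure_thumbnails(project_dir: pathlib.Path) -> None:
    thumb_dir = project_dir / "assets" / "thumbnails"  # assumed layout
    thumb_dir.mkdir(parents=True, exist_ok=True)
    for nb in project_dir.glob("*.ipynb"):
        thumb = thumb_dir / f"{nb.stem}.png"
        if not thumb.exists():
            shutil.copy(DEFAULT_THUMBNAIL, thumb)
```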

As for evaluating the project notebooks, nbsite generates them using nbconvert, which happens to be a dependency of the notebook package.

The notebook package is a dependency of each project, so each project always has a version of nbconvert installed. nbconvert can execute and save a notebook with the command `jupyter nbconvert --execute --to notebook nb.ipynb`. nbsite uses nbconvert programmatically to execute a notebook; note however that it subclasses ExecutePreprocessor to customize a couple of things. If these changes are still required to execute the notebooks of examples, then examples should have a small evaluate_notebooks.py script that imports nbconvert to execute and export a notebook (it isn't that much code).
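A minimal sketch of what such a script could look like, assuming the stock ExecutePreprocessor is enough (i.e. nbsite's customizations turn out not to be needed); the file name and --directory flag match the build step sketched below:

```python
# Hypothetical scripts/evaluate_notebooks.py: execute every notebook of a
# project in place using plain nbconvert, no nbsite required.
import argparse
import pathlib

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor


def evaluate_notebook(path: pathlib.Path, timeout: int = 600) -> None:
    """Execute a notebook and save it back with its outputs."""
    nb = nbformat.read(str(path), as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name="python3")
    # Run from the notebook's own directory so relative data paths resolve.
    ep.preprocess(nb, {"metadata": {"path": str(path.parent)}})
    nbformat.write(nb, str(path))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--directory", type=pathlib.Path, required=True)
    args = parser.parse_args()
    for nb_path in sorted(args.directory.glob("*.ipynb")):
        evaluate_notebook(nb_path)
```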

As such, the build step could be turned into the following, and would no longer require installing any extra dependency to run the notebooks.

           anaconda-project prepare --directory $DIR 
           source activate $DIR/envs/default
           python scripts/evaluate_notebooks.py --directory $DIR
           source deactivate
           conda install nbsite
           doit build_project --name $DIR

Note the last doit build_project step that actually runs nbsite, except that this time the notebook has already been evaluated and saved in the /doc folder, so nbsite won't evaluate it again (reminder to self: I might need to deactivate the auto-execution of notebooks by myst-nb).
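For reference, deactivating that auto-execution could be a one-liner in the Sphinx conf.py (the option name depends on the myst-nb version):

```python
# Render already-evaluated notebooks as-is, without re-executing them.
nb_execution_mode = "off"  # older myst-nb releases use jupyter_execute_notebooks = "off"
```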

I wonder whether that last step could be simplified by not running nbsite at all in the project build step. I'm not sure to what extent the RST files need to be saved in the evaluated branch; they are generated automatically by nbsite when a gallery is defined, so they would also be created in the docs build step.

If this all works, the documentation should indicate that new projects must have nbconvert added to their dependencies (who knows whether it will always remain a dependency of notebook).

jlstevens commented 2 years ago

Thanks for outlining exactly the sort of (improved) approach I've had in mind for a while now! The question then is who will implement it :-)

jbednar commented 2 years ago

I love seeing a good plan like this, because then I can just imagine it's already been implemented and I can be happy. :-)

maximlt commented 2 years ago

> The question then is who will implement it :-)

The one who opened the issue! The first step in my mind, though, is to update nbsite to be able to use nbconvert >= 6; right now it's pinned to nbconvert < 6. In that process I will see whether subclassing ExecutePreprocessor is still required.

jlstevens commented 2 years ago

The main thing we needed there is the functionality of the SkipOutput and NotebookSlice preprocessors, though I think these can be applied to an evaluated notebook without needing an execution context.

I hope I am right in thinking that only thumbnailing needed to be run in the live environment...

maximlt commented 2 years ago

> I hope I am right in thinking that only thumbnailing needed to be run in the live environment...

I also hope so.

As for the preprocessors, they can be applied to the evaluated notebook.

jlstevens commented 2 years ago

If I remember right, there is also some preamble Python code to execute first (e.g. to run imports without displaying them). The same effect can be achieved by adding a fixed cell to all notebooks and slicing it out afterwards, along the lines of the sketch below.
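A rough sketch of that slicing step, assuming the preamble is a fixed cell marked by a known comment (the marker and the Preprocessor-based approach are illustrative, not nbsite's actual implementation); note that it operates on the saved notebook with no kernel involved:

```python
# Illustrative sketch: remove a fixed preamble cell from an
# already-evaluated notebook, without any execution context.
import nbformat
from nbconvert.preprocessors import Preprocessor

PREAMBLE_MARKER = "# examples-preamble"  # hypothetical marker comment


class StripPreamble(Preprocessor):
    """Drop cells whose source starts with the preamble marker."""

    def preprocess(self, nb, resources):
        nb.cells = [
            cell for cell in nb.cells
            if not cell.source.startswith(PREAMBLE_MARKER)
        ]
        return nb, resources


nb = nbformat.read("nb.ipynb", as_version=4)
nb, _ = StripPreamble().preprocess(nb, {})
nbformat.write(nb, "nb.ipynb")
```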

maximlt commented 2 years ago

Contributors can push evaluated notebooks directly to the evaluated branch if their notebook(s) require datasets so large that they would take too long to download during the build, or that there wouldn't be enough storage on the CI system to save them. This is well described in this section of the documentation.

There's currently no way for a project to declare that it has notebooks that should not be evaluated in the build process. There are two website building options offered to contributors, orphans and skip.

These two options are described in the documentation. In practice orphans doesn't appear to be used at all, and skip is used only once, in iex_trading:

https://github.com/pyviz-topics/examples/blob/10772f10bc5e8dfd7c3bd8a894d372135176b39a/iex_trading/anaconda-project.yml#L14-L15

(1) I'd suggest adding a new option called skip_evaluation_project_build (naming suggestions welcome!) that would list the notebooks that should not be evaluated as part of the project build. I specifically target the project build because the repo has a GH Action that runs the project tests on project changes, which apparently uses smaller datasets stored in the test_data folder instead of the full datasets. This is also described in the documentation.

(2) I'd also suggest that evaluated notebooks should be pushed by contributors to their PR branch rather than directly to the evaluated branch as explained in the documentation.

Enforcing (1) and (2) would make it a lot easier to update a notebook whose evaluation should be skipped. Currently the notebook either has to be re-executed locally, or changed (e.g. a typo fix) and then committed directly to evaluated. With (1) and (2), if an already-evaluated notebook is modified by a contributor, during a project build it would simply be copied as-is to the evaluated branch, e.g. along the lines of the sketch below.
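As a rough illustration, the hypothetical evaluate_notebooks.py sketched earlier could honor the proposed (name not final) skip_evaluation_project_build key like this:

```python
# Hypothetical continuation of the evaluate_notebooks.py sketch above:
# notebooks listed under the proposed skip_evaluation_project_build key are
# assumed to be already evaluated by the contributor and are copied as-is.
import pathlib
import shutil

import yaml


def notebooks_to_skip(project_dir: pathlib.Path) -> set:
    spec = yaml.safe_load((project_dir / "anaconda-project.yml").read_text()) or {}
    return set(spec.get("skip_evaluation_project_build", []))


def process_project(project_dir: pathlib.Path, out_dir: pathlib.Path) -> None:
    skipped = notebooks_to_skip(project_dir)
    for nb_path in sorted(project_dir.glob("*.ipynb")):
        if nb_path.name in skipped:
            shutil.copy(nb_path, out_dir / nb_path.name)  # copy unchanged
        else:
            evaluate_notebook(nb_path)  # from the earlier sketch
```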

maximlt commented 2 years ago

Another suggestion that came to me while having a look at all the anaconda projects: it's difficult to differentiate between the keys that are for anaconda-project itself and those that are used to build examples. I would rather have them all nested under a single key, e.g. examples_info or examples_config. I assume user_fields would still have to be set at the root level, even if it's not 100% clear to me how it's used and by which tool.

maximlt commented 2 years ago

I've gathered information about the data source of each project, to try to get a better picture of how that works and identify the notebooks whose evaluation should be skipped in the project build process.

The data source column contains one of (or a combination of): x (no external data needed), project (data downloaded through the project's download specification), notebook/intake (data fetched by the notebook through an intake catalog), notebook/repo (data read by the notebook from the repository), or notebook/download (data downloaded directly by the notebook).

has small test data indicates whether the project has an accompanying folder in test_data/. note is just additional information that I thought was worth recording, like the absence of lock files.

| project | data source | has small test data | note |
| --- | --- | --- | --- |
| attractors | x | | |
| bay_trimesh | project | yes | |
| boids | x | | |
| carbon_flux | notebook/intake | | |
| census | project | yes | |
| datashader_dashboard | notebook/intake | yes | very large data in there |
| euler | x | | |
| exoplanets | project | yes | no lock file |
| gapminders | notebook/download | | with pd.read_csv(external_url) |
| genetic_algorithm | x | | |
| gerrymandering | project | yes | |
| glaciers | notebook/repo | | |
| goldbach_comet | x | | |
| gull_tracking | project | yes | evaluated notebook committed on master |
| heat_and_trees | notebook/intake | yes | |
| hipster_dynamics | x | | |
| iex_trading | project | yes | |
| landsat | project | yes | |
| landsat_clustering | notebook/intake | | no lock file |
| landsat_classification | notebook/intake | | catalog not local, pulled from S3/earth-data |
| lsystems | x | | |
| ml_annotators | x | | |
| network_packets | project | yes | |
| nyc_buildings | project | yes | |
| nyc_taxi | project | yes | |
| opensky | project | yes | accompanying script to download the data (cron job that ran 4 days) |
| osm | indications | yes | Why not have the S3 download links in the project? |
| particle_swarms | x | | |
| penguin_crossfilter | notebook/repo | | |
| portfolio_optimizer | notebook/repo | | |
| seattle_lidar | notebook/intake | | no lock file |
| ship_traffic | project + notebook/repo | yes | |
| square_limit | x | | |
| sri_model | x | | |
| uk_researchers | project | yes | |
| voila_gpx_viewer | x | | |
| walker_lake | notebook/intake | yes | |

Now, this little detective work hasn't yet allowed me to determine exactly which notebooks should not be evaluated. The only obvious ones at this stage are those of the osm project, since the project doesn't provide any way to download the data automatically. I'm not sure it's the only one; I'll need to search a little more.

For information, among the files we store on S3/datashader-data, these are the ones larger than 1 GB: (screenshot: list of files larger than 1 GB)

jbednar commented 2 years ago

All of these proposals sound great to me; I like the direction this is going!

maximlt commented 2 years ago

A couple of notes after talking with Jean-Luc:

jlstevens commented 2 years ago

Yes, and I think storing evaluated notebooks on the master branch (only for the projects that need it) is not a bad idea either (using the skip_doc_evaluation key to have CI copy the notebook over to the evaluated branch).

maximlt commented 7 months ago

This is implemented.