jupyter-book / mystmd

Command line tools for working with MyST Markdown.
https://mystmd.org/guide
MIT License

Embed execution environment metadata in MyST documents #1265

Open choldgraf opened 5 months ago

choldgraf commented 5 months ago

Description

Many MyST documents represent analyses and computations that require a particular environment in order to run. When MyST is used as an entrypoint into computational environments, it needs to know what software must be present for the analysis to execute successfully.

MyST already has a mechanism for defining metadata for each page in a project. If pages have different software environment requirements, allowing a user to define these environments could allow projects like Binder, JupyterHub, or Thebe to build and load the proper environment on the fly.

This could be a way to increase the reproducibility and provenance of documents built with MyST, and to make it easier to share MyST content as self-contained text files.

An example from Python

Recently there have been some efforts at embedding metadata within Python scripts. The inline script metadata specification is now a formal way to embed metadata in scripts. This is supported by Hatch to auto-build environments.

A simple example in MyST

A simple example of what this might look like:

---
env:
  type: (specifies the installation command, dependent on the language of the kernel as defined in a `jupyter` block)
  items:
    - item1
    - item2
    - Each is installed by the method implied by `type:`
---
# My MyST title
...
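
To make the idea more concrete, here is a minimal sketch of how a consumer might turn such an `env` block into an install command, assuming the frontmatter has already been parsed into a dictionary. Everything in it is hypothetical: the `pip` and `conda` values for `type`, the `INSTALLERS` mapping, and the `install_command` helper are not part of MyST or of any proposal in this thread.

```python
import shlex

# Hypothetical mapping from an `env.type` value to an installer command prefix.
# These keys are illustrative assumptions, not part of the MyST frontmatter spec.
INSTALLERS = {
    "pip": ["python", "-m", "pip", "install"],
    "conda": ["conda", "install", "--yes"],
}


def install_command(env: dict) -> list[str]:
    """Build the argument list that would install every entry in `items`."""
    env_type = env["type"]
    if env_type not in INSTALLERS:
        raise ValueError(f"Unsupported env type: {env_type!r}")
    return INSTALLERS[env_type] + list(env.get("items", []))


if __name__ == "__main__":
    # Stand-in for the `env` block of a page's frontmatter, already parsed.
    frontmatter_env = {"type": "pip", "items": ["numpy", "matplotlib"]}
    print(shlex.join(install_command(frontmatter_env)))
```

Returning an argument list rather than a shell string keeps package names from being interpreted by a shell if the command is later handed to `subprocess.run`.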

agoose77 commented 5 months ago

I like the goal of this issue. When thinking about execution, I think both the traditional 'environment' (kernels, etc) and data form the context of execution. So, my feeling is that to truly capture the environment, we should point to the REES context of the environment.

In the scikit-hep ecosystem, there's a repo-review tool that helps users to build packages that follow best practice. We can ensure myst does something similar to encourage REES, which will help us in future to leverage e.g. Repo2docker in myst execution.

lwasser commented 5 months ago

Just a note that one of the other challenges here is to also consider conda users. We have had a lot of (lively) discussions around this in our community! Hatch is super awesome, and @ofek (Hatch dev/maintainer) is the person who actually showed me how this can work for a user. It's fantastic. There is also the somewhat unsupported conda-execute; @ocefpaf could speak more to this.

I think from our pyOpenSci perspective we'd like to see an easy way for users and scientists to share code and workflows that makes the environment piece of things (a major pain point) simpler. Having a document that executes code is great, but we also know that the environment around it can be challenging. The `/// script` PEP is definitely moving in a better direction, but I don't think people know about it yet.

I plan to show folks the Hatch implementation, but I know we need a conda solution too. I'm also aware, Chris, that environments are not necessarily in the scope of the MyST spec, BUT it's really important to consider how users will share the entire workflow without needing to be a Docker expert!

ocefpaf commented 5 months ago

This is a great idea and kind of a holy grail that some of us have long been searching for. Note that the conda-execute mentioned above is a dead project, but there are new efforts with tools like pixi.

I guess that the main challenge, for my use case of course, is to extend that to non-Python dependencies. I may be wrong, but I believe that Hatch's script support covers Python packages only, right?

I have some crazy notebooks that mix R and Python, call C++ tools from the CLI, etc. I'd love for this declarable env to support conda packages so I could do those crazy notebooks in a more reproducible way without the need to share an extra file (conda-lock, env yaml, etc.).

ofek commented 5 months ago

Hello everyone! Your use case is one of many reasons why I pushed hard to support arbitrary tooling data and fortunately it was accepted after some pushback.

This is not implemented yet but should be easy. Hatch environments have a type, which can refer to other environment plugins, so what I can do is add a new method to the environment interface that indicates script execution; the Conda plugin, for example, could then do what you want. Technically this is already possible if the author wants it, but I think the indicator for whether the user is trying to run a script is not obvious or documented.

This is possible 😄

The other use case I had in mind was storing the contents of a lock file there, which is why I included a section in the PEP on how to write data to the block.

stevejpurves commented 5 months ago

Just to note that some functionality overlapping with this is already in place, at the project level at any rate.

See https://mystmd.org/guide/frontmatter#available-frontmatter-fields, where `requirements` and `resources` are used to specify project-level environments: `requirements` are, by convention, any REES configuration files, and `resources` are any other files that should be included in a reproducible (MECA) bundle. This was intended to cover more than Python-only dependencies, and it overlaps/aligns exactly with how R/Quarto specifies the same.