sjdv1982 / seamless

Seamless is a framework to set up reproducible computations (and visualizations) that respond to changes in cells. Cells contain the input data as well as the source code of the computations, and all cells can be edited interactively.
http://sjdv1982.github.io/seamless

Reproducibility #84

Open sjdv1982 opened 3 years ago

sjdv1982 commented 3 years ago

In the long term, with help from experts, this could grow into a formal standard for reproducible computing. For now, it is more of a roadmap and a description of the current situation.

Scope: Transformations. In the long term, transformations in any language. For now, just Python/IPython, because Seamless internally translates all transformations into Python/IPython transformations that integrate the foreign code (using cffi for compiled languages, popen for bash, IPython magics for cython/R, etc.).

Mechanisms

  1. Data dependencies. Each celltype has a canonical serialization/deserialization, and the checksum (SHA3-256) on the canonical serialization is computed. Only the celltypes "cson" and "yaml" (and "python", see below) are different, as they have a distinct semantic checksum. CSON and YAML are first translated into JSON, and then their checksum is computed. This means that comments etc. can be added without retriggering computation. Status: solved.

  2. Transformer code without code dependencies. Transformer code is simply another data dependency of the transformation. The code is supposed to contain a block of statements that eventually assign a variable, whose name (normally result) and celltype are defined in the __output__ property of the transformation. The code is identified by its semantic checksum, which for (Python) code is computed on an AST buffer created with ast.parse and ast.dump (*). Status: solved for 0.7.

  3. Transformer code with module dependencies. The celltype of a dependency can also be "module". A module is a dict with the following properties:

    • type: "interpreted" or "compiled". Compiled modules are discussed elsewhere.
    • language: "ipython" or "python". Other languages are discussed elsewhere.
    • code: For simple modules, the (I)Python code. For packages, a dict of "filename":"python code" entries.
      Status: simple modules and packages work.
  4. Transformations with an environment. The environment can be specified as an image, as a conda environment, and/or as a set of capabilities. For a transformer to be executed, only one of the three needs to be matched. In other words, a transformation can be executed because of an image match OR a conda match OR a capability match. Status: solved for 0.7.
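
The checksum mechanisms in items 1 and 2 can be sketched as follows. This is only an illustration, not Seamless's actual implementation: the function names are hypothetical, the canonical JSON form (sorted keys, fixed indentation) is an assumption, and JSON text stands in for CSON/YAML input because parsing those requires third-party libraries.

```python
import ast
import hashlib
import json


def sha3_256_hex(buffer: bytes) -> str:
    """SHA3-256 checksum of a buffer, as a hex string."""
    return hashlib.sha3_256(buffer).hexdigest()


def semantic_checksum_data(text: str) -> str:
    """Checksum of the parsed value rather than of the text as typed.

    Assumption: the canonical form is JSON with sorted keys and 2-space
    indentation; the real canonical serialization may differ.
    """
    value = json.loads(text)
    canonical = json.dumps(value, sort_keys=True, indent=2) + "\n"
    return sha3_256_hex(canonical.encode("utf-8"))


def semantic_checksum_python(code: str, filename: str = "<transformer>") -> str:
    """Checksum of the AST dump, so comments and spacing are ignored.

    The filename only affects error messages, not the dump.
    """
    tree = ast.parse(code, filename=filename)
    return sha3_256_hex(ast.dump(tree).encode("utf-8"))


# Two spellings of the same transformer code.
code_a = "result = a + b"
code_b = "# add the two inputs\nresult = a  +  b\n"
same_code = semantic_checksum_python(code_a) == semantic_checksum_python(code_b)

# Two layouts of the same data.
data_a = '{"x": 1, "y": [1, 2]}'
data_b = '{\n    "x": 1,\n    "y": [1, 2]\n}'
same_data = semantic_checksum_data(data_a) == semantic_checksum_data(data_b)
```

Both same_code and same_data come out true here, which is the property that lets comments and layout change without retriggering computation. Note that the text of an AST dump can vary between Python versions, so checksums computed this way are only comparable within one Python version.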

Environment

The environment is a transformer property __env__ that can have the following properties: image, conda, capabilities, powers.
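
As a rough sketch of how an executor might decide whether it can run a transformation, the "image OR conda OR capabilities" rule can be combined with the requirement that every requested power is granted. The function name, the `matchers` argument and the treatment of an environment that specifies none of the three keys are assumptions made for illustration; the individual match tests are passed in as callables because they depend on the executing system.

```python
from typing import Callable, Iterable, Mapping

# The three properties of which a single match is sufficient.
MATCHABLE_KEYS = ("image", "conda", "capabilities")


def env_is_satisfied(
    env: Mapping[str, object],
    matchers: Mapping[str, Callable[[object], bool]],
    granted_powers: Iterable[str] = (),
) -> bool:
    """Return True if a transformation with this __env__ may be executed."""
    # Powers are mandatory: each requested power must have been granted.
    granted = set(granted_powers)
    if any(power not in granted for power in env.get("powers", [])):
        return False
    specified = [key for key in MATCHABLE_KEYS if key in env]
    if not specified:
        # Assumption: no image/conda/capabilities means no restriction.
        return True
    # One match out of the specified properties is enough.
    return any(matchers[key](env[key]) for key in specified)


# Hypothetical environment; the contents are invented for the example.
example_env = {
    "image": {"name": "example/image"},
    "conda": {"dependencies": ["numpy"]},
}
# Hypothetical host: the image is not available, the conda packages are.
matchers = {
    "image": lambda spec: False,
    "conda": lambda spec: True,
    "capabilities": lambda spec: False,
}
can_run = env_is_satisfied(example_env, matchers)
```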

Image

A dict that contains at least name, which is the name of an image. This is in principle a Docker image, although Singularity may be used to actually execute the transformation. version or checksum (but not both) may be added. In case of checksum, this is a Docker digest, not a Seamless checksum. Status: Solved for 0.7.
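
The constraints on the image dict can be written down as a small validator. This is an illustrative sketch; `validate_image` is a hypothetical helper, and the image name and digest are placeholders.

```python
def validate_image(image: dict) -> None:
    """Raise ValueError if the image dict breaks the rules described above."""
    name = image.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError('an image needs a "name"')
    if "version" in image and "checksum" in image:
        raise ValueError('specify "version" or "checksum", not both')


# Accepted: name only, name plus version, name plus Docker digest.
validate_image({"name": "example/image"})
validate_image({"name": "example/image", "version": "1.0"})
validate_image({"name": "example/image", "checksum": "sha256:" + "0" * 64})
```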

Conda

The same as what goes into an environment.yml file, i.e. a list of channels and a list of dependencies. The channels are optional. The dependencies may contain version specifications. There is no need for a name field. Seamless (or any other software that will execute the transformer) will interrogate Conda to check if the dependencies are installed. Seamless will refuse to install new packages, but other software may. Status: solved for 0.7.
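
A simplified sketch of the dependency check follows. It is not how Seamless interrogates Conda: the helper name is hypothetical, the set of installed package names is passed in (it could be collected from `conda list --json`, for example), and version specifications are deliberately not evaluated, only package names are compared.

```python
import re


def missing_conda_packages(conda_env: dict, installed: set) -> list:
    """Return the dependency specs whose package name is not installed.

    Simplification: version constraints are ignored.
    """
    missing = []
    for spec in conda_env.get("dependencies", []):
        if not isinstance(spec, str):
            continue  # e.g. a nested "pip:" section of an environment.yml
        # Drop an optional "channel::" prefix, then cut at the first
        # version operator or whitespace to get the bare package name.
        bare = spec.strip().split("::")[-1]
        name = re.split(r"[\s=<>!~]", bare, maxsplit=1)[0]
        if name.lower() not in installed:
            missing.append(spec)
    return missing


# Hypothetical conda section of __env__ and hypothetical installed packages.
conda_env = {
    "channels": ["conda-forge"],
    "dependencies": ["numpy>=1.19", "scipy", "pandas=1.3"],
}
missing = missing_conda_packages(conda_env, installed={"numpy", "scipy"})
```

An executor that refuses to install packages would treat a non-empty result as "no conda match".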

Capabilities

Each Seamless instance may have a list of abstract capabilities registered. It can execute transformations that require (a subset of) those capabilities, and no others. Capabilities can be major or minor. Major capabilities are analogous to images: to satisfy major capability [A, B], you would normally have to create a merged Docker image of A and B. Minor capabilities are analogous to packages, but more abstract, as they are not necessarily limited to conda packages. With each release of Seamless, a concrete meaning of each capability is defined. Therefore, individual capabilities do not have a version number, but all capabilities together refer to a Seamless release number. Status: solved for 0.7.
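
The "subset, and no others" rule amounts to a set inclusion test. A minimal sketch, with a hypothetical function name and invented capability names, and without modelling the major/minor distinction:

```python
def capabilities_match(required, registered) -> bool:
    """True if every required capability is registered on this instance."""
    return set(required) <= set(registered)


# Invented capability names, for illustration only.
registered = ["cap-a", "cap-b", "cap-c"]
ok = capabilities_match(["cap-a", "cap-c"], registered)
not_ok = capabilities_match(["cap-a", "cap-d"], registered)
```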

Powers

A transformation that requests a power must be granted that power. There is no other way to execute the code in the transformation, but the checksum can of course be substituted with equivalent code that does not require it. For now, the following powers are (planned to be) supported:

(*) = A filename is provided to ast.parse, but this does not change the AST dump, only the error messages.

sjdv1982 commented 3 years ago

Mostly done. To be done for 0.7:

TODO: adapt the document to mention "which"

sjdv1982 commented 3 years ago

Bash and docker transformers have now been unified. In the document, remove the version for modules; it may not be a good idea.

sjdv1982 commented 3 years ago

"image" is now renamed to "docker"