Pipelines and interaction with build tools

apdavison commented 9 years ago

Rob Beagrie reported in a comment thread on the Software Carpentry blog (http://software-carpentry.org/2012/10/wanted-an-entry-level-provenance-library/comment-page-1/#comment-45629):

"My workflow is a pipeline of lots of different scripts, which are joined together using Make, and I couldn’t figure out a way to run Make using Sumatra"

This use case needs some thought: ideally we would like to capture both the overall computation (executable: make) and all the intermediate steps.

Imported from Bitbucket issue 103, created by apdavison on 10-08-2012 at 17:29, last modified: 04-04-2013 at 19:08

Psirus commented 6 years ago

I would like to help with #358, which led me here.

Now, I'm not sure what a solution would look like. You can run make with Sumatra, but you will only get one record. How would you turn each "pipeline" or command into a record? Also, this should certainly not be specific to makefiles; one could just as well use a shell script or Python. And then you would have to detect whether the Python script is a number of pipelines or an ordinary simulation.

The only thing I can think of is to introduce more subcommands/options, so that you can tell Sumatra which parts should be seen as one record, and on which previous steps it depends.

apdavison commented 6 years ago

There are two aspects to consider: (1) representation of pipelines in the Sumatra database ("record store") (2) discovery/capture of steps in a pipeline.

For (1), the current schema does contain enough information to reconstruct pipelines, based on shared inputs/outputs (i.e. where the output from one record is the input to another). What is needed are tools to visualize such pipelines (I think there is an issue relating to this, but I don't have time to search right now). We should also think about having a more explicit representation of pipelines, either by making records hierarchical, or by adding a pipeline/workflow class to the schema.
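To make the "shared inputs/outputs" idea concrete, here is a minimal sketch of how records could be linked into a pipeline. It uses plain dicts rather than Sumatra's record classes, and the field names (`label`, `inputs`, `outputs`), the function `link_records` and the file names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def link_records(records):
    """Map each record label to the labels of records that consume its outputs."""
    # Index every output path by the record that produced it.
    producers = {}
    for rec in records:
        for path in rec["outputs"]:
            producers[path] = rec["label"]

    # A record depends on whichever record produced one of its inputs.
    edges = defaultdict(set)
    for rec in records:
        for path in rec["inputs"]:
            upstream = producers.get(path)
            if upstream is not None and upstream != rec["label"]:
                edges[upstream].add(rec["label"])
    return {label: sorted(children) for label, children in edges.items()}


# Hypothetical records standing in for entries in the record store.
records = [
    {"label": "clean", "inputs": ["raw.csv"], "outputs": ["clean.csv"]},
    {"label": "fit", "inputs": ["clean.csv"], "outputs": ["model.pkl"]},
    {"label": "plot", "inputs": ["clean.csv", "model.pkl"], "outputs": ["fig.png"]},
]
pipeline = link_records(records)
print(pipeline)  # {'clean': ['fit', 'plot'], 'fit': ['plot']}
```

Matching on paths alone would confuse two runs that write the same file name, so a real implementation would presumably also compare content digests.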

Aspect (2) is the most challenging. A fully automated solution might be possible, using strace or related tools (e.g. CDE, ReproZip) to capture each OS process separately. A more manual approach would be easier, i.e. using smt commands within makefiles or shell scripts, or the Sumatra API within Python top-level scripts, to capture the individual steps. The challenge there is to detect that Sumatra is being used within an existing Sumatra process.

Psirus commented 6 years ago

Regarding (1), I can't find the issue you're referring to. Anyway, visualizing pipelines probably needs two representations, on the command line and in the web interface. In the CLI, I could imagine something like git log --graph. On the web interface, I don't have much previous experience, so if you have any suggestions I'd be happy to hear them. Should I open a separate issue?
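As a rough idea of what a git log --graph-style view could look like on the command line, the sketch below prints a pipeline as an indented tree. The adjacency dict, the record labels and the function `render_pipeline` are hypothetical and not part of Sumatra; the sketch also assumes the pipeline has no cycles.

```python
def render_pipeline(edges, root, depth=0):
    """Return indented text lines for `root` and everything downstream of it."""
    lines = ["  " * depth + "* " + root]
    for child in edges.get(root, []):
        lines.extend(render_pipeline(edges, child, depth + 1))
    return lines


# Hypothetical pipeline: "clean" feeds "fit" and "plot", and "fit" feeds "plot".
edges = {"clean": ["fit", "plot"], "fit": ["plot"]}
text = "\n".join(render_pipeline(edges, "clean"))
print(text)
# * clean
#   * fit
#     * plot
#   * plot
```

A record with several upstream records (here "plot") appears once per parent, which is the simplest choice but not necessarily the most readable one for large pipelines.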

As for hierarchical records vs. pipeline class, the latter seems more intuitive to me, but I'm not familiar enough with the Sumatra codebase to make this decision.

For the second part I would prefer something more manual. While reproducibility would probably be better with an automated solution, you often end up "drowning" in data. I'd much rather find out quickly which version of a linear algebra library was used, for example, than wade through mountains of data with every core OS utility version and stack trace.

As for checking when Sumatra is used within Sumatra, one could use psutil:

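import psutil

# Walk up the process tree and print the name of every ancestor process.
# parent() returns None once the top of the tree is reached, so this also
# works on systems where the chain does not end at PID 1.
process = psutil.Process().parent()
while process is not None:
    print(process.name())
    process = process.parent()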

would print for example:

zsh
termite
sh
xmonad-x86_64-linux
xfce4-session
sh
lightdm
lightdm
systemd

If smt is in this list, then you would be in a child Sumatra process.
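The check could be wrapped up roughly as follows. This is only a sketch: the helper names `ancestor_names` and `inside_smt` are made up, `Process.parents()` needs psutil 5.6.0 or later, and it assumes the parent shows up under the process name smt, which may not hold if Sumatra was started as, say, python -m.

```python
def ancestor_names():
    """Return the names of all ancestor processes, nearest parent first."""
    import psutil  # third-party; imported here so the check below works without it

    return [proc.name() for proc in psutil.Process().parents()]


def inside_smt(names, marker="smt"):
    """Return True if any process name in `names` equals `marker`."""
    return marker in names


# Example with the list printed above: no smt ancestor, so not nested.
example = ["zsh", "termite", "sh", "xmonad-x86_64-linux", "xfce4-session",
           "sh", "lightdm", "lightdm", "systemd"]
print(inside_smt(example))  # False
```

In real use this would be called as inside_smt(ancestor_names()).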

open-research / sumatra

Pipelines and interaction with build tools #107