hammerlab / ketrew

Keep Track of Experimental Workflows
http://www.hammerlab.org/docs/ketrew/master/index.html
Apache License 2.0

Make it easier to pass information between workflow nodes #284

Open ihodes opened 8 years ago

ihodes commented 8 years ago

For example, an ID or data returned from a web service needs to be passed to subsequent nodes for further processing. Right now the sanest way to do this is to create a file with a known name and location and pass it around; that's a lot of moving parts for something that could be made simple.

I propose something like a Pipe module that creates named, persistent pipes between workflow nodes. This could be implemented as a named file, and would be even more easily implemented using Biokepi's newer API that uses products; a Pipe would have a product method, as well as a read/write interface.
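A rough sketch of what the interface might look like (every name here is hypothetical; none of this exists in Ketrew or Biokepi today):

module type PIPE = sig
  type t
  (* A named, persistent channel between two workflow nodes,
     e.g. backed by a file at a generated path. *)

  val create : ?name:string -> host:KEDSL.Host.t -> unit -> t

  val product : t -> KEDSL.single_file
  (* Expose the pipe as a product so that writers can be
     depended on like any other node. *)

  val write : t -> KEDSL.Program.t -> KEDSL.Program.t
  (* Wrap a program so its stdout is captured into the pipe. *)

  val read_in_shell : t -> string
  (* A shell fragment expanding to the pipe's contents, doing
     the quoting/escaping that is currently done by hand. *)
end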

hammer commented 8 years ago

Ketrew's engine also has a database to store workflow metadata; I could imagine making this database available to workflows as scratch space for key/value storage. I could also imagine that going horribly wrong as workflows could hose the engine's storage and, in the setting of Ketrew being used as a shared server, everyone could get killed.
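Just to make the idea concrete, a hypothetical scratch-space signature (nothing like this exists in Ketrew; namespacing per workflow would be one way to limit the blast radius):

module type SCRATCH = sig
  (* Hypothetical key/value scratch space backed by the engine's
     database; [namespace] isolates each workflow from the others
     and from the engine's own records. *)
  val set : namespace:string -> key:string -> string -> unit
  val get : namespace:string -> key:string -> string option
end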

hammer commented 8 years ago

It's probably best to not include this functionality in the workflow engine and just allow users to decide how to handle state between nodes themselves. For our lab's purposes, you could set up a little Redis server or something if you don't like files. Can you link to the code for the specific use case that motivated this issue?
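For concreteness, a minimal sketch of the Redis route, assuming redis-cli is installed on the execution host and can reach a running redis-server; store_in_redis and value_command are illustrative names, not Ketrew API:

let store_in_redis ~machine ~key ~value_command =
  let open KEDSL in
  let open Biokepi.Run_environment in
  let make =
    Machine.quick_command machine Program.(
        (* Run [value_command], capture its stdout, and store it
           under [key] on the Redis server. *)
        shf "redis-cli SET %s \"$(%s)\"" (Filename.quote key) value_command
      )
  in
  workflow_node without_product
    ~name:(sprintf "Store %s in Redis" key) ~make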

ihodes commented 8 years ago

The trouble is that passing state is rather clunky at the moment, though the newer API certainly makes it easier. For an example of storing the output of a call to Cycledash, see below. A "witness" file (a @smondet-ism) is used as the "product" of the workflow node, containing the HTTP response of the request made by the node's build process. This witness file can then be depended on (not shown; that part is much more complicated, requiring some shell-escaping and cat-ing in the witness file), and its path used in subsequent nodes.

A Program.t-compatible Pipe product could remove a lot of this boilerplate and shell-escaping, and make it clearer to a reader what the pipeline is trying to do when passing output through nodes.

let post_bam_to_cycledash ~project_name ~bam_path ~edges ~cycledash_url =
    let open KEDSL in
    let open Biokepi.Run_environment in
    let open Biokepi.Workflow_utilities in
    let name = sprintf "POST BAM to Cycledash: %s" bam_path in
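    (* The witness file will hold the HTTP response and doubles as the node's product. *)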
    let witness_file = bam_path ^ ".cycledash-post-bam-witness" in
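    (* Remove the witness when the node fails, so a rerun starts clean. *)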
    let rm_witness = Remove.file ~run_with:Demeter.machine witness_file in
    let host = Demeter.host in
    let make =
      Machine.quick_command Demeter.machine Program.(
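          (* [curl -f] makes the node fail on HTTP errors; stdout (the response) becomes the witness. *)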
          shf
            {s|curl -f -H 'Content-Type: application/json' %s/api/bams -d '{"uri": "%s", "projectName": "%s"}' > %s |s}
            cycledash_url bam_path project_name witness_file
        )
    in
    workflow_node (single_file ~host witness_file)
      ~name ~make
      ~edges:(edges @ [on_failure_activate rm_witness])
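For completeness, a hedged sketch of the consuming side described above (the part "not shown"): a node that depends on the witness node and cats the response into its own command. process_response is a placeholder for whatever the next step actually runs:

let use_cycledash_response ~witness_node ~witness_path =
  let open KEDSL in
  let open Biokepi.Run_environment in
  let name = sprintf "Use Cycledash response: %s" witness_path in
  let make =
    Machine.quick_command Demeter.machine Program.(
        (* cat the JSON response into the next command's argument
           list; the quoting here is exactly the boilerplate a Pipe
           product would hide. *)
        shf "process_response \"$(cat %s)\"" (Filename.quote witness_path)
      )
  in
  workflow_node without_product ~name ~make
    ~edges:[depends_on witness_node]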