kescobo / Hapi.jl

[WIP] Workflow software in julia
MIT License

Project Goals #1

Open kescobo opened 2 years ago

kescobo commented 2 years ago

Background

It is often desirable to run similar computational pipelines on multiple inputs. These pipelines may have multiple steps, each dependent on previous steps, as well as aggregation steps that rely on outputs from multiple different inputs. These pipelines must be reproducible, flexible, and extensible, and must be able to log steps, resource use, and provenance of outputs.

There are a lot of systems already built to do this (see the README for just the tip of the iceberg), but so far none in julia handles all of the use-cases that I personally need. This Discourse discussion from early 2020 has a decent amount of information on other systems, including some components of the julia ecosystem that could be used as templates or provide additional functionality.

In particular:

Why do this (again) in julia?

There are so many examples of this functionality, so why build another one? The simple answer is that I do everything else in julia, and the cognitive overhead of dealing with another language on top of that is high enough that the effort seems worth it.

But more than that, I think there are a few places that will allow julia to really shine in this space, especially:

  1. reproducibility - with BinaryBuilder, using julia projects to manage software dependencies is ideal. For anything not covered by BinaryBuilder, there's always Conda.jl or CondaPkg.jl
  2. interoperability - PythonCall.jl, RCall.jl, plus utilities for working with C/Fortran code
  3. Shell integration - managing processes in python is a pain. I don't know about other languages, but julia really shines here.
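For example, chaining shell tools and redirecting their output needs nothing beyond the standard library (a small illustration; the file names are invented):

# Chain three shell tools and send the result to a file:
run(pipeline(`gzip -dc input.txt.gz`, `sort`, `uniq -c`, "counts.txt"))

# Commands are first-class values, so checking for tools is easy:
success(`which sbatch`) || @warn "sbatch not found; running locally"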

Functionality

My priorities are based entirely on my own needs, and on what I know is possible based on my previous use of (mostly) snakemake. I'm open to suggestions (especially if they come with PRs!). My typical needs are:

Essentials

  1. Specify a generic computational pipeline that starts with one or more file inputs and generates file outputs. The pipeline can mix in-julia steps with calls out to shell programs.
  2. Automatically match files or groups of files according to a particular pattern, and apply the computational pipeline from step 1 to those files or groups of files.
  3. Specify steps that apply to all outputs from steps 1 and 2. These should be able to wait on all upstream steps, with options to proceed if a previous step errors or times out.
  4. Run all steps locally, or dispatch to a cluster manager (eg SLURM); see the sketch after this list
    • Be able to specify job- / task-specific resource requirements for cluster jobs
  5. Maintain logs of each step's status and resource use. Ability to re-run from the middle, possibly modifying resource requests on cluster jobs (eg increasing time or memory allotment)
  6. Write generic steps for (1) that can be mixed and matched / composed across different pipelines.
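For item 4, the basic building blocks already exist in Distributed plus ClusterManagers.jl. A minimal sketch (the resource values are placeholders, and the exact flag names depend on the site's SLURM setup):

using Distributed, ClusterManagers

# Request 4 SLURM workers with job-specific resources; keyword
# arguments are passed through to srun (placeholder values):
addprocs(SlurmManager(4); partition = "general", t = "01:00:00")

# Stand-in for a real per-input pipeline step:
@everywhere process_one(file) = (file, filesize(file))

results = pmap(process_one, readdir("inputs/"; join = true))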

Nice-to-haves

Design Plan

File handling

I have already implemented some basic utilities for identifying files that match certain patterns. Eg.

julia> rgx = build_regex("file{thing1}_{thing2}.txt", (thing1=raw"\d+", thing2=raw"\d+"));

julia> readdir("../test/projects/fileglob/")
4-element Vector{String}:
 "file1_1.txt"
 "file1_2.txt"
 "file2_1.txt"
 "file2_2.txt"

julia> glb = glob_pattern("../test/projects/fileglob/", rgx)
4-element Vector{FileDependency}:
 path: ../test/projects/fileglob/file1_1.txt
    params: (thing1 = "1", thing2 = "1")
 path: ../test/projects/fileglob/file1_2.txt
    params: (thing1 = "1", thing2 = "2")
 path: ../test/projects/fileglob/file2_1.txt
    params: (thing1 = "2", thing2 = "1")
 path: ../test/projects/fileglob/file2_2.txt
    params: (thing1 = "2", thing2 = "2")

julia> dep_groups = groupdeps(glb, [:thing1])
2-element Vector{FileDepGroup}:
 FileDepGroup with 2 dependencies
    shared params: (thing1 = "1",)
 FileDepGroup with 2 dependencies
    shared params: (thing1 = "2",)

Still need to add things like isfile, stat, and other functions that should pass through to the underlying file. Would also like to be able to integrate with FilePaths.jl here.
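A sketch of what those pass-throughs could look like, assuming a path accessor in the same style as the params accessor used below (hypothetical, not implemented):

# Forward common filesystem queries to the underlying file.
Base.isfile(dep::FileDependency) = isfile(path(dep))
Base.stat(dep::FileDependency) = stat(path(dep))
Base.mtime(dep::FileDependency) = mtime(path(dep))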

Also, it would be good to be able to generate output paths based on these patterns.
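One possible shape for that is a helper that substitutes matched params back into a pattern (hypothetical, not implemented):

# Fill {key} placeholders in a pattern with values from a params NamedTuple.
function expand_pattern(pattern::AbstractString, params::NamedTuple)
    for (key, val) in pairs(params)
        pattern = replace(pattern, "{$key}" => string(val))
    end
    return pattern
end

expand_pattern("output{thing1}_{thing2}.txt", (thing1 = "1", thing2 = "2"))
# "output1_2.txt"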

Build pipelines

Next, I'd like users to be able to write julia functions that take FileDepGroups or FileDependencys as arguments, run some computation, and return FileDepGroups or FileDependencys.

function step1(files::FileDepGroup)
    thing = params(files).thing1
    # do stuff
    return FileDependency("path/to/output_$thing.txt", (; thing1 = thing)) 
end
# etc
@run! dep_groups |> step1 |> step2 |> step3

It would be good to also be able to pass on kwargs, specify the compute environment (eg SLURM parameters) etc.

At one point, I had messed around with Dagger.jl, and had my functions (eg step1()) begin with isfile(outputfile) && return outputfile so that if a step had already completed, it would get skipped. Having some kind of macro that automates this would be ideal, but I don't have the skills to implement that.
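Something like this minimal macro might be a starting point (untested sketch; it assumes the output path can be computed before the body runs):

# Skip-if-done: return the output path early if the file already exists,
# otherwise run the body and return the path.
macro skip_if_done(output, body)
    quote
        local out = $(esc(output))
        isfile(out) && return out
        $(esc(body))
        out
    end
end

function step1(infile::AbstractString)
    outfile = replace(infile, ".txt" => "_processed.txt")
    @skip_if_done outfile begin
        # expensive work that writes outfile
    end
end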

Logging

Should use julia's extensive logging functionality to provide a number of log streams:

  1. Overall pipeline, including processes launched / re-started / timed out / errored, files consumed / created, and cluster jobs launched if applicable
  2. Individual step logs from julia
  3. stdout / stderr for each individual step
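A sketch of how those streams might be wired up, using the stdlib Logging plus LoggingExtras.jl (file names invented):

using Logging, LoggingExtras

# Streams 1 and 2: tee events to the console and a pipeline-level log file.
global_logger(TeeLogger(
    ConsoleLogger(stderr, Logging.Info),
    FileLogger("pipeline.log"),
))

@info "step launched" step="step1" inputs=2

# Stream 3: per-step stdout / stderr captured via pipeline:
run(pipeline(`echo running step1`; stdout = "step1.out", stderr = "step1.err"))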

... there's probably more, but I need to eat

kescobo commented 2 years ago

cc @jpsamaroo