nbren12 / dask.targeted

Enabling on-disk and other persistence mechanisms in dask
2 stars 1 forks source link

=============================== dask.targeted

Using dask with luigi targets.

There are many tools out there which are used for automatic scientific workflows by having the users specify some sort of dependency graph between files. These tools are especially popular in bioinformatics, but are broadly applicable in many disciplines. Some popular tools for this include make, luigi, Snakemake, nextflow, and many others. I like these tools a lot, but it can be tedious breaking up a lengthy python script into pieces and manually specifing the dependency graph by hand. This effort seems especially redudant because dask already builds a computational graph behind the scenes.

The basic idea of this package is to use dask to build a graph between persistent objects using an interface similar to dask.delayed. Luigi already has wide-array of targets that represent different types of persistent data stores (e.g. database, file system, etc), that we can leverage to make this work easier.

This package provides the following syntax for building a graph between persistent objects

.. code:: python

from dask.delayed import delayed
from targeted.simple import TargetedCallback, targeted
import luigi

@delayed
def fun():
    print("Calling fun")
    return "hello world"

def string_read(obj):
    return obj.open("r").read()

def string_write(target, obj):
    with target.open("w") as f:
        f.write(obj)

tgt = luigi.LocalTarget("hello.txt")

with TargetedCallback():

    a = fun()
    b = targeted(tgt, reader=string_read, writer=string_write)(a)

    print(b.compute()) # calls fun and saves to hello.txt
    print(b.compute()) # loads text from hello.txt