grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

Run multiple run files? #135

Closed. niemasd closed this issue 3 years ago.

niemasd commented 3 years ago

I'm writing a workflow that produces a separate Reflow run file for each sample I want to analyze (I know I could create a single run file and execute it with parameters for each sample, but I prefer keeping them separate because the run file can serve as a versioned history of how exactly I processed files, which is useful for future reference). For example, imagine I have the following 3 Reflow run files:

sample1.rf
sample2.rf
sample3.rf

I'm able to run them one-by-one as follows:

reflow run sample1.rf
reflow run sample2.rf
reflow run sample3.rf

I want to execute these runs simultaneously. I tried the following:

reflow run sample1.rf sample2.rf sample3.rf

But this doesn't work, and it gives me the following error:

usage of /home/niema/sample1.rf:

I also tried the following:

reflow runbatch sample1.rf sample2.rf sample3.rf

But this also doesn't work, and gives me the following:

open config.json: no such file or directory

I would strongly prefer simply batch-running the individual rf files rather than somehow merging them into a single config.json file. Any guidance would be greatly appreciated 😄

prb2 commented 3 years ago

As you noted, we have reflow runbatch (and the related reflow genbatch) which are the preferred method for running batches.
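
For reference, the `config.json` that `reflow runbatch` complains about is the batch definition file. A minimal sketch (the file names `pipeline.rf` and `samples.csv` here are placeholders, and the exact schema should be checked against the Reflow documentation) looks roughly like:

```json
{
    "program": "pipeline.rf",
    "runs_file": "samples.csv"
}
```

where each row of the runs file supplies the parameters for one run of the program.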

If you prefer to have separate reflow files, here's one way to do it:

file one.rf contains:

val Main = exec(image := "ubuntu", cpu := 1, mem := 10*MiB) (out file) {"
                sleep 1
                echo "One slept for 1 second" > {{out}}
"}

file two.rf contains:

val Main = exec(image := "ubuntu", cpu := 1, mem := 10*MiB) (out file) {"
                sleep 2
                echo "Two slept for 2 seconds" > {{out}}
"}

file three.rf contains:

val Main = exec(image := "ubuntu", cpu := 1, mem := 10*MiB) (out file) {"
                sleep 3
                echo "Three slept for 3 seconds" > {{out}}
"}

file main.rf contains:

val Main = [
        make("one.rf").Main,
        make("two.rf").Main,
        make("three.rf").Main,
]

Then, run reflow on main.rf:

$ reflow run main.rf
...
2021/05/15 10:05:29 total n=3 time=36s
        ident      n   ncache transfer runtime(m) cpu         mem(GiB)    disk(GiB)   tmp(GiB)    requested
        one.Main   1   0      0B       0/0/0      0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0 {mem:500.0MiB cpu:1 disk:0B}
        three.Main 1   0      0B       0/0/0      0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0 {mem:500.0MiB cpu:1 disk:0B}
        two.Main   1   0      0B       0/0/0      0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0 {mem:500.0MiB cpu:1 disk:0B}

2021/05/15 10:05:29 result: [7d898f09, 637cc100, e9f59af4]

Alternatively, you can look into GNU Parallel, which should let you run reflow for each of your source files in parallel. I don't have any experience with that tool, but it seems like it will do what you need.

niemasd commented 3 years ago

Thanks for the suggestions! Super helpful!

I was considering using GNU Parallel (I use it quite often, and it would actually be my preferred solution), and my only worry was whether there would be issues with running many (e.g., thousands of) reflow run commands at the same time. Will the multiple reflow run commands interfere with each other when running simultaneously?

EDIT: I ended up writing a simple script that produces a merged RF file just like your example main.rf from a bunch of user-given RF files, and it seems to work great! I'll stick with this and not mess with GNU parallel because this merged RF file itself is a nice record for keeping track of what was run. Here's the script in case it helps anyone:

https://github.com/niemasd/ViReflow/blob/main/rf_batch.py
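
For a rough sense of what such a script can do, here is a minimal, hypothetical sketch (not the actual rf_batch.py linked above) that emits a merged main.rf in the same style as the earlier example:

```python
def make_batch_rf(rf_files):
    """Generate Reflow source whose Main runs each input file's Main.

    rf_files: list of .rf file paths; each is wrapped in a make(...) call
    so all of them execute as one combined run.
    """
    lines = ["val Main = ["]
    for rf in rf_files:
        lines.append(f'    make("{rf}").Main,')
    lines.append("]")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    import sys
    # Usage: python rf_batch.py sample1.rf sample2.rf ... > main.rf
    sys.stdout.write(make_batch_rf(sys.argv[1:]))
```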