grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0
965 stars 52 forks source link

Dealing with multiple potential outputs #102

Closed olgabot closed 5 years ago

olgabot commented 5 years ago
// Convert SRA files to FastQ
// Recommended flags from https://edwards.sdsu.edu/research/fastq-dump/
// and Trinity documentation:
// > If your data come from SRA, be sure to dump the fastq file like so:
// >    SRA_TOOLKIT/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra
func FastqDump(sra file) = {
    outdir := exec(image := fastq_dump, cpu := 1,
                    disk := fastq_dump_disk*GiB) (outdir dir) {"
        fastq-dump --outdir {{outdir}} --gzip \
            --skip-technical  --readids --read-filter pass \
            --dumpbase --split-3 --clip --defline-seq '@$sn[_$rn]/$ri' \
            --split-files \
            {{sra}}
    "}
    x := trace(outdir)

    val (singletons, _) = dirs.Pick(outdir, "*pass.fastq.gz")
    val (R1, _) = dirs.Pick(outdir, "*pass_1.fastq.gz")
    val (R2, _) = dirs.Pick(outdir, "*pass_2.fastq.gz")

    {singletons, R1, R2}
}

In reality, any one of singletons, R1 or R2 could not exist in the directory, as in the example below:

log.SRX1460488-     cpu mean=0.2 max=0.4
log.SRX1460488-     mem mean=120.3MiB max=134.6MiB
log.SRX1460488-     disk mean=628.1MiB max=1.2GiB
--
log.SRX1460488:     {{outdir}} =
log.SRX1460488-         0_pass.fastq.gz sha256:b877baab86d71407e6ac85a22ee5583b4f6d193a7e2945b8eb96d364572a3131 2.0GiB
log.SRX1460488- profile:
log.SRX1460488-     cpu mean=1.0 max=1.0
log.SRX1460488-     mem mean=29.4MiB max=40.5MiB
log.SRX1460488-     disk mean=1006.2MiB max=2.0GiB
--
log.SRX1460489:     {{outdir}} =
log.SRX1460489-         2065155/SRR2971104.sra sha256:4c2325b3b361110c964b2964a0934c8b053d1c555fab7968cc491cc4ecd8f321 16.7GiB
log.SRX1460489- profile:
log.SRX1460489-     cpu mean=0.2 max=0.8
log.SRX1460489-     mem mean=157.8MiB max=181.9MiB
log.SRX1460489-     disk mean=8.2GiB max=16.6GiB
--
log.SRX1460489:     {{outdir}} =
log.SRX1460489-         0_pass_1.fastq.gz sha256:83886112ae5bd59a61f9e190180b8ec3a45988dd2ef4cdca253cce4a7f5f4930 10.9GiB
log.SRX1460489-         0_pass_2.fastq.gz sha256:e0a9e36921e3be66b2ba424054722dc0820dd0681fe4b4c2ab917a84f6d9237b 11.0GiB
log.SRX1460489- profile:
log.SRX1460489-     cpu mean=1.0 max=1.0
log.SRX1460489-     mem mean=35.2MiB max=62.0MiB

So the code is erroring out because there aren't *pass_{1,2}.fastq.gz files in some cases, and no *pass.fastq.gz files in other cases:

 Wed  6 Feb - 07:17  ~/code/kmer-hashing/ancient_stem_cells/ephydatia_fluviatilis/individual_srx_ids   olgabot/ancient-stem-cells 2☀ 1● 2‒ 
  grep fastq.gz log*
log.SRX1460483:         0_pass.fastq.gz sha256:7cdc287803201d0205e6876ba9cc0d864550d3a93cc329189473d2381858e3ee 2.4GiB
log.SRX1460483:2019/02/05 12:07:43 marking run done after nonrecoverable error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460483:2019/02/05 12:07:43 run SRX1460483: state: done error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460484:         0_pass.fastq.gz sha256:5549edc35851c7251ad1ee123036ef141cadbe22a7639a025a554386779756c1 2.1GiB
log.SRX1460484:2019/02/05 12:04:37 marking run done after nonrecoverable error evaluation error: dirs.Pick: no files matched *pass_2.fastq.gz
log.SRX1460484:2019/02/05 12:04:37 run SRX1460484: state: done error evaluation error: dirs.Pick: no files matched *pass_2.fastq.gz
log.SRX1460485:         0_pass.fastq.gz sha256:22bc55ac024455a036f85c5199b0601390f88ae3fbd564133e0d5c9f0cd456e3 2.2GiB
log.SRX1460485:2019/02/05 12:07:38 marking run done after nonrecoverable error evaluation error: dirs.Pick: no files matched *pass_2.fastq.gz
log.SRX1460485:2019/02/05 12:07:38 run SRX1460485: state: done error evaluation error: dirs.Pick: no files matched *pass_2.fastq.gz
log.SRX1460486:         0_pass.fastq.gz sha256:3e1fc1cacfda13e519cdf6e1bea418cb5c28d4798b565cc4a39a630d134daf76 2.1GiB
log.SRX1460486:2019/02/05 12:06:01 marking run done after nonrecoverable error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460486:2019/02/05 12:06:01 run SRX1460486: state: done error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460487:         0_pass.fastq.gz sha256:2d422d412851d59f9ae1c7dd3d0bef5dde3e341a72806a383b5f0e4c4cb1e9db 2.1GiB
log.SRX1460487:2019/02/05 12:06:53 marking run done after nonrecoverable error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460487:2019/02/05 12:06:53 run SRX1460487: state: done error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460488:         0_pass.fastq.gz sha256:b877baab86d71407e6ac85a22ee5583b4f6d193a7e2945b8eb96d364572a3131 2.0GiB
log.SRX1460488:2019/02/05 12:04:01 marking run done after nonrecoverable error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460488:2019/02/05 12:04:01 run SRX1460488: state: done error evaluation error: dirs.Pick: no files matched *pass_1.fastq.gz
log.SRX1460489:         0_pass_1.fastq.gz sha256:83886112ae5bd59a61f9e190180b8ec3a45988dd2ef4cdca253cce4a7f5f4930 10.9GiB
log.SRX1460489:         0_pass_2.fastq.gz sha256:e0a9e36921e3be66b2ba424054722dc0820dd0681fe4b4c2ab917a84f6d9237b 11.0GiB
log.SRX1460489:2019/02/05 14:53:42 marking run done after nonrecoverable error evaluation error: dirs.Pick: no files matched *pass.fastq.gz
log.SRX1460489:2019/02/05 14:53:42 run SRX1460489: state: done error evaluation error: dirs.Pick: no files matched *pass.fastq.gz

I'd like to string this download_sra.rf script to more stuff down the line, but some of them require R1/R2 distinction, and others don't. What's the best way to deal with these sorts of outputs?

mariusae commented 5 years ago

The newly added sum types would be a good way to model this.