grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0
964 stars 52 forks source link

Getting `bwa index` to work - aka dealing with bioinf package with naming requirements #90

Closed olgabot closed 5 years ago

olgabot commented 5 years ago

Like many bioinformatics packages, bwa index relies on creating files based on the input filenames. Here's the stdout for running bwa index locally:

``` (bwa) Wed 14 Nov - 19:07  ~/scratch/tick-genome/dna/2018-10-10-dovetail   bwa index ixodes_scapularis_06May2018_kaVVW.fasta [bwa_index] Pack FASTA... 30.99 sec [bwa_index] Construct BWT for the packed sequence... [BWTIncCreate] textLength=3588917780, availableWord=264529096 [BWTIncConstructFromPacked] 10 iterations done. 99999988 characters processed. [BWTIncConstructFromPacked] 20 iterations done. 199999988 characters processed. [BWTIncConstructFromPacked] 30 iterations done. 299999988 characters processed. [BWTIncConstructFromPacked] 40 iterations done. 399999988 characters processed. [BWTIncConstructFromPacked] 50 iterations done. 499999988 characters processed. [BWTIncConstructFromPacked] 60 iterations done. 599999988 characters processed. [BWTIncConstructFromPacked] 70 iterations done. 699999988 characters processed. [BWTIncConstructFromPacked] 80 iterations done. 799999988 characters processed. [BWTIncConstructFromPacked] 90 iterations done. 899999988 characters processed. [BWTIncConstructFromPacked] 100 iterations done. 999999988 characters processed. [BWTIncConstructFromPacked] 110 iterations done. 1099999988 characters processed. [BWTIncConstructFromPacked] 120 iterations done. 1199999988 characters processed. [BWTIncConstructFromPacked] 130 iterations done. 1299999988 characters processed. [BWTIncConstructFromPacked] 140 iterations done. 1399999988 characters processed. [BWTIncConstructFromPacked] 150 iterations done. 1499999988 characters processed. [BWTIncConstructFromPacked] 160 iterations done. 1599999988 characters processed. [BWTIncConstructFromPacked] 170 iterations done. 1699999988 characters processed. [BWTIncConstructFromPacked] 180 iterations done. 1799999988 characters processed. [BWTIncConstructFromPacked] 190 iterations done. 1899999988 characters processed. [BWTIncConstructFromPacked] 200 iterations done. 1999999988 characters processed. [BWTIncConstructFromPacked] 210 iterations done. 2099999988 characters processed. [BWTIncConstructFromPacked] 220 iterations done. 2199999988 characters processed. [BWTIncConstructFromPacked] 230 iterations done. 2299999988 characters processed. [BWTIncConstructFromPacked] 240 iterations done. 2399999988 characters processed. [BWTIncConstructFromPacked] 250 iterations done. 2499999988 characters processed. [BWTIncConstructFromPacked] 260 iterations done. 2599999988 characters processed. [BWTIncConstructFromPacked] 270 iterations done. 2699999988 characters processed. [BWTIncConstructFromPacked] 280 iterations done. 2799999988 characters processed. [BWTIncConstructFromPacked] 290 iterations done. 2899999988 characters processed. [BWTIncConstructFromPacked] 300 iterations done. 2995671796 characters processed. [BWTIncConstructFromPacked] 310 iterations done. 3080783396 characters processed. [BWTIncConstructFromPacked] 320 iterations done. 3156426804 characters processed. [BWTIncConstructFromPacked] 330 iterations done. 3223654836 characters processed. [BWTIncConstructFromPacked] 340 iterations done. 3283403284 characters processed. [BWTIncConstructFromPacked] 350 iterations done. 3336503828 characters processed. [BWTIncConstructFromPacked] 360 iterations done. 3383695748 characters processed. [BWTIncConstructFromPacked] 370 iterations done. 3425636004 characters processed. [BWTIncConstructFromPacked] 380 iterations done. 3462908564 characters processed. [BWTIncConstructFromPacked] 390 iterations done. 3496032516 characters processed. [BWTIncConstructFromPacked] 400 iterations done. 3525469140 characters processed. 4[BWTIncConstructFromPacked] 410 iterations done. 3551628436 characters processed. [BWTIncConstructFromPacked] 420 iterations done. 3574874836 characters processed. [bwt_gen] Finished constructing BWT in 427 iterations. [bwa_index] 1324.65 seconds elapse. [bwa_index] Update BWT... 14.92 sec [bwa_index] Pack forward-only FASTA... 18.06 sec [bwa_index] Construct SA from BWT and Occ... 469.01 sec [main] Version: 0.7.17-r1188 [main] CMD: bwa index ixodes_scapularis_06May2018_kaVVW.fasta [main] Real time: 2068.784 sec; CPU: 1857.620 sec ```

After running bwa index locally, there's quite a few files that get output:

(bwa)
 Wed 14 Nov - 19:41  ~/scratch/tick-genome/dna/2018-10-10-dovetail 
  ll
Permissions Size User Date Modified Name
.rwxr-xr-x  1.8G olga 10 Oct 17:57  ixodes_scapularis_06May2018_kaVVW.fasta
.rwxr-xr-x  8.1M olga 14 Nov 19:33  ixodes_scapularis_06May2018_kaVVW.fasta.amb
.rwxr-xr-x  4.4M olga 14 Nov 19:33  ixodes_scapularis_06May2018_kaVVW.fasta.ann
.rwxr-xr-x  1.8G olga 14 Nov 19:32  ixodes_scapularis_06May2018_kaVVW.fasta.bwt
.rwxr-xr-x  448M olga 14 Nov 19:33  ixodes_scapularis_06May2018_kaVVW.fasta.pac
.rwxr-xr-x  897M olga 14 Nov 19:41  ixodes_scapularis_06May2018_kaVVW.fasta.sa

Here's what I'm trying to do to get bwa index to work:

func Index(fasta file, id string) = {
    // Create a bwt index of the fasta

    d := dirs.Make([id + ".fasta": fasta])

    outdir := exec(image := bwa, cpu := 1) (outdir dir) {"
        cd {{d}}
        bwa index {{id}}.fasta
    "}
    d := trace(d)

     // NOTE: outdir is not being used anywhere, so it wont get evaluated
    outdir := trace(outdir)
    d := trace(d)

    val (index, _) = dirs.Pick(d, "*.bwt")

    // to force outdir evaluation, remove above line and uncomment below.

    val (index, _) = outdir ~> dirs.Pick(d, "*.bwt")

    index
}

But when I inspect either d or outdir, all they have is the fasta!! Where did the output for bwa index go??

(kmer-hashing) 
 Wed 14 Nov - 19:48  ~/code/tick-genome/pre_assembly_qc/full_workflow   origin ☊ olgabot/align-qc ✔ 8☀ 
  make tick2                   
sudo reflow run  -local -localdir /home/olga/scratch/reflow-tmp/ -eval "bottomup"  ../../reflow/pre-assembly.rf -minlength 135 -output s3://tick-genome/dna/2018-06-28/pre-assembly_v3/ -adapter_list ../nextera_pe_adapters.tsv -references z
[sudo] password for olga: 
reflow: run ID: 9444e403
../../reflow/bwa.rf:24:20(bwa.Index.outdir): dir()
../../reflow/bwa.rf:24:20(bwa.Index.outdir): dir()
../../reflow/bwa.rf:21:15(bwa.Index.d): dir("dovetail_2018-05-06.fasta": file(sha256=sha256:3a72c5bb77cb0a133ce77b8055d79878c040bf34de2c7b4f513592d8070ac0ac, size=464355639))
../../reflow/bwa.rf:21:15(bwa.Index.d): dir("jcvi_ise6.fasta": file(sha256=sha256:766bbd89408379f12e644f242511db72854c348f349db59e601d2c42bdcd7d1b, size=1001461460))
../../reflow/bwa.rf:25:15(bwa.Index.d): dir("dovetail_2018-05-06.fasta": file(sha256=sha256:3a72c5bb77cb0a133ce77b8055d79878c040bf34de2c7b4f513592d8070ac0ac, size=464355639))
../../reflow/bwa.rf:25:15(bwa.Index.d): dir("jcvi_ise6.fasta": file(sha256=sha256:766bbd89408379f12e644f242511db72854c348f349db59e601d2c42bdcd7d1b, size=1001461460))
reflow: assoc.Get flow 69a026d0 state FlowLookup {} extern url "s3://tick-genome/dna/2018-06-28/pre-assembly_v3/" deps cef9af32: RequestCanceled: request context canceled
caused by: context canceled
reflow: assoc.Get flow 69a026d0 state FlowLookup {} extern url "s3://tick-genome/dna/2018-06-28/pre-assembly_v3/" deps cef9af32: RequestCanceled: request context canceled
caused by: context canceled
dirs.Pick: no files matched *.bwt
Makefile:18: recipe for target 'tick2' failed
make: *** [tick2] Error 11
olgabot commented 5 years ago

I think got this working, but I had to do some weird filesystem stuff to make the eval actually evaluate.

func Index(fasta_gz file, id string) = {
    // Create a bwt index of the fasta

    d := dirs.Make([id + ".fasta.gz": fasta_gz])

    // BWA accepts gzipped files as input
    outdir := exec(image := bwa, cpu := 1) (outdir dir) {"
        cd {{d}}
        bwa index {{id}}.fasta.gz
        ls -lha {{d}}
        mv {{d}} {{outdir}}
    "}
    // NOTE: outdir is not being used anywhere, so it wont get evaluated
    outdir := trace(outdir)
    d := trace(d)

    // val (index, _) = dirs.Pick(d, "*.bwt")

    // to force outdir evaluation, remove above line and uncomment below.

    // val (index, _) = outdir ~> dirs.Pick(d, "*.bwt")
    outdir
}

But ... outdir still only had the fasta in it, not any of the index files:

../../reflow/bwa.rf:24:20(bwa.Index.outdir): dir("0/dovetail_2018-05-06.fasta": file(sha256=sha256:3a72c5bb77cb0a133ce77b8055d79878c040bf34de2c7b4f513592d8070ac0ac, size=464355639))

Full output:

``` sudo reflow run -local -localdir /home/olga/scratch/reflow-tmp/ -eval "bottomup" ../../reflow/pre-assembly.rf -minlength 135 -output s3://tick-genome/dna/2018-06-28/pre-assembly_v3/ -adapter_list ../nextera_pe_adapters.tsv -references 's3://tick-genome/dna/2018-10-10-dovetail/ixodes_scapularis_06May2018_kaVVW.fasta.gz|s3://tick-genome/dna/2018-10-11_ise6_asm2.2/GCA_002892825.2_ISE6_asm2.2_deduplicated_genomic.fna.gz' -references_ids 'dovetail_2018-05-06|jcvi_ise6' -id tick_1_S1 -read1 s3://czbiohub-seqbot/fastqs/180628_A00111_0168_AHFVJVDMXX/rawdata/tick_1_S1_R1_001.fastq.gz -read2 s3://czbiohub-seqbot/fastqs/180628_A00111_0168_AHFVJVDMXX/rawdata/tick_1_S1_R2_001.fastq.gz reflow: run ID: 6df8f17a ../../reflow/bwa.rf:24:20(bwa.Index.outdir): dir("0/dovetail_2018-05-06.fasta": file(sha256=sha256:3a72c5bb77cb0a133ce77b8055d79878c040bf34de2c7b4f513592d8070ac0ac, size=464355639)) transfers: done: 2 1.4GiB, transferring: 3 401.2GiB, waiting: 0 0B s3:czbiohub-reflow-..->anon:0xc4664d7a40: done: 2 1.4GiB, transferring: 3 401.2GiB, waiting: 0 0B 37m19s 6df8f17a: elapsed: 37m, running:1, transferring:6, completed: 22/59 bwa.Index.outdir: exec ..io/biocontainers/bwa:0.7.4--ha92aebf_0 cd {{d}}.bwa index jcvi_ise6...lha {{d}}.mv {{d}} {{outdir}} 37m19s khmer.Hist: transfer 105.7GiB 37m9s kmergenie.Run: transfer 109.7GiB 37m9s khmer.Hist: transfer 109.7GiB 37m9s kmergenie.Run: transfer 105.7GiB 37m9s khmer.NormalizeByMedian: transfer 185.9GiB 37m2s bwa.Mem: transfer 215.4GiB 1m3s ```

Plus this doesn't seem like best practices to me ... is there a better way?

olgabot commented 5 years ago

I also don't see any of the stdout/stderr normally produced by the bwa index command, even in the execlog. Is there a way to find it?

mariusae commented 5 years ago

You probably want to run the command in outdir. e.g., from the example 1000align:

// g1kv37Indexed is the BWA-indexed version of g1kv37.
val g1kv37Indexed = exec(image := "biocontainers/bwa", mem := GiB, cpu := 1) (out dir) {"
    # Ignore failures here. The file from 1000genomes has a trailer
    # that isn't recognized by gunzip. (This is not recommended practice!)
    gunzip -c {{g1kv37}} > {{out}}/g1k_v37.fa || true
    cd {{out}}
    bwa index -a bwtsw g1k_v37.fa
"}
mariusae commented 5 years ago

I also don't see any of the stdout/stderr normally produced by the bwa index command, even in the execlog. Is there a way to find it?

This is persisted but isn't currently easily accessible via tooling. @prasadgopal is very close to landing a change that makes reflow logs (as well as reflow info) work for non-running tasks as well.

If you have the flow id (the exec sha256), then you can look this up in the dynamodb table that stores the assoc (you can get that from your config, e.g.: assoc: dynamodb,reflow-cache-test).

The entry for the sha256 will contain a "Logs" section. This is the sha256 of the stdout and stderr of the process.

screen shot 2018-11-29 at 4 43 43 pm

Once you have this, you can examine the logs with reflow cat <sha256 of log>.

Again, the tooling around this will massively improve once @prasadgopal lands his change. The goal is to have reflow info, reflow logs, reflow cat, etc. just work for all object types. So if you see an identifier anywhere in reflow, you should be able to query it with reflow info, etc.