rabix / rabix

[Historical] Reproducible Analyses for Bioinformatics
http://rabix.io
GNU Affero General Public License v3.0

rabix calling grep an extra time w/ no args for some reason #133

Open jstjohn opened 9 years ago

jstjohn commented 9 years ago

My workflow uses the reference implementations of grep and wc from the Festival of Genomics:

#!/usr/bin/env cwl-runner
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
inputs:
  - id: "#mutation"
    type: string
  - id: "#normalin"
    type: File
  - id: "#tumorin"
    type: File
outputs:
  - id: "#outfile"
    type: File
    source: "#wc.outfile"
steps:
  - id: "#greptumor"
    run: {import: grep.cwl.yaml}
    #scatter: "#grep.infile"
    inputs:
      - id: "#grep.pattern"
        source: "#mutation"
      - id: "#grep.infile"
        source: "#tumorin"
    outputs:
      - id: "#greptumor.outfile"
  - id: "#grepnormal"
    run: {import: grep.cwl.yaml}
    #scatter: "#grep.infile"
    inputs:
      - id: "#grep.pattern"
        source: "#mutation"
      - id: "#grep.infile"
        source: "#normalin"
    outputs:
      - id: "#grepnormal.outfile"
  - id: "#wc"
    run: {import: wc.cwl.yaml}
    inputs:
      - id: "#wc.infile"
        source: ["#grepnormal.outfile",  "#greptumor.outfile"]
    outputs:
      - id: "#wc.outfile"

And here is the output in rabix:

> rabix mutationfinder.cwl.yaml -v -- --mutation GCATCCA --normalin normal.fastq --tumorin tumor.fastq 
INFO:rabix.cli.cli_app:Running: grep GCATCCA /Users/john/workspaces/commonwl_examples/t790m_detector/tumor.fastq > out.txt
INFO:rabix.common.models:File grepnormal-process_144239_0 created.
INFO:rabix.cli.cli_app:Running: grep GCATCCA /Users/john/workspaces/commonwl_examples/t790m_detector/normal.fastq > out.txt
INFO:rabix.common.models:File grepnormal-process_144239_1 created.
INFO:rabix.cli.cli_app:Running: grep > out.txt
INFO:rabix.common.models:File greptumor-process_144239 created.
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
    [-e pattern] [-f file] [--binary-files=value] [--color=when]
    [--context[=num]] [--directories=action] [--label] [--line-buffered]
    [--null] [pattern] [file ...]
Command failed with exit status 2

vs successful output in cwltool:

> cwltool mutationfinder.cwl.yaml --mutation GCATCCA --normalin normal.fastq --tumorin tumor.fastq 
/anaconda/bin/cwltool 1.0.20151026181844
[job 4386250704] /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmp4tkznh$ grep GCATCCA /Users/john/workspaces/commonwl_examples/t790m_detector/tumor.fastq > /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmp4tkznh/out.txt
[job 4386357136] /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmpc0c7ga$ grep GCATCCA /Users/john/workspaces/commonwl_examples/t790m_detector/normal.fastq > /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmpc0c7ga/out.txt
[job 4386356176] /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmppgB31K$ wc -l /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmpc0c7ga/out.txt /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmp4tkznh/out.txt > /var/folders/8r/g00jq11j2yb586cz04tfpydc0000gn/T/tmppgB31K/out.txt
[workflow 4404917968] outdir is /Users/john/workspaces/commonwl_examples/t790m_detector
Final process status is success
{
    "outfile": {
        "path": "/Users/john/workspaces/commonwl_examples/t790m_detector/out.txt", 
        "checksum": "sha1$60e96336589bc40d047756820f8eed0e4c459ef1", 
        "class": "File", 
        "size": 167
    }
}
StarvingMarvin commented 8 years ago

Thanks for reporting this issue.

I'm not sure whether the content of grep.cwl.yaml makes any difference, but with the versions of grep and wc I supplied to this workflow I couldn't reproduce your exact error. I couldn't run the workflow successfully either, though. The change I needed to make to run it through rabix was to give the step inputs distinct IDs, changing #grep.pattern to #greptumor.pattern and #grepnormal.pattern, and so on.
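
For example, the #greptumor step would end up looking roughly like this (just a sketch of the renaming; the #grepnormal step changes analogously):

  - id: "#greptumor"
    run: {import: grep.cwl.yaml}
    inputs:
      # step input ids made unique per step instead of the shared "#grep.*" ids
      - id: "#greptumor.pattern"
        source: "#mutation"
      - id: "#greptumor.infile"
        source: "#tumorin"
    outputs:
      - id: "#greptumor.outfile"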

At this point I'm mostly confused about why cwltool worked with the duplicate IDs, and I'm more inclined to fix this issue by erroring out early when a duplicate is detected, but I'll first have to confirm what behavior is mandated by the spec.