rabix / bunny

[Legacy] Executor for CWL workflows. Executes sbg:draft-2 and CWL 1.0
http://rabix.io
Apache License 2.0

Bunny GDC support #176

Open sinisa88 opened 7 years ago

sinisa88 commented 7 years ago

Umbrella for specific issues regarding support for the GDC workflow.

jeremiahsavage commented 7 years ago

The merge_sqlite tool of the GDC DNASeq workflow depends upon this feature: https://github.com/rabix/bunny/issues/193

jeremiahsavage commented 7 years ago

The picard_mergesamfiles tool depends upon using $(self.basename) in the workflow ( https://github.com/NCI-GDC/gdc-dnaseq-cwl/blob/master/workflows/dnaseq/transform.cwl#L632 ).

Issue filed as https://github.com/rabix/bunny/issues/197
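
For reference, here is a minimal sketch of the pattern in question (hypothetical file names and parameter ids, not the exact GDC step): a workflow step input whose valueFrom references the basename of the File arriving on its source, which requires StepInputExpressionRequirement and evaluation of self by the engine.

requirements:
  - class: StepInputExpressionRequirement
steps:
  picard_mergesamfiles:
    run: picard_mergesamfiles.cwl      # hypothetical path
    in:
      INPUT: bam
      OUTPUT:
        source: bam                    # "self" below is the File delivered by this source
        valueFrom: $(self.basename)    # the engine must evaluate this expression against self
    out: [MERGED_OUTPUT]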

jeremiahsavage commented 7 years ago

The samtools_idxstats_to_sqlite tool depends upon a literal valueFrom passed to the metrics subworkflow: https://github.com/NCI-GDC/gdc-dnaseq-cwl/blob/master/workflows/dnaseq/transform.cwl#L603

Issue filed as https://github.com/rabix/bunny/issues/202
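
And a sketch of this second pattern (made-up ids and a hypothetical subworkflow path): a step input with no source at all, only a literal valueFrom handed to the subworkflow, which again relies on StepInputExpressionRequirement.

requirements:
  - class: StepInputExpressionRequirement
  - class: SubworkflowFeatureRequirement
steps:
  metrics:
    run: metrics.cwl                      # hypothetical subworkflow
    in:
      input_state:
        valueFrom: "some_literal_value"   # literal value, no source field
      bam: harmonized_bam
    out: [sqlite]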

StarvingMarvin commented 7 years ago

I've tried to run the GDC workflow (transform.cwl) with code from the feature/gdc branch, but got the following error:

zcat SRR622461_2.fq.gz | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > SRR622461_2.fq.gz: not found

The good news is that I got the same error with cwltool :) but I'm not sure what I should do to run this workflow successfully.

jeremiahsavage commented 7 years ago

Thanks for testing.

I currently don't see that branch on our repo: https://github.com/NCI-GDC/gdc-dnaseq-cwl/tree/feature/gdc (gives 404)

The master branch should run transform.cwl without error using cwltool https://github.com/NCI-GDC/gdc-dnaseq-cwl/tree/master

If you are familiar with dockstore, that might be an easier way to run a workflow from the master branch: https://dockstore.org/workflows/NCI-GDC/gdc-dnaseq-cwl/GDC_DNASeq

StarvingMarvin commented 7 years ago

It's a bunny branch with some GDC-related fixes.

jeremiahsavage commented 7 years ago

Ok. If you've found some fixes needed in the CWL, I'd certainly like to look at them for possible incorporation.

For now, the command below is tested to work with cwltool version 1.0.20170309164828:

mkdir tmp cache
nohup cwltool --debug --tmpdir-prefix tmp/ --cachedir cache/ --custom-net host https://raw.githubusercontent.com/NCI-GDC/gdc-dnaseq-cwl/master/workflows/dnaseq/etl_http.cwl https://raw.githubusercontent.com/NCI-GDC/gdc-dnaseq-cwl/master/workflows/dnaseq/etl_http_NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.json &

StarvingMarvin commented 7 years ago

@jeremiahsavage I tried running it that way too, but it failed with the same error. Will continue debugging...

jeremiahsavage commented 7 years ago

@StarvingMarvin I reproduced your error. It was the same error I encountered here: https://github.com/rabix/bunny/issues/140

Which was fixed in the GDC CWL in this pull request: https://github.com/NCI-GDC/gdc-dnaseq-cwl/pull/45

It looks like cwltool has become stricter about properly catching this error, while it used to be more lenient. A current checkout from master should fix this.

jeremiahsavage commented 7 years ago

bunny hangs when attempting to merge BAMs from multiple arrays. Issue filed as https://github.com/rabix/bunny/issues/215

StarvingMarvin commented 7 years ago

Yup. For some reason I had a gdc pipeline directory without .git, and it was inside another directory that was a git repo, so when I did git pull I was updating the wrong thing... Anyhow, I got to the point of picard failing because it ran out of memory on my laptop, so I'll run it on another machine and then try again with bunny...

simonovic86 commented 7 years ago

I've implemented support for scattering over empty arrays and now Bunny executes the GDC workflow. I'll merge changes from bug/empty-list-scatter into the develop branch ASAP.
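
As a quick illustration of what the fix covers (generic ids, not the actual GDC step): when the array feeding a scattered step input is empty, the engine should still complete the step and emit an empty array for each of its outputs instead of stalling.

requirements:
  - class: ScatterFeatureRequirement
steps:
  merge_readgroup_bams:
    run: merge.cwl                 # hypothetical tool
    scatter: bam                   # if readgroup_bams is [], this should yield [] for merged_bam
    in:
      bam: readgroup_bams
    out: [merged_bam]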

Here are the results:

{
  "harmonized_bam" : {
    "basename" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "checksum" : "sha1$57ec46a349304aa38fcd9665ca8b3ac07f988c61",
    "class" : "File",
    "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates",
    "format" : "edam:format_2572",
    "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "nameext" : "bam",
    "nameroot" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn",
    "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "secondaryFiles" : [ {
      "basename" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "checksum" : "sha1$c7892e603ed183288df74680ea8451a2e82502d1",
      "class" : "File",
      "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates",
      "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "nameext" : "bai",
      "nameroot" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn",
      "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "secondaryFiles" : [ ],
      "size" : 4280832
    } ],
    "size" : 314389542
  },
  "sqlite" : {
    "basename" : "123e4567-e89b-12d3-a456-426655440000.db",
    "checksum" : "sha1$a90f05691220aebec4ce7c05fafaa0271567ccc2",
    "class" : "File",
    "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite",
    "format" : "edam:format_3621",
    "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite/123e4567-e89b-12d3-a456-426655440000.db",
    "nameext" : "db",
    "nameroot" : "123e4567-e89b-12d3-a456-426655440000",
    "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite/123e4567-e89b-12d3-a456-426655440000.db",
    "secondaryFiles" : [ ],
    "size" : 2518016
  }
}

StarvingMarvin commented 7 years ago

One minor issue we had to fix: NCI-GDC/gdc-dnaseq-cwl#51

jeremiahsavage commented 7 years ago

@simonovic86 Fantastic! That fix also enabled me to run the transform to completion. I ran the bunny-generated metrics/sqlite file through a validator, which showed that the BAM file contains the same alignments as the one generated with cwltool. It's a highly parallel engine. Thank you.

I'm trying next to run our internal ETL process, which wraps the transform with curl and aws cp commands. But it looks like, by default, I can't get network traffic out of the docker containers launched by bunny (Could not resolve host errors). With cwltool, I can add --custom-net host to the command line, which is converted to a docker run --net=host parameter. I've looked at the config options https://github.com/rabix/bunny/blob/master/rabix-backend-local/config/core.properties but don't see one that sets docker to use host networking. Is there a way to set that? Thanks.

StarvingMarvin commented 7 years ago

@jeremiahsavage bunny doesn't do anything to prevent network connectivity from a container. Let's take a step back and figure out what the actual problem is. I would dare to guess that the thing you are trying to fetch is on an internal network. If so, then either the docker daemon is configured to provide containers with some public DNS server, or maybe a resolv.conf file ended up inside the image and is messing things up.

If I'm understanding docker networking correctly, the only place where using the host network should matter is when binding ports. In that case, relying on command line options for it to work would compromise app portability, so I'm a bit disappointed that the reference implementation does this. The proper way would be to introduce a new requirement for it, or to extend DockerRequirement.
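
To make that idea concrete, a purely hypothetical sketch of what declaring this in the tool description could look like (the networkAccess field below does not exist in CWL 1.0 or in bunny; it only illustrates the kind of requirement/extension being proposed):

hints:
  - class: DockerRequirement
    dockerPull: example/curl:latest    # placeholder image
    networkAccess: host                # hypothetical extension field, not part of the spec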

jeremiahsavage commented 7 years ago

@StarvingMarvin The docker image in this case is minimal: https://github.com/NCI-GDC/curl_docker/blob/master/Dockerfile

I agree that specifying this in CWL would probably be the best way to go. For now, I am able to get this tool to work with the following change to bunny: https://github.com/jeremiahsavage/bunny/commit/4cfec1200bb30feaed1f0f2a712e8b25e5fb4a67

As documented at https://docs.docker.com/engine/userguide/networking/, I think "bridge" is the default mode.

StarvingMarvin commented 7 years ago

@jeremiahsavage I'm still confused about what it is about your networking situation that demands the host network. What are you trying to fetch that can't be accessed through a bridge network? Which of those two commands fails, curl or aws cp? What is the DNS setting of the docker daemon?

jeremiahsavage commented 7 years ago

@StarvingMarvin We have had to use host networking instead of bridge networking ever since switching to https://apt.dockerproject.org/repo/ instead of the older docker 1.6 in Ubuntu's Trusty. It seems to be either a regression in that build, or a change made in docker after version 1.6, that is preventing us from using bridge.

curl is the command that reliably fails. I believe aws will fail as well, but there is a separate issue there that I am narrowing down.