Open · sinisa88 opened this issue 7 years ago
The merge_sqlite tool of the GDC DNASeq workflow depends upon this feature: https://github.com/rabix/bunny/issues/193
The picard_mergesamfiles tool depends upon using $(self.basename) in the workflow (https://github.com/NCI-GDC/gdc-dnaseq-cwl/blob/master/workflows/dnaseq/transform.cwl#L632).
Issue filed as https://github.com/rabix/bunny/issues/197
The samtools_idxstats_to_sqlite tool depends upon a literal valueFrom passed to the metrics subworkflow: https://github.com/NCI-GDC/gdc-dnaseq-cwl/blob/master/workflows/dnaseq/transform.cwl#L603
Issue filed as https://github.com/rabix/bunny/issues/202
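As background on the $(self.basename) and valueFrom patterns mentioned above, a step input expression in CWL v1.0 looks roughly like the following. This is a hypothetical minimal sketch, not the actual GDC step; the input and source names are illustrative:

```yaml
requirements:
  # Needed for valueFrom on a workflow step input in CWL v1.0
  - class: StepInputExpressionRequirement

steps:
  picard_mergesamfiles:
    run: picard_mergesamfiles.cwl
    in:
      INPUT: bam
      OUTPUT:
        source: bam
        # `self` is the value delivered by `source`; .basename is the
        # file name stripped of its directory path
        valueFrom: $(self.basename)
    out: [MERGED_OUTPUT]
```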
I've tried to run the GDC (transform.cwl) workflow with code from the feature/gdc branch, but got the following error:
zcat SRR622461_2.fq.gz | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > SRR622461_2.fq.gz: not found
The good news is that I got the same error with cwltool :) but I'm not sure what I should do to run this workflow successfully.
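For what it's worth, the trailing ": not found" on that whole pipeline is consistent with general shell behavior: when a pipeline string is handed to the OS as a single command name, with no shell interpreting the pipes and redirection, the system looks for an executable literally named after the entire string and fails. This sketch demonstrates the failure mode in isolation; it is not a claim about what the GDC tool definition actually does:

```shell
# Passing a pipeline as one token: the OS searches for a command literally
# named "echo hi | tr a-z A-Z" and fails (stderr suppressed here).
"echo hi | tr a-z A-Z" 2>/dev/null || echo "pipeline as one token: not found"

# Running the same string through a shell interprets the pipe correctly.
sh -c "echo hi | tr a-z A-Z"   # prints "HI"
```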
Thanks for testing.
I currently don't see that branch on our repo: https://github.com/NCI-GDC/gdc-dnaseq-cwl/tree/feature/gdc (gives 404)
The master branch should run transform.cwl without error using cwltool https://github.com/NCI-GDC/gdc-dnaseq-cwl/tree/master
If you are familiar with dockstore, that might be an easier way to run a workflow from the master branch: https://dockstore.org/workflows/NCI-GDC/gdc-dnaseq-cwl/GDC_DNASeq
It's a bunny branch with some gdc-related fixes.
Ok. If you've found some fixes needed in the cwl, I'd certainly like to look at them for possible incorporation.
For now, the command below is tested to work with cwltool version 1.0.20170309164828:
mkdir tmp cache
nohup cwltool --debug --tmpdir-prefix tmp/ --cachedir cache/ --custom-net host https://raw.githubusercontent.com/NCI-GDC/gdc-dnaseq-cwl/master/workflows/dnaseq/etl_http.cwl https://raw.githubusercontent.com/NCI-GDC/gdc-dnaseq-cwl/master/workflows/dnaseq/etl_http_NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.json &
@jeremiahsavage I tried running it that way too, but failed with same error. Will continue debugging...
@StarvingMarvin I reproduced your error. It was the same error I encountered here: https://github.com/rabix/bunny/issues/140
Which was fixed in the gdc cwl in this commit: https://github.com/NCI-GDC/gdc-dnaseq-cwl/pull/45
It looks like cwltool has become more strict in properly catching this error, while it used to be more lenient. A current checkout from master should fix this.
bunny hangs when attempting to merge BAMs from multiple arrays. Issue filed as https://github.com/rabix/bunny/issues/215
Yup. For some reason I had a gdc pipeline directory without .git, and it was in another dir that was a git repo, so when I did git pull, I was updating the wrong thing... Anyhow, I got to the point of picard failing because it ran out of memory on my laptop, so I'll run it on another machine and then I'll try again with bunny...
I've implemented support for scattering over empty arrays and now Bunny executes the GDC workflow. I'll merge changes from bug/empty-list-scatter into the develop branch ASAP.
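The fixed behavior can be illustrated in plain shell terms (a sketch of the scatter semantics, not of Bunny's implementation): scattering a step over an empty input array should simply yield an empty output array rather than hanging.

```shell
# Simulate scattering a "step" over an input array; with zero elements the
# loop body never runs and the output array stays empty.
scatter_inputs=""      # empty input array
scatter_outputs=""
for item in $scatter_inputs; do
  scatter_outputs="$scatter_outputs processed-$item"
done
echo "scatter outputs: [$scatter_outputs]"   # prints: scatter outputs: []
```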
Here are the results:
{
  "harmonized_bam" : {
    "basename" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "checksum" : "sha1$57ec46a349304aa38fcd9665ca8b3ac07f988c61",
    "class" : "File",
    "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates",
    "format" : "edam:format_2572",
    "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "nameext" : "bam",
    "nameroot" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn",
    "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "secondaryFiles" : [ {
      "basename" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "checksum" : "sha1$c7892e603ed183288df74680ea8451a2e82502d1",
      "class" : "File",
      "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates",
      "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "nameext" : "bai",
      "nameroot" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn",
      "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "secondaryFiles" : [ ],
      "size" : 4280832
    } ],
    "size" : 314389542
  },
  "sqlite" : {
    "basename" : "123e4567-e89b-12d3-a456-426655440000.db",
    "checksum" : "sha1$a90f05691220aebec4ce7c05fafaa0271567ccc2",
    "class" : "File",
    "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite",
    "format" : "edam:format_3621",
    "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite/123e4567-e89b-12d3-a456-426655440000.db",
    "nameext" : "db",
    "nameroot" : "123e4567-e89b-12d3-a456-426655440000",
    "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite/123e4567-e89b-12d3-a456-426655440000.db",
    "secondaryFiles" : [ ],
    "size" : 2518016
  }
}
One minor issue we had to fix: NCI-GDC/gdc-dnaseq-cwl#51
@simonovic86 Fantastic! That fix also enabled me to run the transform to completion. I ran the bunny-generated metrics/sqlite file through a validator, which showed the BAM file contains the same alignments as the one generated with cwltool. It's a highly parallel engine. Thank you.
I'm trying next to run our internal ETL process, which wraps the transform with curl and aws cp commands. But it looks like, by default, I can't get network traffic out of the docker containers launched by bunny (Could not resolve host errors). With cwltool, I can add --custom-net host to the command line, which is converted to a docker run --net=host parameter. I've looked in the config options https://github.com/rabix/bunny/blob/master/rabix-backend-local/config/core.properties but don't see one to set docker to use host networking. Is there a way to set that? Thanks.
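To make the cwltool/docker equivalence concrete, this sketch only shows how a host-networking option would be spliced into a docker command line; the variable names are illustrative and are not bunny configuration keys, and the image name is a placeholder:

```shell
# Build a docker argument list, adding --net=host when host networking is
# requested (the equivalent of cwltool's --custom-net host; docker's
# default network mode is "bridge").
NET_MODE="host"
DOCKER_ARGS="run --rm"
if [ "$NET_MODE" = "host" ]; then
  DOCKER_ARGS="$DOCKER_ARGS --net=host"
fi
echo "docker $DOCKER_ARGS some/curl-image curl -s https://example.com"
```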
@jeremiahsavage bunny doesn't do anything to prevent network connectivity from the container. Let's take a step back and figure out what the actual problem is. I would dare to guess that the thing you are trying to fetch is on an internal network. If so, then either the docker daemon is configured to give containers some public DNS server, or maybe a resolv.conf file ended up inside the image and is messing things up.
If I'm understanding docker networking correctly, the only place where using the host network should matter is when binding ports. In that case, relying on command-line options for it to work would compromise app portability, so I'm a bit disappointed that the reference implementation does this. The proper way would be to introduce a new requirement for it, or to extend DockerRequirement.
@StarvingMarvin The docker image in this case is minimal: https://github.com/NCI-GDC/curl_docker/blob/master/Dockerfile
I agree specifying this in cwl would probably be the best way to go. For now, I am able to get this tool to work with the following change to bunny: https://github.com/jeremiahsavage/bunny/commit/4cfec1200bb30feaed1f0f2a712e8b25e5fb4a67
As documented at https://docs.docker.com/engine/userguide/networking/, I think "bridge" is the default mode.
@jeremiahsavage I'm still confused about what it is about your networking situation that demands the host network. What are you trying to fetch that can't be accessed through a bridge network? Which of those two commands fails, curl or aws cp? And what is the DNS setting of the docker daemon?
@StarvingMarvin We have had to use host networking instead of bridge networking ever since switching to https://apt.dockerproject.org/repo/ instead of the older 1.6 docker in Ubuntu's Trusty. It seems a regression in that build, or a change made in docker after version 1.6, is preventing us from using bridge networking.
curl is the command that reliably fails. I believe aws will fail as well, but there is a separate issue there that I am narrowing down.
Umbrella issue for specific issues regarding support for the GDC workflow.