hammerlab / biokepi

Bioinformatics Ketrew Pipelines
Apache License 2.0

Add varlens #119

Open iskandr opened 8 years ago

iskandr commented 8 years ago

Want to run this command from within Biokepi pipelines (to merge multiple VCFs):

varlens-variants \
  --variants MOSAIK_Mutect.vcf \
  --variants MOSAIK_Strelka.vcf \
  --variants BWA_Mutect.vcf \
  --variants BWA_Strelka.vcf \
  --reads /path/to/DNA_MOSAIK.bam \
  --reads /path/to/DNA_BWA.bam \
  --reads /path/to/RNA_HISAT.bam \
  --include-variant-source \
  --include-read-evidence \
  --include-gene \
  --include-effect

This will merge the variants found in the 4 VCFs and annotate each one with its read evidence support from the 2 originating DNA alignments and from an alignment of the RNA-seq reads.
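
For a pipeline that has the VCF and BAM paths as lists, the command could be assembled programmatically. This is only a sketch: `build_varlens_command` is a hypothetical helper, and the flags are copied from the request in this comment rather than checked against a released varlens CLI, so they may differ from what the tool finally accepts.

```python
import shlex


def build_varlens_command(vcfs, bams):
    """Assemble the argument list for the invocation requested in this issue.

    Hypothetical helper: the flag names are taken verbatim from the issue
    text and are an assumption about the final varlens-variants interface.
    """
    cmd = ["varlens-variants"]
    for vcf in vcfs:
        cmd += ["--variants", vcf]  # one --variants flag per input VCF
    for bam in bams:
        cmd += ["--reads", bam]  # one --reads flag per alignment
    cmd += [
        "--include-variant-source",
        "--include-read-evidence",
        "--include-gene",
        "--include-effect",
    ]
    return cmd


command = build_varlens_command(
    vcfs=[
        "MOSAIK_Mutect.vcf",
        "MOSAIK_Strelka.vcf",
        "BWA_Mutect.vcf",
        "BWA_Strelka.vcf",
    ],
    bams=[
        "/path/to/DNA_MOSAIK.bam",
        "/path/to/DNA_BWA.bam",
        "/path/to/RNA_HISAT.bam",
    ],
)

# Print a copy-pasteable shell line; the list form could instead be handed
# to subprocess.run(command, check=True) once varlens is installed.
print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting problems when paths contain spaces.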

iskandr commented 8 years ago

Another option: --context 50 will add a surrounding context of the 50 nucleotides around a variant (can be useful for filtering homopolymer regions).
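
The homopolymer filtering mentioned here would happen downstream of varlens, on the context sequences it reports. A minimal sketch, assuming the context is available as two flanking strings (5' and 3'); the function names and the run-length threshold of 5 are illustrative assumptions, not part of varlens.

```python
from itertools import groupby


def longest_homopolymer_run(sequence):
    """Return the length of the longest run of one repeated nucleotide."""
    return max(
        (len(list(group)) for _, group in groupby(sequence.upper())),
        default=0,  # an empty sequence has no runs
    )


def in_homopolymer_region(context_5_prime, context_3_prime, min_run=5):
    """Flag a variant whose flanking context contains a long homopolymer.

    min_run=5 is an arbitrary illustrative threshold, not a recommendation.
    """
    longest = max(
        longest_homopolymer_run(context_5_prime),
        longest_homopolymer_run(context_3_prime),
    )
    return longest >= min_run


print(in_homopolymer_region("ACGTACGT", "TTTTTTAC"))  # run of six T's
print(in_homopolymer_region("ACGTACGT", "TGCATGCA"))  # no long run
```

This checks each flank separately, so a run that spans the variant position itself would need the two flanks and the allele joined before scanning.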

smondet commented 8 years ago

Does https://github.com/hammerlab/varlens/issues/12 block this?

Is there a released version with some installation/usage documentation?

timodonnell commented 8 years ago

Good call, I can look into this. There is not currently a released and documented version but I should make one. (Currently have to run pip install . from the checkout, and docs are just what you get from running with -h.) I should be able to do this next week, but if we end up actively blocked on this please let me know.

timodonnell commented 8 years ago

Varlens has been revamped and documented and should hopefully be more usable now. I haven't done a pip release but will do that soon. See the README for basic examples: https://github.com/hammerlab/varlens. Each tool should also have reasonable help now.

Here's an example command that does what is asked for in this ticket:

$ varlens-variants \
     test/data/CELSR1/vcfs/vcf_1.vcf \
     test/data/CELSR1/vcfs/vcf_2.vcf \
  --reads \
       test/data/CELSR1/bams/bam_1.bam \
       test/data/CELSR1/bams/bam_2.bam \
       test/data/CELSR1/bams/bam_3.bam \
  --include-read-evidence \
  --include-gene \
  --include-effect \
  --include-context \
  --reference ~/sinai/data/human_g1k_v37_reformatted.fasta

Output:

genome,contig,interbase_start,interbase_end,ref,alt,sources,effect,gene,context_5_prime,context_3_prime,context_mutation,1.bam_count_num_alt,1.bam_count_num_ref,1.bam_count_total_depth,2.bam_count_num_alt,2.bam_count_num_ref,2.bam_count_total_depth,3.bam_count_num_alt,3.bam_count_num_ref,3.bam_count_total_depth
GRCh37,22,21829554,21829555,T,G,1.vcf,non-coding-transcript,PI4KAP2,CCGTGTCCAACATGA,AGTGACCAGGGAGAC,T>G,0,0,0,0,0,0,0,0,0
GRCh37,22,46931059,46931060,A,C,1.vcf,p.S670A,CELSR1,CCCCCCATGAGCTCC,CCACCAGCGTGTCCA,T>G,0,222,329,0,93,93,0,279,323
GRCh37,22,46931061,46931062,G,A,1.vcf 2.vcf,p.S669F,CELSR1,CGCCCCCCATGAGCT,CTCCACCAGCGTGTC,C>T,0,330,330,2,91,93,1,321,324
GRCh37,22,50636217,50636218,A,C,1.vcf,intronic,TRABD,GCAGCCCCGCAGGGA,GGGCAACGGGCTGGG,T>G,0,0,0,0,0,0,0,0,0
GRCh37,22,50875932,50875933,A,C,1.vcf,splice-acceptor,PPP6R2,TAGTCAGAGAAGGCC,GGGAGGGAGGGAGGG,T>G,0,0,0,0,0,0,0,0,0
GRCh37,22,45309892,45309893,T,G,2.vcf,p.T214P,PHF21B,ATGGGGAGGGAGGGG,GAGGGGAAGAGAGGA,T>G,0,0,0,0,0,0,0,0,0
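
Since the output is plain CSV, it can be post-processed with standard tools. As a rough sketch, the following keeps only variants with at least one alt-supporting read in any BAM. It embeds the header and two rows copied from the output above; `alt_read_support` is a hypothetical helper, and it assumes the per-BAM alt counts always live in columns ending in `_count_num_alt`, as they do here.

```python
import csv
import io

# Header and two rows copied from the output above.
OUTPUT = """\
genome,contig,interbase_start,interbase_end,ref,alt,sources,effect,gene,context_5_prime,context_3_prime,context_mutation,1.bam_count_num_alt,1.bam_count_num_ref,1.bam_count_total_depth,2.bam_count_num_alt,2.bam_count_num_ref,2.bam_count_total_depth,3.bam_count_num_alt,3.bam_count_num_ref,3.bam_count_total_depth
GRCh37,22,46931059,46931060,A,C,1.vcf,p.S670A,CELSR1,CCCCCCATGAGCTCC,CCACCAGCGTGTCCA,T>G,0,222,329,0,93,93,0,279,323
GRCh37,22,46931061,46931062,G,A,1.vcf 2.vcf,p.S669F,CELSR1,CGCCCCCCATGAGCT,CTCCACCAGCGTGTC,C>T,0,330,330,2,91,93,1,321,324
"""


def alt_read_support(row):
    """Sum the alt-supporting read counts over every per-BAM column."""
    return sum(
        int(value)
        for column, value in row.items()
        if column.endswith("_count_num_alt")
    )


rows = list(csv.DictReader(io.StringIO(OUTPUT)))
supported = [row for row in rows if alt_read_support(row) > 0]

for row in supported:
    print(row["gene"], row["effect"], row["sources"], alt_read_support(row))
# prints: CELSR1 p.S669F 1.vcf 2.vcf 3
```

For a real run, `io.StringIO(OUTPUT)` would be replaced by an open file handle on the CSV that varlens wrote.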

timodonnell commented 8 years ago

It's up on pypi now: https://pypi.python.org/pypi/varlens