secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
105 stars 36 forks source link

phASER duplicates VCF meta-information FORMAT lines in output VCF #52

Closed tinyheero closed 5 years ago

tinyheero commented 5 years ago

The phaser.py script appears to duplicate the VCF meta-information FORMAT lines in the output VCF file. For instance, if I had the following test.vcf.gz VCF file:

##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  FORMAT  amsim_pcawg_phaseable
chr1  877831  . T C 40  PASS  mut_type=germline GT  1/1
chr1  877868  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881019  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881025  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881602  . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45812 . C A 40  PASS  mut_type=somatic  GT  0/1
chr2  45862 . G T 40  PASS  mut_type=somatic  GT  0/1
chr2  45875 . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45895 . A G 40  PASS  mut_type=germline GT  1/1
chr2  45946 . C T 40  PASS  mut_type=somatic  GT  0/1

Running the following command:

python lib/phaser/phaser/phaser.py \
        --vcf test.vcf.gz \
        --bam data/bams/amsim_pcawg_phaseable.sorted.cleaned.dups_removed.bam \
        --paired_end 1 \
        --mapq 20 \
        --baseq 20 \
        --sample amsim_pcawg_phaseable \
        --include_indels 1 \
        --blacklist data/phaser_test_data/hg19_hla.bed \
        --haplo_count_blacklist data/phaser_test_data/hg19_haplo_count_blacklist.bed \
        --threads 1 \
        --o test/phaser

Produces the output test/phaser.vcf.gz VCF file with the GT meta-information line duplicated.

##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PG,Number=1,Type=String,Description="phASER Local Genotype">
##FORMAT=<ID=PB,Number=1,Type=String,Description="phASER Local Block">
##FORMAT=<ID=PI,Number=1,Type=String,Description="phASER Local Block Index (unique for each block)">
##FORMAT=<ID=PM,Number=1,Type=String,Description="phASER Local Block Maximum Variant MAF">
##FORMAT=<ID=PW,Number=1,Type=String,Description="phASER Genome Wide Genotype">
##FORMAT=<ID=PC,Number=1,Type=String,Description="phASER Genome Wide Confidence">
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  FORMAT  amsim_pcawg_phaseable
chr1  877831  . T C 40  PASS  mut_type=germline GT:PG:PB:PI:PW:PC:PM  1/1:1/1:.:.:1/1:.:.
chr1  877868  . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0/1:.:.:0/1:.:.
chr1  881019  . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1  881025  . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1  881602  . C T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0/1:.:.:0/1:.:.
chr2  45812 . C A 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0|1:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2  45862 . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2  45875 . C T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2  45895 . A G 40  PASS  mut_type=germline GT:PG:PB:PI:PW:PC:PM  1/1:1/1:.:.:1/1:.:.
chr2  45946 . C T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0

This causes issues in downstream VCF processing tools (e.g. vcfR) that expect there to be no duplicate meta-information lines.

secastel commented 5 years ago

Thanks for bring ing this up. I was able to replicate the bug and have fixed the issue in the latest phASER update.