The phaser.py script appears to duplicate the VCF meta-information FORMAT lines in the output VCF file. For instance, if I had the following test.vcf.gz VCF file:
##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT amsim_pcawg_phaseable
chr1 877831 . T C 40 PASS mut_type=germline GT 1/1
chr1 877868 . G T 40 PASS mut_type=somatic GT 0/1
chr1 881019 . G T 40 PASS mut_type=somatic GT 0/1
chr1 881025 . G T 40 PASS mut_type=somatic GT 0/1
chr1 881602 . C T 40 PASS mut_type=somatic GT 0/1
chr2 45812 . C A 40 PASS mut_type=somatic GT 0/1
chr2 45862 . G T 40 PASS mut_type=somatic GT 0/1
chr2 45875 . C T 40 PASS mut_type=somatic GT 0/1
chr2 45895 . A G 40 PASS mut_type=germline GT 1/1
chr2 45946 . C T 40 PASS mut_type=somatic GT 0/1
Produces the output test/phaser.vcf.gz VCF file with the GT meta-information line duplicated.
##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PG,Number=1,Type=String,Description="phASER Local Genotype">
##FORMAT=<ID=PB,Number=1,Type=String,Description="phASER Local Block">
##FORMAT=<ID=PI,Number=1,Type=String,Description="phASER Local Block Index (unique for each block)">
##FORMAT=<ID=PM,Number=1,Type=String,Description="phASER Local Block Maximum Variant MAF">
##FORMAT=<ID=PW,Number=1,Type=String,Description="phASER Genome Wide Genotype">
##FORMAT=<ID=PC,Number=1,Type=String,Description="phASER Genome Wide Confidence">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT amsim_pcawg_phaseable
chr1 877831 . T C 40 PASS mut_type=germline GT:PG:PB:PI:PW:PC:PM 1/1:1/1:.:.:1/1:.:.
chr1 877868 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0/1:.:.:0/1:.:.
chr1 881019 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1 881025 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1 881602 . C T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0/1:.:.:0/1:.:.
chr2 45812 . C A 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0|1:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2 45862 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2 45875 . C T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2 45895 . A G 40 PASS mut_type=germline GT:PG:PB:PI:PW:PC:PM 1/1:1/1:.:.:1/1:.:.
chr2 45946 . C T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
This causes issues in downstream VCF processing tools (e.g. vcfR) that expect there to be no duplicate meta-information lines.
The
phaser.py
script appears to duplicate the VCF meta-information FORMAT lines in the output VCF file. For instance, if I had the followingtest.vcf.gz
VCF file:Running the following command:
Produces the output
test/phaser.vcf.gz
VCF file with the GT meta-information line duplicated.This causes issues in downstream VCF processing tools (e.g. vcfR) that expect there to be no duplicate meta-information lines.