jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html
Other
400 stars 200 forks source link

Reading/writing vcf breaks VCF header #210

Closed jrderuiter closed 8 years ago

jrderuiter commented 8 years ago

I am trying to use PyVCF to write a simple script that filters VCF records based on certain criteria. However, this script seems to break the header of my VCF file, as bcftools no longer parses the resulting VCF, but has no problem with the VCF that is used as input.

def main(args):
    in_path = args.input
    out_path = args.output or path.splitext(in_path)[0] + '.som.vcf'

    with open(in_path, 'r') as vcf_in:
        reader = vcf.Reader(vcf_in)
        records = iter(reader)

        # Filter records here.

        # Write records to output VCF.
        with open(out_path, 'w') as vcf_out:
            writer = vcf.Writer(vcf_out, template=reader)
            for record in records:
                writer.write_record(record)

Input vcf header: https://gist.github.com/jrderuiter/ca373a8d4b4652379244 Output vcf header: https://gist.github.com/jrderuiter/e1a4b2d5e437f7652e8c

Bcftools errors on output vcf:

Could not parse the header line: "##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-0-g7e26428,Date="Wed Oct 07 09:11:47 CEST 2015",Epoch=1444201907400,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=./references/homo_sapiens/b37/dna/bwa/human_g1k_v37_decoy.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=./out/joint_calls.snps.vcf) mask=(RodBinding name= source=UNBOUND) out=./out/joint_calls.snps.filt.vcf.tmp/joint_calls.snps.filt.vcf filterExpression=[QD >
##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive)
##GVCFBlock1-2=minGQ=1(inclusive),maxGQ=2(exclusive)
##GVCFBlock10-11=minGQ=10(inclusive),maxGQ=11(exclusive)
##GVCFBlock11-12=minGQ=11(inclusive),maxGQ=12(exclusive)
##GVCFBlock12-13=minGQ=12(inclusive),maxGQ=13(exclusive)
##GVCFBlock13-14=minGQ=13(inclusive),maxGQ=14(exclusive)
##GVCFBlock14-15=minGQ=14(inclusive),maxGQ=15(exclusive)
##GVCFBlock15-16=minGQ=15(inclusive),maxGQ=16(exclusive)
##GVCFBlock16-17=minGQ=16(inclusive),maxGQ=17(exclusive)
##GVCFBlock17-18=minGQ=17(inclusive),maxGQ=18(exclusive)
##GVCFBlock18-19=minGQ=18(inclusive),maxGQ=19(exclusive)
##GVCFBlock19-20=minGQ=19(inclusive),maxGQ=20(exclusive)
##GVCFBlock2-3=minGQ=2(inclusive),maxGQ=3(exclusive)
##GVCFBlock20-21=minGQ=20(inclusive),maxGQ=21(exclusive)
##GVCFBlock21-22=minGQ=21(inclusive),maxGQ=22(exclusive)
##GVCFBlock22-23=minGQ=22(inclusive),maxGQ=23(exclusive)
##GVCFBlock23-24=minGQ=23(inclusive),maxGQ=24(exclusive)
##GVCFBlock24-25=minGQ=24(inclusive),maxGQ=25(exclusive)
##GVCFBlock25-26=minGQ=25(inclusive),maxGQ=26(exclusive)
##GVCFBlock26-27=minGQ=26(inclusive),maxGQ=27(exclusive)
##GVCFBlock27-28=minGQ=27(inclusive),maxGQ=28(exclusive)
##GVCFBlock28-29=minGQ=28(inclusive),maxGQ=29(exclusive)
##GVCFBlock29-30=minGQ=29(inclusive),maxGQ=30(exclusive)
##GVCFBlock3-4=minGQ=3(inclusive),maxGQ=4(exclusive)
##GVCFBlock30-31=minGQ=30(inclusive),maxGQ=31(exclusive)
##GVCFBlock31-32=minGQ=31(inclusive),maxGQ=32(exclusive)
##GVCFBlock32-33=minGQ=32(inclusive),maxGQ=33(exclusive)
##GVCFBlock33-34=minGQ=33(inclusive),maxGQ=34(exclusive)
##GVCFBlock34-35=minGQ=34(inclusive),maxGQ=35(exclusive)
##GVCFBlock35-36=minGQ=35(inclusive),maxGQ=36(exclusive)
##GVCFBlock36-37=minGQ=36(inclusive),maxGQ=37(exclusive)
##GVCFBlock37-38=minGQ=37(inclusive),maxGQ=38(exclusive)
##GVCFBlock38-39=minGQ=38(inclusive),maxGQ=39(exclusive)
##GVCFBlock39-40=minGQ=39(inclusive),maxGQ=40(exclusive)
##GVCFBlock4-5=minGQ=4(inclusive),maxGQ=5(exclusive)
##GVCFBlock40-41=minGQ=40(inclusive),maxGQ=41(exclusive)
##GVCFBlock41-42=minGQ=41(inclusive),maxGQ=42(exclusive)
##GVCFBlock42-43=minGQ=42(inclusive),maxGQ=43(exclusive)
##GVCFBlock43-44=minGQ=43(inclusive),maxGQ=44(exclusive)
##GVCFBlock44-45=minGQ=44(inclusive),maxGQ=45(exclusive)
##GVCFBlock45-46=minGQ=45(inclusive),maxGQ=46(exclusive)
##GVCFBlock46-47=minGQ=46(inclusive),maxGQ=47(exclusive)
##GVCFBlock47-48=minGQ=47(inclusive),maxGQ=48(exclusive)
##GVCFBlock48-49=minGQ=48(inclusive),maxGQ=49(exclusive)
##GVCFBlock49-50=minGQ=49(inclusive),maxGQ=50(exclusive)
##GVCFBlock5-6=minGQ=5(inclusive),maxGQ=6(exclusive)
##GVCFBlock50-51=minGQ=50(inclusive),maxGQ=51(exclusive)
##GVCFBlock51-52=minGQ=51(inclusive),maxGQ=52(exclusive)
##GVCFBlock52-53=minGQ=52(inclusive),maxGQ=53(exclusive)
##GVCFBlock53-54=minGQ=53(inclusive),maxGQ=54(exclusive)
##GVCFBlock54-55=minGQ=54(inclusive),maxGQ=55(exclusive)
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock6-7=minGQ=6(inclusive),maxGQ=7(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock7-8=minGQ=7(inclusive),maxGQ=8(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock8-9=minGQ=8(inclusive),maxGQ=9(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock9-10=minGQ=9(inclusive),maxGQ=10(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-2147483647=minGQ=99(inclusive),maxGQ=2147483647(exclusive)
##reference=file://./references/homo_sapiens/b37/dna/bwa/human_g1k_v37_decoy.fa
##source=SelectVariants
##bcftools_normVersion=1.2+htslib-1.2.1
##bcftools_normCommand=norm -m-both --output ./out/joint_calls.merged.filt.pass.norm_tmp.vcf.tmp/joint_calls.merged.filt.pass.norm_tmp.vcf ./out/joint_calls.merged.filt.pass.vcf
##bcftools_normCommand=norm -f ./references/homo_sapiens/b37/dna/bwa/human_g1k_v37_decoy.fa --output ./out/joint_calls.merged.filt.pass.norm.vcf.tmp/joint_calls.merged.filt.pass.norm.vcf ./out/joint_calls.merged.filt.pass.norm_tmp.vcf
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">"
[W::vcf_parse] contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] INFO 'AC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AF' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AN' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'BaseQRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ClippingRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'DB' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'FS' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'InbreedingCoeff' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MLEAC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MLEAF' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MQ' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MQRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'QD' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ReadPosRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SOR' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ANNOVAR_DATE' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Func.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Gene.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'GeneDetail.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ExonicFunc.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AAChange.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SIFT_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SIFT_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HDIV_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HDIV_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HVAR_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HVAR_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LRT_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LRT_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationTaster_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationTaster_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationAssessor_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationAssessor_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'FATHMM_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'FATHMM_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'RadialSVM_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'RadialSVM_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LR_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LR_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'VEST3_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'CADD_raw' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'CADD_phred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'GERP++_RS' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'phyloP46way_placental' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'phyloP100way_vertebrate' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SiPhy_29way_logOdds' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'snp138NonFlagged' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO '1000g2015aug_eur' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'nci60' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'cosmic70' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'clinvar_20150629' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ALLELE_END' is not defined in the header, assuming Type=String
[W::vcf_parse] contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig '3' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig '4' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig '5' is not defined in the header. (Quick workaround: index the file with tabix.)
...
martijnvermaat commented 8 years ago

Thanks @jrderuiter for reporting this.

A quick diff shows two things:

colordiff -ruN
  <(curl https://gist.githubusercontent.com/jrderuiter/ca373a8d4b4652379244/raw/35c4464de809fade8a46d7eb8f61916bdf51fb31/gistfile1.txt
    | sort)
  <(curl https://gist.githubusercontent.com/jrderuiter/e1a4b2d5e437f7652e8c/raw/50af1092976905ac290a56b46407c757a9bbf2d4/gistfile1.txt
    | sort)
  1. The assembly fields in the contig headers are lost.
  2. The GATKCommandLine.VariantFiltration header is truncated in the CommandLineOptions field at the position where the original contains a < character.

The former is too bad, but not the cause of your issue. The latter is due to PyVCF not properly parsing the header lines (see also #90). I'll try to improve this.

jrderuiter commented 8 years ago

Thank you for looking into this. The two differences you mention are indeed the differences that I also noticed myself. The assembly information would be nice to keep, but is as you say not critical. A fix for 2 would be great though! On Thu 8 Oct 2015 at 11:39 Martijn Vermaat notifications@github.com wrote:

Thanks @jrderuiter https://github.com/jrderuiter for reporting this.

A quick diff shows two things:

colordiff -ruN <(curl https://gist.githubusercontent.com/jrderuiter/ca373a8d4b4652379244/raw/35c4464de809fade8a46d7eb8f61916bdf51fb31/gistfile1.txt | sort) <(curl https://gist.githubusercontent.com/jrderuiter/e1a4b2d5e437f7652e8c/raw/50af1092976905ac290a56b46407c757a9bbf2d4/gistfile1.txt | sort)

  1. The assembly fields in the contig headers are lost.
  2. The GATKCommandLine.VariantFiltration header is truncated in the CommandLineOptions field at the position where the original contains a < character.

The former is too bad, but not the cause of your issue. The latter is due to PyVCF not properly parsing the header lines (see also #90 https://github.com/jamescasbon/PyVCF/issues/90). I'll try to improve this.

— Reply to this email directly or view it on GitHub https://github.com/jamescasbon/PyVCF/issues/210#issuecomment-146474215.

martijnvermaat commented 8 years ago

You could try again with the master branch:

pip install -e git+https://github.com/jamescasbon/PyVCF.git#egg=PyVCF