jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html
Other
402 stars 200 forks source link

IndexError with a VCFv4.1 file | PyVCF-0.6.7 | macosx-10.5-x86_64 #187

Closed deep-introspection closed 10 years ago

deep-introspection commented 10 years ago

Hi!

It seems I am running into a common problem with Headers but can't figure out how to solve it. I am using PyVCF-0.6.7-py2.7-macosx-10.5-x86_64 and try to iterate records on a VCF file and get this error:

In [11]: for record in vcf_reader:
    print record.CHROM + ' ' + str(record.POS) + ' ' + str(record.ALT)
    for n in range(0,13):
        print record.samples[n]['GT']

1 861323 [None]
0/0
0/0
0/0
0/0
0/0
0/0
./.
0/0
0/0
0/0
0/0
0/0
0/0
1 861324 [None]
0/0
0/0
0/0
0/0
0/0
0/0
./.
0/0
0/0
0/0
0/0
0/0
0/0

[ . . . ]

1 879411 [None]
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
1 879412 [None]
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
0/0
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-11-a45b703a98d9> in <module>()
----> 1 for record in vcf_reader:
      2     print record.CHROM + ' ' + str(record.POS) + ' ' + str(record.ALT)
      3     for n in range(0,13):
      4         print record.samples[n]['GT']

/Users/XXX/anaconda/lib/python2.7/site-packages/PyVCF-0.6.7-py2.7-macosx-10.5-x86_64.egg/vcf/parser.pyc in next(self)
    566         else:
    567             filt = filt.split(';')
--> 568         info = self._parse_info(row[7])
    569 
    570         try:

/Users/XXX/anaconda/lib/python2.7/site-packages/PyVCF-0.6.7-py2.7-macosx-10.5-x86_64.egg/vcf/parser.pyc in _parse_info(self, info_str)
    382 
    383             if entry_type == 'Integer':
--> 384                 vals = entry[1].split(',')
    385                 try:
    386                     val = self._map(int, vals)

IndexError: list index out of range

Apparently, the error comes in when PyVCF reads the first line of the VCF that contains an alternative allele (before there is only missing information or "."):

1   879412  .   C   .   1139.16 .   AC=0;AF=0.00;AN=26;CAnc=C;CpG=;DP=263;GRP=1.4;MQ=0;MQ0=0;Map20=1;TS=HP;TSseq=C,C;bSC=934;pSC=0.058  GT:DP:GQ:PL:A:C:G:T:IR:MQ:MQ0   0/0:31:93.30:0,93,1229:0,0:15,16:0,0:0,0:0:0:0  0/0:36:99:0,105,1400:0,0:34,2:0,0:0,0:0:0:0 0/0:9:27.09:0,27,352:0,0:7,2:0,0:0,0:0:0:0  0/0:16:48.14:0,48,610:0,0:7,9:0,0:0,0:0:0:0 0/0:13:36.10:0,36,448:0,0:9,4:0,0:0,0:0:0:0 0/0:15:39.12:0,39,493:0,0:10,5:0,0:0,0:0:0:0    0/0:30:84.24:0,84,1052:0,0:11,19:0,0:0,0:0:0:0  0/0:22:60.18:0,60,741:0,0:7,15:0,0:0,0:0:0:0    0/0:26:75.23:0,75,974:0,0:13,13:0,0:0,0:0:0:0   0/0:22:54.10:0,54,558:0,0:12,9:0,0:0,0:1:0:0    0/0:16:48.15:0,48,615:0,0:9,6:0,0:0,0:0:0:0 0/0:14:42.14:0,42,547:0,0:5,9:0,0:0,0:0:0:0 0/0:13:36.11:0,36,455:0,0:8,5:0,0:0,0:0:0:0
1   879413  rs116279254 G   A   1255.54 .   1000gALT=A;AC=1;AF=0.04;AF1000g=0.01;AFR_AF=0.02;AN=26;CAnc=G;CpG=;DP=262;GRP=-3.42;HRun;MQ=0;MQ0=0;Map20=1;TS=HP;TSseq=G,G;bSC=934;pSC=0.045   GT:DP:GQ:PL:A:C:G:T:IR:MQ:MQ0   0/0:33:96.31:0,96,1267:0,1:0,0:16,16:0,0:0:0:0  0/0:34:96.32:0,96,1280:2,0:0,0:29,3:0,0:0:0:0   0/0:8:24.08:0,24,313:0,0:0,0:6,2:0,0:0:0:0  0/0:17:51.13:0,51,632:0,0:0,0:8,9:0,0:0:0:0 0/1:13:99:228,0,137:6,2:0,0:3,2:0,0:0:0:0   0/0:15:39.10:0,39,481:0,0:0,0:10,5:0,0:0:0:0    0/0:30:84.25:0,84,1037:0,0:0,0:12,18:0,0:0:0:0  0/0:22:57.17:0,57,711:0,0:0,0:7,15:0,0:0:0:0    0/0:26:78.23:0,78,1001:0,0:0,0:13,13:0,0:0:0:0  0/0:21:54.05:0,54,509:0,0:0,0:12,8:0,0:0:0:0    0/0:16:45.14:0,45,566:0,0:0,0:9,6:0,0:0:0:0 0/0:14:39.13:0,39,509:0,0:0,0:5,9:0,0:0:0:0 0/0:13:33.11:0,33,425:0,0:0,0:8,4:0,1:0:0:0

The Header of the VCF is this one:

##fileformat=VCFv4.1
##FORMAT=<ID=A,Number=2,Type=Integer,Description="Number of A bases on forward and reverse strand">
##FORMAT=<ID=C,Number=2,Type=Integer,Description="Number of C bases on forward and reverse strand">
##FORMAT=<ID=G,Number=2,Type=Integer,Description="Number of G bases on forward and reverse strand">
##FORMAT=<ID=T,Number=2,Type=Integer,Description="Number of T bases on forward and reverse strand">
##FORMAT=<ID=IR,Number=1,Type=Integer,Description="Number of reads with InDel starting at this position">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##FORMAT=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Filtered Depth">
##INFO=<ID=AF1000g,Number=1,Type=Float,Description="Global alternative allele frequency (AF) based on Alternate Allele Count/Total Allele Count in the 20110521 1000Genome release">
##INFO=<ID=AMR_AF,Number=1,Type=Float,Description="Alternative allele frequency (AF) for samples from AMR based on 1000G">
##INFO=<ID=ASN_AF,Number=1,Type=Float,Description="Alternative allele frequency (AF) for samples from ASN based on 1000G">
##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Alternative allele frequency (AF) for samples from AFR based on 1000G">
##INFO=<ID=EUR_AF,Number=1,Type=Float,Description="Alternative allele frequency (AF) for samples from EUR based on 1000G">
##INFO=<ID=1000gALT,Number=1,Type=String,Description="Alternative allele referred to by 1000G">
##INFO=<ID=TS,Number=1,Type=String,Description="Sequences in Ensembl v64 EPO Compara 6 primate block">
##INFO=<ID=TSseq,Number=1,Type=String,Description="Primary species bases (in order of TS field) in the EPO Compara 6 primate block">
##INFO=<ID=CAnc,Number=1,Type=String,Description="Ref-Chimp/Human ancestor base at this position">
##INFO=<ID=GAnc,Number=1,Type=String,Description="Ref-Gorilla ancestor base at this position">
##INFO=<ID=OAnc,Number=1,Type=String,Description="Ref-Orang ancestor base at this position">
##INFO=<ID=mSC,Number=1,Type=Float,Description="PhastCons Mammalian conservation score (excluding human)">
##INFO=<ID=pSC,Number=1,Type=Float,Description="PhastCons Primate conservation score (excluding human)">
##INFO=<ID=GRP,Number=1,Type=Float,Description="GERP conservation score">
##INFO=<ID=bSC,Number=1,Type=Float,Description="B score">
##INFO=<ID=Map20,Number=1,Type=Float,Description="Mapability score of Duke University (determined from 20bp reads)">
##INFO=<ID=CpG,Number=0,Type=Flag,Description="Position is in a CpG context based on the Ref/Ancestor">
##INFO=<ID=RM,Number=0,Type=Flag,Description="Position is repeat masked in the reference sequence of the EPO 6 primate block">
##INFO=<ID=SysErr,Number=0,Type=Flag,Description="Position was identified as systematic error in the 1000 genome trios">
##INFO=<ID=SysErrHCB,Number=0,Type=Flag,Description="Position was identified as systematic error based on shared SNPs in human, chimp and bonobo">
##INFO=<ID=UR,Number=0,Type=Flag,Description="Position is in a copy number control region identified by the Eichler lab">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities from Altai">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Filtered Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias from Altai">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes from Altai">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities from Altai">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth from Altai">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias from Altai">
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[/mnt/ngs_data/AltaiNeanderthal_temp_folder/final_bam/hg19_1000g/AltaiNea.hg19_1000g.1.dq.bam] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=[org.broadinstitute.sting.commandline.IntervalBinding@68ed662d] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/mnt/solexa/Genomes/hg19_1000g/whole_genome.fa rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false disable_experimental_low_memory_sharding=false logging_level=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.001 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_ALL_SITES standard_min_confidence_threshold_for_calling=30.0 standard_min_confidence_threshold_for_emitting=30.0 computeSLOD=false alleles=(RodBinding name= source=UNBOUND) assume_single_sample_reads=null min_base_quality_score=17 max_deletion_fraction=0.05 min_indel_count_for_genotyping=5 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10.0 indelGapOpenPenalty=45.0 indelHaplotypeSize=80 bandedIndel=false indelDebug=false ignoreSNPAlleles=false dbsnp=(RodBinding name= source=UNBOUND) out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
##contig=<ID=1,length=249250621>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=2,length=243199373>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=MT,length=16569>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##reference=file:///mnt/solexa/Genomes/hg19_1000g/whole_genome.fa
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Altai   Vindija Sidron  Denisovan   HGDP01284   HGDP00927   DNK02   HGDP00521   HGDP00665   NA12891 HGDP00778   HGDP01307   HGDP00542

Any idea of how I can make PyVCF comes through this?

Cheers!

Guillaume

martijnvermaat commented 10 years ago

Hi Guillaume,

The problem is an error in your VCF file. If you look at your second example record, you see this in the info field:

...;GRP=-3.42;HRun;MQ=0;...

So the HRun key is used as a boolean flag, but in the header it is defined to have an integer value:

##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">

So this is incorrect. I see the file is created by the GATK UnifiedGenotyper. In one of their help pages, I see the HRun annotation field being mentioned, where all examples include an integer value.

Even if we would want to handle this case in PyVCF, I'm not even sure what to do with it. What would it mean?

If you're producing the file yourself you may want to try updating GATK or asking the GATK developers.

deep-introspection commented 10 years ago

Thanks for those precisions. I am unfortunately not producer of the VCF and thus would have to solve this by modifying it in its current version. Since I won't use this information HRun, integer or boolean, do you think the file can be slightly modified (with a regexp for instance) so PyVCF wont't get confused?

Cheers!

Guillaume

martijnvermaat commented 10 years ago

I see. If you're on Linux/OSX/Unix you could use sed to remove these values:

sed 's/;HRun;/;/' original.vcf > fixed.vcf

This ignores any HRun occurrences directly at the start or end of the INFO field, but that probably never happens.

It could however be the case that the file still has other (similar) errors.

deep-introspection commented 10 years ago

This works perfectly! Thanks a lot for your help.

Guillaume