jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html
Other
401 stars 200 forks source link

Error in reading vcf file #91

Closed janedanes closed 11 years ago

janedanes commented 11 years ago

I'm a newbie to python and am having trouble understanding why I can't parse my vcf file. Any help would greatly be appreciated!! Thanks so much.

for record in vcf.Reader(open('all.mergedSNPs.vcf', 'r')): print record

Here is the output:

Traceback (most recent call last): File "", line 1, in File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 216, in init self._parse_metainfo() File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 254, in _parse_metainfo key, val = parser.read_meta(line.strip()) File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 166, in read_meta return self.read_meta_hash(meta_string) File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 161, in read_meta_hash val = dict(item.split("=") for item in hashItems) ValueError: dictionary update sequence element #4 has length 1; 2 is required

seandavi commented 11 years ago

You might try updating to the newest PyVCF version, 0.6.3, to start. I think there were a few bugs fixed between your version (0.6.0) and the current version.

janedanes commented 11 years ago

thanks!

ok I did that and I got this error

import vcf vcf_reader = vcf.Reader(open('/Users/sgbiofuels/Desktop/all.mergedSNPs.vcf','r')) Traceback (most recent call last): File "", line 1, in File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 225, in init self._parse_metainfo() File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 267, in _parse_metainfo key, val = parser.read_meta(line) File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 166, in read_meta return self.read_meta_hash(meta_string) File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 161, in read_meta_hash val = OrderedDict(item.split("=") for item in hashItems) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 74, in init self.update(_args, *_kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_abcoll.py", line 499, in update for key, value in other: ValueError: need more than 1 value to unpack

jamescasbon commented 11 years ago

Its failling over on the metadata.

Can you share the file headers?

janedanes commented 11 years ago

I found out that the algorithm which outputs the vcf file has put 4 # in one of the FORMAT lines. I cleaned that up in vi.

Thanks!

On Tue, Feb 5, 2013 at 5:05 AM, James Casbon notifications@github.comwrote:

Its failling over on the metadata.

Can you share the file headers?

— Reply to this email directly or view it on GitHubhttps://github.com/jamescasbon/PyVCF/issues/91#issuecomment-13128218.

amkenney commented 10 years ago

Hello, I am getting a very similar error, but I can't figure out why. I have pyvcf6.0 installed, which I used to sucessfully parse a vcf file recently.

This is the error I get:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    v = vcf.Reader(filename='CACid.MIM.snps.q40.3.26.14.vcf.gz')
  File "vcf/parser.py", line 216, in __init__
    self._parse_metainfo()
  File "vcf/parser.py", line 254, in _parse_metainfo
    key, val = parser.read_meta(line.strip())
  File "vcf/parser.py", line 166, in read_meta
    return self.read_meta_hash(meta_string)
  File "vcf/parser.py", line 161, in read_meta_hash
    val = dict(item.split("=") for item in hashItems)
ValueError: dictionary update sequence element #4 has length 4; 2 is required

I didn't use to get this error. It seems to only be with some new vcf files I made using the newest version of the GATK.

Thoughts? Thanks, Amanda

amkenney commented 10 years ago

This is what my header looks like. None of the format lines have 4 # signs in them.

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=UnifiedGenotyper,Version=3.0-0-g6bad1c6,Date="Wed Mar 26 14:12:27 EDT 2014",Epoch=1395857547192,CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[AHQT1G_q29.paired.nodup.realign.bam, BOG10G_q29.paired.nodup.realign.bam, DUNG_q29.paired.nodup.realign.bam, IM62.JGI_q29.paired.nodup.realign.bam, REM8G_q29.paired.nodup.realign.bam, SLP9G_q29.paired.nodup.realign.bam, CAC9N_q29.paired.nodup.realign.bam, KOOTN_q29.paired.nodup.realign.bam, MEN104_q29.paired.nodup.realign.bam, Mdent_q29.paired.nodup.realign.bam, LVR.bwamem.q29.paired.nodups.realign.bam, Mlac.WLF47.q29.nodup.paired.realign.bam, Mtil.BAG3.q29.nodup.paired.realign.bam, Mtil.STV115.q29.nodups.paired.realign.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=mg.new.ref.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 allow_bqsr_on_reduced_bams_despite_repeated_warnings=false validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false genotype_likelihoods_model=SNP pcr_error_rate=1.0E-4 computeSLOD=false annotateNDA=false pair_hmm_implementation=LOGLESS_CACHING min_base_quality_score=17 max_deletion_fraction=0.05 allSitePLs=false min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false output_mode=EMIT_ALL_CONFIDENT_SITES heterozygosity=0.001 indel_heterozygosity=1.25E-4 genotyping_mode=DISCOVERY standard_min_confidence_threshold_for_calling=40.0 standard_min_confidence_threshold_for_emitting=40.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=6 input_prior=[] contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exactcallslog=null dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub onlyEmitSamples=[] debug_file=null annotation=[] excludeAnnotation=[] filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##contig=<ID=scaffold_1,length=13654788>
##contig=<ID=scaffold_2,length=19086486>
##contig=<ID=scaffold_3,length=18883600>
##contig=<ID=scaffold_4,length=21212587>
##contig=<ID=scaffold_5,length=24294111>
##contig=<ID=scaffold_6,length=19424543>
##contig=<ID=scaffold_7,length=18155354>
##contig=<ID=scaffold_8,length=24938873>
##contig=<ID=scaffold_9,length=15137105>
##contig=<ID=scaffold_10,length=19333236>
##contig=<ID=scaffold_11,length=25550741>
##contig=<ID=scaffold_12,length=26648731>
##contig=<ID=scaffold_13,length=21248885>
##contig=<ID=scaffold_14,length=26119285>
##contig=<ID=scaffold_15,length=2017172>
##contig=<ID=scaffold_16,length=1459161>
##contig=<ID=scaffold_17,length=1210954>
##contig=<ID=scaffold_23,length=393100>
##contig=<ID=scaffold_34,length=217123>
##contig=<ID=scaffold_38,length=241004>
##contig=<ID=scaffold_43,length=184681>
##contig=<ID=scaffold_45,length=162397>
##contig=<ID=scaffold_47,length=159104>
##contig=<ID=scaffold_52,length=130840>
##contig=<ID=scaffold_53,length=120786>
##contig=<ID=scaffold_56,length=106712>
##contig=<ID=scaffold_57,length=132043>
##contig=<ID=scaffold_61,length=86401>
##contig=<ID=scaffold_63,length=86991>
##contig=<ID=scaffold_66,length=79613>
##contig=<ID=scaffold_68,length=137247>
...deleted a bunch of lines...
##reference=file:///panfs/pstor.storage/mim.bwa.bams.proper.only/mg.new.ref.fa
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  AHQT1G  BOG10G  CAC9N   DUNG    IM62.JGI        KootN   MEN104  Mdent   Mlac.WLF47      Mtil.BAG3       Mtil.STV115     REM8G   SLP9G   til.LVR
amkenney commented 10 years ago

Nothing is appearing my = signs....I have no idea why

martijnvermaat commented 10 years ago

I have no problems parsing that header with the latest PyVCF version (0.6.7). Could you try updating PyVCF? If you installed with pip, a simple pip install -U PyVCF should do it.

(I added fenced code blocks to your comments so they correctly show the example code.)

amkenney commented 10 years ago

I will try the new version and let you know. Thanks!

amkenney commented 10 years ago

Forgive my ignorance, but I only see version 0.6 available for download under releases. I haven't downloaded many program from github (my computer cluster people usually do it). How can I get 0.6.7 or the most recent update?

Thanks!

martijnvermaat commented 10 years ago

How did you previously install PyVCF?

If you installed from PyPI (either using easy_install or pip), you can upgrade with one of the following:

easy_install -U PyVCF
pip install -U PyVCF

If you just want to download the source code, you can do so from the PyPI page, or use the "Download ZIP" button at the bottom-right of the GitHub page.

(You are right that releases after 0.6 were not tagged, so they are not listed on the GitHub releases page.)

amkenney commented 10 years ago

I don't think I every truly "installed" PyVCF. I downloaded the latest release file (0.6). Then I copied the folder 'vcf' into the directory I needed to use PyVCF. Then I start python by calling idle from that directory and import vcf at the beginning of my script. I was introduced by PyVCF by someone else who did it that way, but I think he only did that way because of a single use need.

But, I would rather actually install PyVCF correctly so that I can use it from any directory without having to copy the 'vcf' folder every time. This is sort of a testament to my lack of understanding of how computer programs get installed/integrated on a computer.

I use a mac. I don't know what 'easy_install' or 'pip' are.

martijnvermaat commented 10 years ago

I don't have a mac, but you could probably just run the following command to install PyVCF (needs administrative rights):

sudo easy_install PyVCF
amkenney commented 10 years ago

Updating to 0.6.7 worked. Thanks!