janedanes closed this issue 11 years ago
You might try updating to the newest PyVCF version, 0.6.3, to start. I think there were a few bugs fixed between your version (0.6.0) and the current version.
thanks!
OK, I did that and I got this error:
import vcf
vcf_reader = vcf.Reader(open('/Users/sgbiofuels/Desktop/all.mergedSNPs.vcf', 'r'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 225, in __init__
    self._parse_metainfo()
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 267, in _parse_metainfo
    key, val = parser.read_meta(line)
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 166, in read_meta
    return self.read_meta_hash(meta_string)
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 161, in read_meta_hash
    val = OrderedDict(item.split("=") for item in hashItems)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 74, in __init__
    self.update(*args, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_abcoll.py", line 499, in update
    for key, value in other:
ValueError: need more than 1 value to unpack
It's failing over on the metadata.
Can you share the file headers?
I found out that the algorithm that outputs the VCF file put four '#' characters in one of the FORMAT lines. I cleaned that up in vi.
Thanks!
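In case anyone else runs into the same thing, that cleanup can also be scripted instead of done by hand in vi. A minimal sketch, assuming the only problem is extra leading '#' characters on header lines (the output filename here is made up):

with open('all.mergedSNPs.vcf') as src, open('all.mergedSNPs.fixed.vcf', 'w') as dst:
    for line in src:
        if line.startswith('###'):
            # a meta line with too many hashes, e.g. '####FORMAT=<...>'
            line = '##' + line.lstrip('#')
        dst.write(line)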
Hello, I am getting a very similar error, but I can't figure out why. I have PyVCF 0.6 installed, which I used to successfully parse a VCF file recently.
This is the error I get:
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
v = vcf.Reader(filename='CACid.MIM.snps.q40.3.26.14.vcf.gz')
File "vcf/parser.py", line 216, in __init__
self._parse_metainfo()
File "vcf/parser.py", line 254, in _parse_metainfo
key, val = parser.read_meta(line.strip())
File "vcf/parser.py", line 166, in read_meta
return self.read_meta_hash(meta_string)
File "vcf/parser.py", line 161, in read_meta_hash
val = dict(item.split("=") for item in hashItems)
ValueError: dictionary update sequence element #4 has length 4; 2 is required
I didn't use to get this error. It only seems to happen with some new VCF files I made using the newest version of GATK.
Thoughts? Thanks, Amanda
This is what my header looks like. None of the FORMAT lines have four '#' signs in them.
##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=UnifiedGenotyper,Version=3.0-0-g6bad1c6,Date="Wed Mar 26 14:12:27 EDT 2014",Epoch=1395857547192,CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[AHQT1G_q29.paired.nodup.realign.bam, BOG10G_q29.paired.nodup.realign.bam, DUNG_q29.paired.nodup.realign.bam, IM62.JGI_q29.paired.nodup.realign.bam, REM8G_q29.paired.nodup.realign.bam, SLP9G_q29.paired.nodup.realign.bam, CAC9N_q29.paired.nodup.realign.bam, KOOTN_q29.paired.nodup.realign.bam, MEN104_q29.paired.nodup.realign.bam, Mdent_q29.paired.nodup.realign.bam, LVR.bwamem.q29.paired.nodups.realign.bam, Mlac.WLF47.q29.nodup.paired.realign.bam, Mtil.BAG3.q29.nodup.paired.realign.bam, Mtil.STV115.q29.nodups.paired.realign.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=mg.new.ref.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 allow_bqsr_on_reduced_bams_despite_repeated_warnings=false validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false genotype_likelihoods_model=SNP pcr_error_rate=1.0E-4 computeSLOD=false annotateNDA=false pair_hmm_implementation=LOGLESS_CACHING min_base_quality_score=17 max_deletion_fraction=0.05 allSitePLs=false min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false output_mode=EMIT_ALL_CONFIDENT_SITES heterozygosity=0.001 indel_heterozygosity=1.25E-4 genotyping_mode=DISCOVERY standard_min_confidence_threshold_for_calling=40.0 standard_min_confidence_threshold_for_emitting=40.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=6 input_prior=[] contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exactcallslog=null dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub 
onlyEmitSamples=[] debug_file=null annotation=[] excludeAnnotation=[] filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##contig=<ID=scaffold_1,length=13654788>
##contig=<ID=scaffold_2,length=19086486>
##contig=<ID=scaffold_3,length=18883600>
##contig=<ID=scaffold_4,length=21212587>
##contig=<ID=scaffold_5,length=24294111>
##contig=<ID=scaffold_6,length=19424543>
##contig=<ID=scaffold_7,length=18155354>
##contig=<ID=scaffold_8,length=24938873>
##contig=<ID=scaffold_9,length=15137105>
##contig=<ID=scaffold_10,length=19333236>
##contig=<ID=scaffold_11,length=25550741>
##contig=<ID=scaffold_12,length=26648731>
##contig=<ID=scaffold_13,length=21248885>
##contig=<ID=scaffold_14,length=26119285>
##contig=<ID=scaffold_15,length=2017172>
##contig=<ID=scaffold_16,length=1459161>
##contig=<ID=scaffold_17,length=1210954>
##contig=<ID=scaffold_23,length=393100>
##contig=<ID=scaffold_34,length=217123>
##contig=<ID=scaffold_38,length=241004>
##contig=<ID=scaffold_43,length=184681>
##contig=<ID=scaffold_45,length=162397>
##contig=<ID=scaffold_47,length=159104>
##contig=<ID=scaffold_52,length=130840>
##contig=<ID=scaffold_53,length=120786>
##contig=<ID=scaffold_56,length=106712>
##contig=<ID=scaffold_57,length=132043>
##contig=<ID=scaffold_61,length=86401>
##contig=<ID=scaffold_63,length=86991>
##contig=<ID=scaffold_66,length=79613>
##contig=<ID=scaffold_68,length=137247>
...deleted a bunch of lines...
##reference=file:///panfs/pstor.storage/mim.bwa.bams.proper.only/mg.new.ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AHQT1G BOG10G CAC9N DUNG IM62.JGI KootN MEN104 Mdent Mlac.WLF47 Mtil.BAG3 Mtil.STV115 REM8G SLP9G til.LVR
Nothing appears wrong with my = signs... I have no idea why.
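For what it's worth, the '=' signs don't have to look wrong for an older parser to choke. The ##GATKCommandLine line carries a quoted CommandLineOptions value that itself contains commas and '=' signs, so a reader that naively splits the header body on ',' and then each item on '=' ends up with pieces that are not simple key=value pairs. (In this header, the fifth comma-separated item of that line is CommandLineOptions="analysis_type=..., which would line up with the "element #4 has length 4" in the traceback.) A minimal sketch of that failure mode, using a shortened, made-up stand-in for the real header body:

# Shortened, hypothetical stand-in for the body of a ##GATKCommandLine line;
# the real one embeds many more key=value pairs inside the quoted string.
meta = ('ID=UnifiedGenotyper,Version=3.0-0-g6bad1c6,'
        'CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[a.bam, b.bam]"')

items = meta.split(',')  # naive split that ignores the quotes
for i, item in enumerate(items):
    parts = item.split('=')
    if len(parts) != 2:
        print('item #%d splits into %d pieces: %r' % (i, len(parts), item))

# dict(item.split('=') for item in items) then raises the same kind of
# "dictionary update sequence element ...; 2 is required" ValueError as in the traceback above.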
I have no problems parsing that header with the latest PyVCF version (0.6.7). Could you try updating PyVCF? If you installed with pip, a simple pip install -U PyVCF should do it.
(I added fenced code blocks to your comments so they correctly show the example code.)
I will try the new version and let you know. Thanks!
Forgive my ignorance, but I only see version 0.6 available for download under releases. I haven't downloaded many programs from GitHub (my computer cluster people usually do it). How can I get 0.6.7 or the most recent update?
Thanks!
How did you previously install PyVCF?
If you installed from PyPI (either using easy_install or pip), you can upgrade with one of the following:
easy_install -U PyVCF
pip install -U PyVCF
If you just want to download the source code, you can do so from the PyPI page, or use the "Download ZIP" button at the bottom-right of the GitHub page.
(You are right that releases after 0.6 were not tagged, so they are not listed on the GitHub releases page.)
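If it is easier, installing straight from this repository should also work, assuming git and a reasonably recent pip are available:
pip install git+https://github.com/jamescasbon/PyVCF.git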
I don't think I ever truly "installed" PyVCF. I downloaded the latest release file (0.6). Then I copied the 'vcf' folder into the directory where I needed to use PyVCF. Then I start Python by calling idle from that directory and import vcf at the beginning of my script. I was introduced to PyVCF by someone else who did it that way, but I think he only did it that way because of a single-use need.
But, I would rather actually install PyVCF correctly so that I can use it from any directory without having to copy the 'vcf' folder every time. This is sort of a testament to my lack of understanding of how computer programs get installed/integrated on a computer.
I use a mac. I don't know what 'easy_install' or 'pip' are.
I don't have a mac, but you could probably just run the following command to install PyVCF (needs administrative rights):
sudo easy_install PyVCF
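Once that finishes, it is worth checking which version Python actually picks up. PyVCF exposes its version string as vcf.VERSION (at least in recent releases), so something like this from the interpreter should report 0.6.7 or newer:
import vcf
print(vcf.VERSION)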
Updating to 0.6.7 worked. Thanks!
I'm a newbie to Python and am having trouble understanding why I can't parse my VCF file. Any help would be greatly appreciated!! Thanks so much.
for record in vcf.Reader(open('all.mergedSNPs.vcf', 'r')):
    print record
Here is the output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 216, in __init__
self._parse_metainfo()
File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 254, in _parse_metainfo
key, val = parser.read_meta(line.strip())
File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 166, in read_meta
return self.read_meta_hash(meta_string)
File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 161, in read_meta_hash
val = dict(item.split("=") for item in hashItems)
ValueError: dictionary update sequence element #4 has length 1; 2 is required