pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

Parsing INFO fields with "Number=G" fails #593

Closed AndreasHeger closed 6 years ago

AndreasHeger commented 6 years ago

Pysam fails parsing VCFs created by the gnomAD project. The issue is that these VCFs define records such as:

##INFO=<ID=GC_AFR,Number=G,Type=Integer,Description="Count of African/African American individuals for each genotype">

which pysam explicitely ignores:

import pysam
fn = "gnomad.genomes.r2.0.1.sites.22.vcf.gz"
bcf_in = pysam.VariantFile(fn)
rec = next(bcf_in.fetch())
print(rec.info.keys())
print(rec.info["AC"])
print(rec.info["GC"]

gives

['AC', 'AF', 'AN', 'BaseQRankSum', 'ClippingRankSum', 'DB', 'DP', 'FS', 'InbreedingCoeff', 'MQ', 'MQRankSum', 'QD', 'ReadPosRankSum', 'SOR', 'VQSLOD', 'VQSR_culprit', 'GQ_HIST_ALT', 'DP_HIST_ALT', 'AB_HIST_ALT', 'GQ_HIST_ALL', 'DP_HIST_ALL', 'AB_HIST_ALL', 'AC_AFR', 'AC_AMR', 'AC_ASJ', 'AC_EAS', 'AC_FIN', 'AC_NFE', 'AC_OTH', 'AC_Male', 'AC_Female', 'AN_AFR', 'AN_AMR', 'AN_ASJ', 'AN_EAS', 'AN_FIN', 'AN_NFE', 'AN_OTH', 'AN_Male', 'AN_Female', 'AF_AFR', 'AF_AMR', 'AF_ASJ', 'AF_EAS', 'AF_FIN', 'AF_NFE', 'AF_OTH', 'AF_Male', 'AF_Female', 'GC_AFR', 'GC_AMR', 'GC_ASJ', 'GC_EAS', 'GC_FIN', 'GC_NFE', 'GC_OTH', 'GC_Male', 'GC_Female', 'AC_raw', 'AN_raw', 'AF_raw', 'GC_raw', 'GC', 'Hom_AFR', 'Hom_AMR', 'Hom_ASJ', 'Hom_EAS', 'Hom_FIN', 'Hom_NFE', 'Hom_OTH', 'Hom_Male', 'Hom_Female', 'Hom_raw', 'Hom', 'POPMAX', 'AC_POPMAX', 'AN_POPMAX', 'AF_POPMAX', 'DP_MEDIAN', 'DREF_MEDIAN', 'GQ_MEDIAN', 'AB_MEDIAN', 'AS_RF', 'AS_FilterStatus', 'CSQ']
(67,)
Traceback (most recent call last):
  File "x.py", line 9, in <module>
    print(rec.info["GC"])
  File "pysam/libcbcf.pyx", line 2535, in pysam.libcbcf.VariantRecordInfo.__getitem__ (pysam/libcbcf.c:38590)
  File "pysam/libcbcf.pyx", line 522, in pysam.libcbcf.bcf_info_get_value (pysam/libcbcf.c:8630)
  File "pysam/libcbcf.pyx", line 505, in pysam.libcbcf.bcf_get_value_count (pysam/libcbcf.c:8476)
  File "pysam/libcbcf.pyx", line 226, in pysam.libcbcf.bcf_genotype_count (pysam/libcbcf.c:4797)
ValueError: genotype is only valid as a format field
bioinformed commented 6 years ago

See https://github.com/samtools/hts-specs/issues/272

AndreasHeger commented 6 years ago

The consensus of the discussion in hts-specs is to deprecate Number=G in the INFO field. A workaround to parse the Gnomad files is to replace Number=G with Number=. within the file:

cat gnomad.vcf | perl -p -e 's/Number=G/Number=./ if (/INFO/)'