pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks source link

Problem Parsing INFO gnomAD frequencies #607

Open magorbe opened 6 years ago

magorbe commented 6 years ago

Hello, I'm doing this :

vcf_input2 = pysam.VariantFile("vcf_B.vcf.gz",'r') rec = next(vcf_input2.fetch()) print(rec.info["AC"]) print(rec.info["AF"])

when I print AG I get the same data that I have in the vcf input, but the AF data is wrong, in the vcf I have AF=1.06172e-04,1.06172e-04 but in rec.info["AF"] is (1.0, 0.6172000169754028).

Somebody can help me ? Thank you !

bioinformed commented 6 years ago

Thanks for the problem report. Can you show us the AC and AF VCF headers?

magorbe commented 6 years ago

Yes, there are :

INFO=

INFO=

bioinformed commented 6 years ago

Great! Can you also show the full VCF record from above (from the file directly)?

magorbe commented 6 years ago

Yeah, no problem :

10 14953 . G C,A 643.32 RF AC=3,3;AF=1.06172e-04,1.06172e-04;AN=28256;AC_AFR=3,2;AC_AMR=0,0;AC_ASJ=0,0;AC_EAS=0,0;AC_FIN=0,0;AC_NFE=0,1;AC_OTH=0,0;AC_Male=1,1;AC_Female=2,2;AN_AFR=7864;AN_AMR=762;AN_ASJ=282;AN_EAS=1546;AN_FIN=3084;AN_NFE=13832;AN_OTH=886;AN_Male=15602;AN_Female=12654;AF_AFR=3.81485e-04,2.54323e-04;AF_AMR=0.00000e+00,0.00000e+00;AF_ASJ=0.00000e+00,0.00000e+00;AF_EAS=0.00000e+00,0.00000e+00;AF_FIN=0.00000e+00,0.00000e+00;AF_NFE=0.00000e+00,7.22961e-05;AF_OTH=0.00000e+00,0.00000e+00;AF_Male=6.40943e-05,6.40943e-05;AF_Female=1.58053e-04,1.58053e-04;GC_AFR=3927,3,0,2,0,0;GC_AMR=381,0,0,0,0,0;GC_ASJ=141,0,0,0,0,0;GC_EAS=773,0,0,0,0,0;GC_FIN=1542,0,0,0,0,0;GC_NFE=6915,0,0,1,0,0;GC_OTH=443,0,0,0,0,0;GC_Male=7799,1,0,1,0,0;GC_Female=6323,2,0,2,0,0;AC_raw=5,4;AN_raw=30990;AF_raw=1.61342e-04,1.29074e-04;GC_raw=15486,5,0,4,0,0;GC=14122,3,0,3,0,0;AC_POPMAX=3,2;AN_POPMAX=7864,7864;AF_POPMAX=3.81485e-04,2.54323e-04

bioinformed commented 6 years ago

And what version of pysam are you using and on what platform? (sorry-- should have asked all of this all at once.)

magorbe commented 6 years ago

Don't Worry :

Name: pysam Version: 0.11.2.1

Name: Python Version: 2.7.12

magorbe commented 6 years ago

Platform : Linux

bioinformed commented 6 years ago

After re-reading the specs, it appears that explicit support for accepting floating point values in scientific notation was only added to the VCF 4.3 spec. No mention is made in the VCF 4.2 spec. I'll dig into the htslib code to see what it says about the matter, since it and htsjdk is often the final arbiter of truth.

magorbe commented 6 years ago

But then, was is not contemplated that the VCF could have floating-point values? Is it possible that I can modify the function that do the parsing in order to solve it in some way? Thanks!