quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License

Parsing field names #27

Closed matthdsm closed 7 years ago

matthdsm commented 7 years ago

Hi,

I'm running into an issue with dbNSFP annotations, but I'm not sure whether it originates here or in the geneimpacts module.

Our situation: Since VEP has proven to be untrustworthy with annotating dbNSFP, we've switched to vcfanno, which has been great so far. All required fields are available, with the right type.

VCF header:

##INFO=<ID=M-CAP_pred,Number=1,Type=String,Description="calculated by first of overlapping values in column 69 from /home/galaxy/bcbio/genomes/Hsapiens/hg38/variation/dbNSFP.txt.gz">
##INFO=<ID=M-CAP_rankscore,Number=1,Type=Float,Description="calculated by mean of overlapping values in column 68 from /home/galaxy/bcbio/genomes/Hsapiens/hg38/variation/dbNSFP.txt.gz">
##INFO=<ID=M-CAP_score,Number=1,Type=Float,Description="calculated by max of overlapping values in column 67 from /home/galaxy/bcbio/genomes/Hsapiens/hg38/variation/dbNSFP.txt.gz">

However, for the M-CAP fields above, the field names get parsed to m_cap_pred, m_cap_rankscore and m_cap_score: everything is lowercased and dashes are converted to underscores. The renaming itself is only a minor issue, but the columns with the adjusted names contain no data at all.

Is it possible that by changing the column name, the data lookup is broken?
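To make the suspected failure mode concrete, here is a minimal Python sketch. It is not vcf2db's actual code: the `sanitize` helper and the dictionary-based record are hypothetical stand-ins, and the values are the ones from the example entry in this report. It shows how looking up the renamed column in the record yields nothing, while keeping a mapping back to the original INFO key does not.

```python
import re

def sanitize(name):
    """Hypothetical column-name cleanup: lowercase the name and
    replace anything that is not a-z, 0-9 or underscore with '_'."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

# INFO values as they appear in the VCF record (original key names).
info = {"M-CAP_score": 0.0, "M-CAP_rankscore": 0.0, "M-CAP_pred": "."}

# Suspected bug: the sanitized name is used to look up the value,
# but the record only knows the original key, so every lookup misses.
buggy = {sanitize(key): info.get(sanitize(key)) for key in info}

# Possible remedy: remember which original key each column came from.
column_to_key = {sanitize(key): key for key in info}
fixed = {column: info[key] for column, key in column_to_key.items()}

print(buggy)
print(fixed)
```

In this sketch every value in `buggy` is `None`, which would match the empty columns described above, while `fixed` carries the original values under the renamed columns.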

Example VCF entry:

chr1    979496  rs13302983      T       C       9681.8  PASS    AC=2;AF=1;AN=2;DB;DP=279;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQ0=0;QD=34.83;SOR=0.708;CSQ=C|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding|3/5||ENST00000341290.6:c.1192A>G|ENSP00000343864.2:p.Ser398Gly|1228/3035|1192/2031|398/676|S/G|Agc/Ggc|rs13302983||-1||SNV|HGNC|HGNC:28208||2|A2||ENSP00000343864|Q5SV97||UPI000022DAF4||tolerated(0.2)|benign(0)|||T:0.0331|C:0.9418|C:0.9539|C:0.9990|C:0.9592|C:0.9847|||C:0.954|C:0.9664|C:0.957|C:0.9554|C:0.9978|C:0.9741|C:0.9386|C:0.9533|C:0.9823||||||||||||||||,C|downstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379407|protein_coding|||||-/2194|-/1731|-/576|||rs13302983|4488|1||SNV|HGNC|HGNC:25284||1|A2|CCDS53256.1|ENSP00000368717|Q494U1||UPI00005764FF||||||T:0.0331|C:0.9418|C:0.9539|C:0.9990|C:0.9592|C:0.9847|||C:0.954|C:0.9664|C:0.957|C:0.9554|C:0.9978|C:0.9741|C:0.9386|C:0.9533|C:0.9823||||||||||||||||,C|downstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379409|protein_coding|||||-/2455|-/1992|-/663|||rs13302983|4488|1||SNV|HGNC|HGNC:25284||2|A2||ENSP00000368719|Q494U1||UPI0000D61E06||||||T:0.0331|C:0.9418|C:0.9539|C:0.9990|C:0.9592|C:0.9847|||C:0.954|C:0.9664|C:0.957|C:0.9554|C:0.9978|C:0.9741|C:0.9386|C:0.9533|C:0.9823||||||||||||||||,C|downstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379410|protein_coding|||||-/2404|-/1836|-/611|||rs13302983|4388|1||SNV|HGNC|HGNC:25284|YES|1|P3|CCDS4.1|ENSP00000368720|Q494U1||UPI00001416D8||||||T:0.0331|C:0.9418|C:0.9539|C:0.9990|C:0.9592|C:0.9847|||C:0.954|C:0.9664|C:0.957|C:0.9554|C:0.9978|C:0.9741|C:0.9386|C:0.9533|C:0.9823||||||||||||||||,C|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||ENST00000433179.3:c.1534A>G|ENSP00000414022.3:p.Ser512Gly|1534/3340|1534/2373|512/790|S/G|Agc/Ggc|rs13302983||-1||SNV|HGNC|HGNC:28208|YES|5|P2|CCDS76083.1|ENSP00000
414022|Q5SV97||UPI0003E30FA7||tolerated(0.18)|benign(0.013)|||T:0.0331|C:0.9418|C:0.9539|C:0.9990|C:0.9592|C:0.9847|||C:0.954|C:0.9664|C:0.957|C:0.9554|C:0.9978|C:0.9741|C:0.9386|C:0.9533|C:0.9823||||||||||||||||,C|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||-/1208|||||rs13302983|2855|-1||SNV|HGNC|HGNC:28208||1||||||||||||T:0.0331|C:0.9418|C:0.9539|C:0.9990|C:0.9592|C:0.9847|||C:0.954|C:0.9664|C:0.957|C:0.9554|C:0.9978|C:0.9741|C:0.9386|C:0.9533|C:0.9823||||||||||||||||,C|downstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000491024|protein_coding|||||-/698|-/531|-/176|||rs13302983|3631|1|cds_start_NF|SNV|HGNC|HGNC:25284||3|||ENSP00000462558||J3KSM5|UPI000268AE1F||||||T:0.0331|C:0.9418|C:0.9539|C:0.9990|C:0.9592|C:0.9847|||C:0.954|C:0.9664|C:0.957|C:0.9554|C:0.9978|C:0.9741|C:0.9386|C:0.9533|C:0.9823||||||||||||||||;LRT_score=0.2111;LRT_converted_rankscore=0.03385;LRT_pred=N;LRT_Omega=1.552310;MutationTaster_score=1,1;MutationTaster_converted_rankscore=0.08979;MutationTaster_pred=P,P;MutationTaster_model=simple_aae,simple_aae;MutationTaster_AAE=S398G,S418G;MutationAssessor_UniprotID=.;MutationAssessor_variant=.;MutationAssessor_score=0;MutationAssessor_score_rankscore=0;MutationAssessor_pred=.;FATHMM_score=0;FATHMM_converted_rankscore=0.5265;FATHMM_pred=T,.;PROVEAN_score=0;PROVEAN_converted_rankscore=0.3517;PROVEAN_pred=N,.;Transcript_id_VEST3=ENST00000433179,ENST00000341290;Transcript_var_VEST3=S418G,S398G;VEST3_score=0;VEST3_rankscore=0.0242;MetaSVM_score=-1.0002;MetaSVM_rankscore=0.2986;MetaSVM_pred=T;MetaLR_score=0;MetaLR_rankscore=0.00011;MetaLR_pred=T;Reliability_index=8;M-CAP_score=0;M-CAP_rankscore=0;M-CAP_pred=.;CADD_raw=-1.062;CADD_raw_rankscore=0.0297;CADD_phred=0.012;DANN_score=0.8967;DANN_rankscore=0.1842;fathmm-MKL_coding_score=0.0599;fathmm-MKL_coding_rankscore=0.1181;fathmm-MKL_coding_pred=N;fathmm-MKL_coding_group=AEFDBCI;Eigen_coding_or_noncoding=c;Eigen-raw=-1.5553;Eigen
-phred=0.0657;Eigen-PC-raw=-1.6084;Eigen-PC-phred=0.0728;Eigen-PC-raw_rankscore=0.01576;GenoCanyon_score=0.9011;GenoCanyon_score_rankscore=0.2606;integrated_fitCons_score=0.4031;integrated_fitCons_score_rankscore=0.0524;integrated_confidence_value=0;GM12878_fitCons_score=0.5781;GM12878_fitCons_score_rankscore=0.3309;GM12878_confidence_value=0;H1-hESC_fitCons_score=0.5781;H1-hESC_fitCons_score_rankscore=0.2902;H1-hESC_confidence_value=0;HUVEC_fitCons_score=0.5628;HUVEC_fitCons_score_rankscore=0.2023;HUVEC_confidence_value=0;GERP++_NR=4.81;GERP++_RS=-2.46;GERP++_RS_rankscore=0.0606;phyloP100way_vertebrate=-1.229;phyloP100way_vertebrate_rankscore=0.0303;phyloP20way_mammalian=-0.308;phyloP20way_mammalian_rankscore=0.079;phastCons100way_vertebrate=0;phastCons100way_vertebrate_rankscore=0.0633;phastCons20way_mammalian=0.001;phastCons20way_mammalian_rankscore=0.0434;SiPhy_29way_pi=0.1119:0.2403:0.1138:0.534;SiPhy_29way_logOdds=2.6788;SiPhy_29way_logOdds_rankscore=0.0474;clinvar_rs=.;clinvar_clnsig=.;clinvar_trait=.;clinvar_golden_stars_int=0;Interpro_domain=.;GTEx_V6_gene=ENSG00000187608.5|ENSG00000224969.1|ENSG00000272438.1|ENSG00000187608.5;GTEx_V6_tissue=Cells_Transformed_fibroblasts|Thyroid|Thyroid|Whole_Blood;an_EOG=30;ac_EOG=T:1&C:29;gc_hom_ref_EOG=0;gc_het_alt_EOG=1;gc_hom_alt_EOG=14        GT:AD:DP:GQ:PL  1/1:0,278:278:99:9710,832,0

Upon checking further, I notice all fields containing a dash (-) suffer from the same issue.
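A quick way to list every affected field is to scan the header for INFO IDs that contain a dash. The sketch below is illustrative only: the `ids_with_dash` helper is hypothetical, the `Description` values are shortened, and the `LRT_score` header line is an assumed example of a field without a dash (the field name appears in the entry above, but its header line is not shown in this report).

```python
import re

# Matches the ID of a '##INFO=<ID=...,' header line.
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")

def ids_with_dash(header_lines):
    """Return the INFO IDs that contain a '-' character, in header order."""
    found = []
    for line in header_lines:
        match = INFO_ID.match(line)
        if match and "-" in match.group(1):
            found.append(match.group(1))
    return found

# Header lines with shortened descriptions; LRT_score is an assumed line.
header_lines = [
    '##INFO=<ID=M-CAP_pred,Number=1,Type=String,Description="...">',
    '##INFO=<ID=M-CAP_rankscore,Number=1,Type=Float,Description="...">',
    '##INFO=<ID=M-CAP_score,Number=1,Type=Float,Description="...">',
    '##INFO=<ID=LRT_score,Number=1,Type=Float,Description="...">',
]

print(ids_with_dash(header_lines))
```

Running the same function over the full header of a real file (for example, the lines starting with `##INFO` read from the VCF) would give the complete list of columns to check.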

Thanks for looking into this.
M

matthdsm commented 7 years ago

Also, the dbNSFP annotations aren't added to the variant_impacts table, but I suppose that's because they're not in a CSQ or EFF tag. @brentp, any idea if this could be solved?

Thanks,
Matthias

brentp commented 7 years ago

I saw this go by last week but forgot to have a look. I will look Tuesday morning. I suspect you're right that it's an issue with "-".

brentp commented 7 years ago

I just pushed a fix for this. It's not going to propagate to the variant_impacts table as that is for CSQ stuff as you noted. Please let me know if you see any more issues.

matthdsm commented 7 years ago

Hi Brent,

Thanks for the fix. Is there any chance that these annotations will be included in the variant_impacts table in the future? It would be nice to keep all functionality without depending on VEP/snpEff.

M