quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License
55 stars 13 forks source link

snpEff annotations missing from db #31

Closed matthdsm closed 6 years ago

matthdsm commented 7 years ago

Hi Brent,

I've encountered a small issue where some annotations have gone missing. I annotated a vcf file with SnpEff using the following command:

export PATH=/home/galaxy/bcbio-dev/anaconda/bin:$PATH &&  /home/galaxy/bcbio-dev/anaconda/bin/snpEff -Xms750m -Xmx11g -Djava.io.tmpdir=/tmp/bcbio/tmpLwOF4d/tmp eff -dataDir /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/snpeff -hgvs -noLog -i vcf -o vcf -csvStats /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/joint/gatk-haplotype-joint/GiB-joint/GiB-joint-joint-effects-stats.csv -s /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/joint/gatk-haplotype-joint/GiB-joint/GiB-joint-joint-effects-stats.html GRCh38.86 /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/joint/gatk-haplotype-joint/GiB-joint/GiB-joint-joint.vcf.gz | /home/galaxy/bcbio-dev/anaconda/bin/pbgzip -n 2  -c > /tmp/bcbio/tmpLwOF4d/GiB-joint-joint-effects.vcf.gz

resulting in a vcf file with this header entry

##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">

and these lines in the vcf

chr1    946247  rs2272757       G       A       4033.8  PASS    AC=2;AF=1;AN=2;DB;DP=106;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQ0=0;QD=29.82;SOR=3.237;ANN=A|synonymous_variant|LOW|NOC2L|ENSG00000188976|transcript|ENST00000327044.6|protein_coding|16/19|c.1843C>T|p.Leu615Leu|1893/2790|1843/2250|615/749||,A|upstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000496938.1|processed_transcript||n.-685C>T|||||685|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000455979.1|protein_coding||c.*2094G>A|||||1988|WARNING_TRANSCRIPT_NO_START_CODON,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000464948.1|retained_intron||n.*3355G>A|||||3355|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000466827.1|retained_intron||n.*3445G>A|||||3445|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000622503.4|protein_coding||c.*2094G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000342066.7|protein_coding||c.*2094G>A|||||1672|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000616016.4|protein_coding||c.*3552G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000616125.4|protein_coding||c.*2094G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000617307.4|protein_coding||c.*2094G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000618181.4|protein_coding||c.*2094G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000618323.4|protein_coding||c.*3392G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000618779.4|protein_coding||c.*2094G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000620200.4|protein_coding||c.*3392G>A|||||1666|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000341065.8|protein_coding||c.*2094G>A|||||1672|WARNING_TRANSCRIPT_NO_START_CODON,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000478729.1|processed_transcript||n.*4074G>A|||||4074|,A|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000474461.1|retained_intron||n.*3253G>A|||||3253|,A|non_coding_transcript_exon_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000483767.5|retained_intron|2/5|n.699C>T||||||,A|non_coding_transcript_exon_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000477976.5|retained_intron|14/17|n.3290C>T||||||;ac_exac_all=68060;an_exac_all=120386;ac_adj_exac_afr=1436;an_adj_exac_afr=10282;ac_adj_exac_amr=5557;an_adj_exac_amr=11482;ac_adj_exac_eas=5642;an_adj_exac_eas=8600;ac_adj_exac_fin=4075;an_adj_exac_fin=6550;ac_adj_exac_nfe=41521;an_adj_exac_nfe=66084;ac_adj_exac_oth=514;an_adj_exac_oth=896;ac_adj_exac_sas=9315;an_adj_exac_sas=16492;num_exac_Het=27348;num_exac_Hom=20356;af_esp_ea=0.362093023;af_esp_aa=0.843778383;af_esp_all=0.525223008;rs_ids=rs2272757;an_EOG=178;ac_EOG=G:72&A:106;gc_hom_ref_EOG=19;gc_het_alt_EOG=34;gc_hom_alt_EOG=36;af_exac_all=0.5653;af_adj_exac_afr=0.1397;af_adj_exac_amr=0.484;af_adj_exac_eas=0.656;af_adj_exac_fin=0.6221;af_adj_exac_nfe=0.6283;af_adj_exac_oth=0.5737;af_adj_exac_sas=0.5648;max_aaf_all=0.8438   GT:AD:DP:GQ:PL  1/1:0,105:105:99:4062,314,0
chr1    948245  rs4970378       A       G       3778.8  PASS    AC=2;AF=1;AN=2;DB;DP=113;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQ0=0;QD=33.44;SOR=0.711;ANN=G|upstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000483767.5|retained_intron||n.-1185T>C|||||1185|,G|upstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000496938.1|processed_transcript||n.-2683T>C|||||2683|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000455979.1|protein_coding||c.*4092A>G|||||3986|WARNING_TRANSCRIPT_NO_START_CODON,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000622503.4|protein_coding||c.*4092A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000342066.7|protein_coding||c.*4092A>G|||||3670|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000616016.4|protein_coding||c.*5550A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000616125.4|protein_coding||c.*4092A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000617307.4|protein_coding||c.*4092A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000618181.4|protein_coding||c.*4092A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000618323.4|protein_coding||c.*5390A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000618779.4|protein_coding||c.*4092A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000620200.4|protein_coding||c.*5390A>G|||||3664|,G|downstream_gene_variant|MODIFIER|SAMD11|ENSG00000187634|transcript|ENST00000341065.8|protein_coding||c.*4092A>G|||||3670|WARNING_TRANSCRIPT_NO_START_CODON,G|intron_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000327044.6|protein_coding|13/18|c.1558-13T>C||||||,G|intron_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000477976.5|retained_intron|11/16|n.3005-13T>C||||||;ac_exac_all=48749;an_exac_all=48752;ac_adj_exac_afr=4791;an_adj_exac_afr=4794;ac_adj_exac_amr=3240;an_adj_exac_amr=3240;ac_adj_exac_eas=3312;an_adj_exac_eas=3312;ac_adj_exac_fin=1284;an_adj_exac_fin=1284;ac_adj_exac_nfe=25942;an_adj_exac_nfe=25942;ac_adj_exac_oth=366;an_adj_exac_oth=366;ac_adj_exac_sas=9814;an_adj_exac_sas=9814;num_exac_Het=3;num_exac_Hom=24373;af_esp_ea=0.000000000;af_esp_aa=0.000457038;af_esp_all=0.000154416;rs_ids=rs4970378;an_EOG=178;ac_EOG=A:0&G:178;gc_hom_ref_EOG=0;gc_het_alt_EOG=0;gc_hom_alt_EOG=89;af_exac_all=0.9999;af_adj_exac_afr=0.9994;af_adj_exac_amr=1;af_adj_exac_eas=1;af_adj_exac_fin=1;af_adj_exac_nfe=1;af_adj_exac_oth=1;af_adj_exac_sas=1;max_aaf_all=1 GT:AD:DP:GQ:PL  1/1:0,113:113:99:3807,338,0

When loading the vcf into a db, all annotations inside the <ANN> tag seem to have gone missing. Could it be something went wrong during parsing?

vcf2db command:

/home/galaxy/bcbio-dev/anaconda/bin/vcf2db.py /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiB-joint-gatk-haplotype-joint-decompose-effects-annotated-gemini.vcf.gz /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiB-joint-gatk-haplotype-joint-decompose-effects-annotated-gemini.ped /tmp/bcbio/tmp5hDItc/GiB-joint-gatk-haplotype-joint.db

PS: this was all done automatically by bcbio

brentp commented 7 years ago

can you add a vcf with just those 2 lines to this issue? Then I can debug more easily.

matthdsm commented 7 years ago

Hi Brent, Have the whole vcf, it's just GiaB, so no big secrets here

GiB-joint-gatk-haplotype-joint-decompose-effects-annotated-gemini.vcf.gz

Thanks for the help! M

matthdsm commented 7 years ago

Hi Brent,

I think I found the issue here. It seems like the HGVS annotations are there, but under another name. The annotations can be found under condon_change and aa_change. Why would the names of these columns change so much? Is this for legacy support from the EFF field?

Also, I noticed there are columns for polyphen and sift scores, but there isn't any data in the vcf. I suppose these are also artefacts from the ANN tag parsing in geneimpacts?

Thanks again. M