Closed dlaehnemann closed 5 years ago
@jrobinso Just a quick bump-up, so that this doesn't get overlooked.
@dlaehnemann Sorry it had been overlooked. Could you attach a small example file with ANN field(s), (or add it to the test/data directory)? Thanks.
Documentation is here, noting it for future reference as field order is hardcoded: http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf
@jrobinso, sorry for the slow response, I meant to answer sooner. That is exactly the documentation I based the changes on; I should have included it in the original PR message.
The field order is hard-coded, e.g. snpeff generates the following VCF header line to describe it:
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
For jannovar it's:
##INFO=<ID=ANN,Number=1,Type=String,Description="Functional annotations:'Allele|Annotation|Annotation_Impact|Gene_Name|Gene_ID|Feature_Type|Feature_ID|Transcript_BioType|Rank|HGVS.c|HGVS.p|cDNA.pos / cDNA.length|CDS.pos / CDS.length|AA.pos / AA.length|Distance|ERRORS / WARNINGS / INFO'">
(Should also be Number=., this should be consistent in future jannovar versions: https://github.com/charite/jannovar/pull/455)
So, in theory the parsing could also be done based on the header line, if the fixed order should ever change. But for now, the fixed indexing should be fine. I'll also add two minimal VCF files with the ANN annotations for snpeff and jannovar before this PR is ready to merge.
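For reference, a minimal sketch of what header-based parsing could look like, assuming the subfield names are always listed between single quotes in the Description and separated by `|`. The function names `ann_field_names` and `parse_ann_entry` and the example ANN entry are made up for illustration and are not part of igv-reports:

```python
import re

# The snpeff header line quoted above, split over two literals for readability.
SNPEFF_HEADER = (
    "##INFO=<ID=ANN,Number=.,Type=String,Description=\"Functional annotations: "
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type "
    "| Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p "
    "| cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length "
    "| Distance | ERRORS / WARNINGS / INFO'\">"
)


def ann_field_names(header_line):
    """Return the ANN subfield names listed in an ##INFO=<ID=ANN,...> line."""
    # Assumption: the names sit between the first pair of single quotes.
    match = re.search(r"'([^']*)'", header_line)
    if match is None:
        raise ValueError("no quoted field list found in ANN header line")
    # Stripping whitespace makes the snpeff and jannovar spellings equivalent.
    return [name.strip() for name in match.group(1).split("|")]


def parse_ann_entry(entry, field_names):
    """Map one '|'-separated ANN entry onto the header-derived field names."""
    return dict(zip(field_names, entry.split("|")))


# Hypothetical ANN entry, invented for this example only.
EXAMPLE_ENTRY = (
    "A|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1.1|protein_coding"
    "|3/10|c.100G>A|p.Val34Ile|100/1000|100/900|34/300||"
)

names = ann_field_names(SNPEFF_HEADER)
record = parse_ann_entry(EXAMPLE_ENTRY, names)
```

With this approach a lookup such as `record["Feature_ID"]` would keep working even if a future annotator version reordered the subfields, at the cost of depending on the exact Description wording.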
So, here come two minimal testable files for snpeff and jannovar, on which tests could be based. But I'm not sure where a test would go?
Thanks for the test files. I'll write some minimal test that at least parses them and checks for errors. I was initially concerned with all the hardcoded positions, but that's how it's documented so that's how we have to parse it.
Sounds good to me, I'll delete the branch to keep the repo tidy.
Another little pull-request to enhance the ANN format field parsing:
varianttable.py: In the parsed ANN format field, add feature ID (e.g. transcript ID) and the nucleotide modified code to the displayed text in the igv-reports table. Especially the feature ID is more useful, if you can copy-paste it for further in-depth searches, which is not possible from the tooltip.
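As a rough illustration of the idea (not the actual varianttable.py code), the fixed-index approach could be sketched as follows. It assumes that "nucleotide modified code" refers to the HGVS.c subfield, and that Feature_ID and HGVS.c sit at 0-based positions 6 and 9 as in the header lines quoted earlier in this thread. The function name `ann_display_text` and the output format are hypothetical:

```python
# 0-based positions within one '|'-separated ANN entry, following the
# documented field order (Allele is position 0).
FEATURE_ID_INDEX = 6
HGVS_C_INDEX = 9


def ann_display_text(ann_entry):
    """Build copy-pastable table text from one ANN entry (hypothetical format)."""
    fields = ann_entry.split("|")
    if len(fields) <= HGVS_C_INDEX:
        # Entry is shorter than expected; fall back to showing it unchanged.
        return ann_entry
    parts = [fields[FEATURE_ID_INDEX].strip(), fields[HGVS_C_INDEX].strip()]
    # Skip empty subfields so the text has no stray separators.
    return " ".join(part for part in parts if part)


# Hypothetical ANN entry, invented for this example only.
example = (
    "A|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1.1|protein_coding"
    "|3/10|c.100G>A|p.Val34Ile|100/1000|100/900|34/300||"
)
text = ann_display_text(example)
```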