quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License

Problem creating database from vcf file with custom annotations #34

Closed marivalen closed 6 years ago

marivalen commented 6 years ago

Hi,

I have the following VCF file with an added custom annotation: reduced.txt. When I run it through vcf2db I get no errors, and all my fields are in the database as expected.

The problem is that in the above VCF the additional annotation is present only on the transcript that has the most harmful effect for each variant.

My original code adds the additional annotations to all the transcripts, because I do not know which one is the most harmful. When I run vcf2db on the VCF produced that way (allanno.txt), I get the following error:

Traceback (most recent call last):
  File "/home/marivalen/Documents/PhD_webpage/vcf2db/vcf2db.py", line 865, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/home/marivalen/Documents/PhD_webpage/vcf2db/vcf2db.py", line 219, in __init__
    self.load()
  File "/home/marivalen/Documents/PhD_webpage/vcf2db/vcf2db.py", line 285, in load
    i = self._load(self.cache, create=True, start=1)
  File "/home/marivalen/Documents/PhD_webpage/vcf2db/vcf2db.py", line 278, in _load
    self.insert(variants, expanded, keys, i, create=create)
  File "/home/marivalen/Documents/PhD_webpage/vcf2db/vcf2db.py", line 307, in insert
    v in variants)
  File "/home/marivalen/Documents/PhD_webpage/vcf2db/vcf2db.py", line 758, in gene_info
    top = geneimpacts.Effect.top_severity(impacts)
  File "/tmp/geneimpacts/geneimpacts/effect.py", line 402, in top_severity
    effects = sorted(effects)
  File "/usr/lib/python2.7/functools.py", line 60, in <lambda>
    ('__lt__', lambda self, other: self <= other and not self == other),
  File "/tmp/geneimpacts/geneimpacts/effect.py", line 367, in __le__
    if self.severity != other.severity:
  File "/tmp/geneimpacts/geneimpacts/effect.py", line 442, in severity
    for i, c in [(i, c) for i, c in enumerate(self.consequences) if not c in sev]:
  File "/tmp/geneimpacts/geneimpacts/effect.py", line 537, in consequences
    res = _cache[self.effects['Consequence']] = list(it.chain.from_iterable(x.split("+") for x in self.effects['Consequence'].split('&')))
KeyError: 'Consequence'

Is there a way to find out which transcript is the most harmful?

Or why can I not add the additional annotations to the other transcripts?

Kind Regards, Maria Marin

marivalen commented 6 years ago

I did some more testing today and narrowed the problem down to the |0,0,0| annotation that you can see in allanno.txt. Your code splits the CSQ field on ",", so the commas inside my annotation break the parsing. I will add a check that replaces the commas with another character, but maybe you have a better solution in code? If not, it might be worth noting in the documentation that a new annotation must not contain commas.
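The check described above could look roughly like the following sketch. The helper names `sanitize_csq_value` and `append_annotation` are hypothetical and are not part of vcf2db or geneimpacts. The choice of "&" as the replacement character is also an assumption; any character that is neither "," nor "|" would serve.

```python
def sanitize_csq_value(value, replacement="&"):
    """Return the value as text that is safe to place in one CSQ sub-field.

    Hypothetical helper: commas separate CSQ entries and "|" separates
    the sub-fields of an entry, so neither may appear inside a value.
    """
    text = str(value)
    if "|" in text:
        raise ValueError("custom annotation must not contain '|': %r" % text)
    return text.replace(",", replacement)


def append_annotation(csq_entry, value):
    """Append one custom sub-field to a single transcript's CSQ entry."""
    return csq_entry + "|" + sanitize_csq_value(value)


# Example: the "0,0,0" value from the issue becomes "0&0&0".
entry = append_annotation("A|missense_variant|GENE1", "0,0,0")
```

Whatever reads the database afterwards would need to split the value on the replacement character again to recover the three numbers.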

brentp commented 6 years ago

CSQ entries are joined by commas, so you cannot add a field inside the CSQ whose value itself contains commas.
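A simplified illustration of why this produces the KeyError above is sketched below. It is not the actual geneimpacts code. The key list and the `parse_csq` function are assumptions made for the example; in a real VCF the sub-field names come from the CSQ description in the header.

```python
# Hypothetical CSQ layout used only for this illustration.
CSQ_KEYS = ["Allele", "Consequence", "SYMBOL", "custom_score"]


def parse_csq(csq_value, keys=CSQ_KEYS):
    """Split a CSQ string into one dict per comma-separated entry."""
    return [dict(zip(keys, entry.split("|"))) for entry in csq_value.split(",")]


# Two well-formed transcript entries, joined by a comma.
good_entries = parse_csq("A|missense_variant|GENE1|0.5,A|intron_variant|GENE1|0.1")

# One transcript entry whose custom value contains commas.
bad_entries = parse_csq("A|missense_variant|GENE1|0,0,0")

# The stray "0" fragments are treated as extra transcripts that have
# no "Consequence" key, so looking that key up raises KeyError.
missing = [e for e in bad_entries if "Consequence" not in e]
```

In this sketch the single annotated transcript is parsed as three entries, and the last two contain only an "Allele" key.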