Returning SO terms from variant Validator searches

AliJ752 commented 3 years ago

We are looking to add a module that provides basic sequence ontology (SO) terms for VariantValidator. A user story and more detail are listed below:

User:
• Clinical Scientists
• Research Scientists
• Developers

Requirements/Value:
• To view sequence ontology terms for queried variants, adding interpretative comments about the variant and helping with pathology classification.
• Reduce time allocated to searching for sequence ontology annotations from links outside VariantValidator.
• Expand the existing utility of VariantValidator by returning further interpretive variant data.
• To be able to clone, pull and push development branches to/from the VariantValidator repository.

How?
• Expanding the query user interface with a Sequence Ontology type module.
• Query existing Ensembl APIs for sequence ontology information, alongside the existing search functions.
• Give VariantValidator developers access to our repository.

Istaisa commented 3 years ago

Summary of discussion from Naomi/Ali/Pete meeting:

We have considered getting these terms from the Ensembl API. This would allow us to collect all SO terms for a variant, but it depends on an external site being up and running.
The alternative is to create an initial, simple module that identifies only a few types of variant from HGVS nomenclature (for example: p.Arg50Ter --> stop_gained)
This could be using a string or a variant validator object as input
Output will be a json object that can be integrated further down the line
Hopefully this code will provide the basis for figuring out other SO types in the future
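As a rough illustration of the simple-module idea above, here is a minimal sketch that maps an HGVS protein substitution string to an SO term and returns a JSON string. The function name, the regular expression and the output keys are assumptions made for this example, not the module's actual code, and only a handful of substitution types are covered.

```python
import json
import re

# Hypothetical pattern: optional "NP_xxx.x:" prefix, then p.RefPosAlt with
# three-letter residues; the alternate may also be "*" or "=".
PROTEIN_PATTERN = re.compile(
    r"^(?:[A-Z]{2}_\d+\.\d+:)?p\.\(?([A-Za-z]{3})(\d+)([A-Za-z]{3}|\*|=)\)?$"
)
STOP_CODES = {"Ter", "*"}


def classify_protein_variant(description):
    """Return a JSON string holding the SO term for a protein substitution."""
    match = PROTEIN_PATTERN.match(description.strip())
    if match is None:
        raise ValueError(f"Not a supported protein description: {description!r}")
    ref, position, alt = match.groups()
    unchanged = alt == "=" or alt == ref

    if ref in STOP_CODES:
        # The reference residue is the stop codon itself.
        term = "stop_retained_variant" if unchanged or alt in STOP_CODES else "stop_lost"
    elif alt in STOP_CODES:
        term = "stop_gained"
    elif unchanged:
        term = "synonymous_variant"
    elif ref == "Met" and int(position) == 1:
        term = "start_lost"
    else:
        term = "missense_variant"
    return json.dumps({"variant": description, "SO_term": term})


if __name__ == "__main__":
    print(classify_protein_variant("p.Arg50Ter"))
```

Returning a JSON string keeps the sketch close to the "json object" output described above; a dict could be returned instead if the result is merged into a larger object later.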

Peter-J-Freeman commented 3 years ago

Agreed. This is a much more stable approach. We are looking to move away from hitting external APIs due to ongoing issues.

To begin working, clone the repo from the develop branch and create a new working branch.

git clone https://github.com/openvar/variantValidator.git -b develop

Then install VariantValidator

https://github.com/openvar/variantValidator/blob/master/docs/INSTALLATION.md Remember to install the developer options at the bottom. If you have any difficulties, let me know. We need to update the instructions for Windows.

Code up your module in the modules directory https://github.com/openvar/variantValidator/tree/master/VariantValidator/modules

Get in touch here if you get stuck or have any comments :)

Peter-J-Freeman commented 3 years ago

When you are installing, if you have not already done so, install the latest version of the MySQL database as detailed in the instructions. The clean database method sucks. I'm removing it.

Peter-J-Freeman commented 3 years ago

I also recommend using the remote version of UTA (Postgres) and not pulling SeqRepo. There is no point installing these since you do not need the speed. You just need it to work.

Istaisa commented 3 years ago

Update on Module

Overview

This module currently takes a user input in the form of an HGVS-formatted protein variant description. There are checks in place to ensure this string is formatted correctly, and the script will exit if it is not. The output of the script is a json object containing the sequence ontology term, in addition to extra information including the description and sequence ontology reference. This rendition of the module could be implemented for the SO terms it currently processes.

Edit: this module now also accepts a genome build and transcript variant, and fetches the protein variant from the Variant Validator API (Lines 95 to 140). Whichever final code is used to produce the protein variant should go here. Further variant sense checks have not been added in, as we expect this code will be replaced before integration.

Future Integration with Variant Validator

The current input function would be replaced by passing the formatted sequence from the variant validator object to the script. The output json would be incorporated into the final variant validator object.

Expansion to further SO terms

The information for each SO term is from the following reference: https://m.ensembl.org/info/genome/variation/prediction/predicted_data.html

While all information has been included in the dictionary, only term, accession, description and display term have been included in the output object. The impact ranking is left out because it refers to how drastically the transcript/protein structure changes, NOT the impact on disease. This means that a missense mutation with a "moderate" impact could actually be highly pathogenic, or indeed entirely benign. We therefore feel that including the impact ranking would not be helpful.

Further SO terms can be easily added to the reference dictionary in this function using Ensembl_reference.add_entry() in the same way the current terms have been added.
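To make the dictionary idea concrete, here is a minimal sketch of a reference store that keeps every field but leaves the impact ranking out of the output. The class name `SOReference`, the `add_entry` signature and the `output_for` helper are hypothetical stand-ins; the real `Ensembl_reference.add_entry()` may take different arguments. The descriptions are short paraphrases, not the Ensembl wording.

```python
class SOReference:
    """Hypothetical stand-in for the module's SO reference dictionary."""

    # Fields copied into the output object; "impact" is deliberately excluded.
    OUTPUT_FIELDS = ("term", "accession", "description", "display_term")

    def __init__(self):
        self._entries = {}

    def add_entry(self, term, accession, description, display_term, impact):
        """Store all fields for an SO term, including the impact ranking."""
        self._entries[term] = {
            "term": term,
            "accession": accession,
            "description": description,
            "display_term": display_term,
            "impact": impact,
        }

    def output_for(self, term):
        """Return only the fields intended for the output object."""
        entry = self._entries[term]
        return {field: entry[field] for field in self.OUTPUT_FIELDS}


reference = SOReference()
reference.add_entry(
    "stop_gained", "SO:0001587",
    "Variant that introduces a premature stop codon (paraphrased)",
    "Stop gained", "HIGH",
)
reference.add_entry(
    "missense_variant", "SO:0001583",
    "Variant that changes one amino acid, length preserved (paraphrased)",
    "Missense variant", "MODERATE",
)

if __name__ == "__main__":
    print(reference.output_for("missense_variant"))
```

Keeping the impact value in storage but filtering it at output time means it can be exposed later without re-entering the reference data.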

Expansion to nucleotide variants

Currently, this module accepts a protein variant description. The logical next step is to expand the module to handle nucleotide variants as well. Some initial ideas for this are as follows:

inframe deletion / inframe insertion - if, for an NM transcript description, there is a 'del', 'ins' or 'delins' whose net length change is a multiple of 3
frameshift - if, for an NM transcript description, there is a 'del', 'ins' or 'delins' whose net length change is not a multiple of 3
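The two rules above could be sketched as follows. This is an illustration only: the function name and regular expression are assumptions, and the sketch handles plain coding positions (no intronic offsets, UTR positions or 'dup'). For a 'delins' it uses the net length change, and returns None when the net change is zero because neither term applies.

```python
import re

# Hypothetical pattern: c.<start>[_<end>] followed by del, ins or delins and
# an optional base string. "delins" is listed first so it wins over "del".
INDEL_PATTERN = re.compile(r"c\.(\d+)(?:_(\d+))?(delins|del|ins)([ACGT]*)$")


def classify_indel(description):
    """Classify a coding del/ins/delins as inframe or frameshift."""
    match = INDEL_PATTERN.search(description.strip())
    if match is None:
        raise ValueError(f"Unsupported nucleotide description: {description!r}")
    start, end, kind, bases = match.groups()
    start = int(start)
    end = int(end) if end else start
    deleted = end - start + 1  # number of reference bases in the range

    if kind == "del":
        net_change = -deleted
    else:
        if not bases:
            raise ValueError("Inserted bases must be given for ins/delins")
        # For "ins" the range only flanks the insertion site, so nothing is deleted.
        net_change = len(bases) if kind == "ins" else len(bases) - deleted

    if net_change % 3 != 0:
        return "frameshift_variant"
    if net_change == 0:
        return None  # equal-length delins: neither inframe term applies
    return "inframe_insertion" if net_change > 0 else "inframe_deletion"


if __name__ == "__main__":
    for example in ("NM_000088.3:c.589_591del", "NM_000088.3:c.589_590insAC"):
        print(example, classify_indel(example))
```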

Some terms, such as start loss, would also be possible to implement from a nucleotide variant (if nucleotides 1-3 are ATG in the native transcript and are not ATG in the variant, although there is some debate about the existence of non-ATG start codons).

There are also some terms which would be interesting, but which cannot be assigned from string parsing alone and would require integration with further modules. For example, splice site variants are defined by Ensembl as follows: _A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron_ If a module existed which defined the exon boundaries for each transcript, which is in progress here, the splice site variant sequence ontology term could be assigned.
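Assuming exon boundaries were available, the Ensembl definition quoted above could be checked along these lines. This is a sketch under stated assumptions: the function name is hypothetical, exons are given as inclusive (start, end) pairs sorted along a single coordinate axis, and the outer ends of the first and last exon are not treated as splice junctions.

```python
def is_splice_region(position, exons):
    """Return True if position lies within 1-3 bases of the exon side or
    3-8 bases of the intron side of an exon/intron junction.

    exons: list of inclusive (start, end) tuples, sorted ascending
    (hypothetical input format for this sketch).
    """
    last_index = len(exons) - 1
    for index, (start, end) in enumerate(exons):
        # The start of every exon except the first borders an intron.
        if index > 0:
            if start <= position <= start + 2:      # exonic bases 1-3
                return True
            if start - 8 <= position <= start - 3:  # intronic bases 3-8
                return True
        # The end of every exon except the last borders an intron.
        if index < last_index:
            if end - 2 <= position <= end:          # exonic bases 1-3
                return True
            if end + 3 <= position <= end + 8:      # intronic bases 3-8
                return True
    return False


if __name__ == "__main__":
    example_exons = [(100, 200), (300, 400)]
    print(is_splice_region(199, example_exons))
    print(is_splice_region(150, example_exons))
```

Intronic bases 1-2 are excluded here because the quoted definition starts at base 3 of the intron.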

Issues

One issue we are aware of is that, in the current check for correct variant formatting, it is technically possible to have a "correctly" formatted variant with other words in place of the amino acid codes (for example, NP_000079.2:p.(Moose197Fish)), which would still be identified as a missense variant. This could be solved by matching the amino acid codes against a database of acceptable one and three letter codes, instead of the current a-zA-Z|* solution. However, given this string input is going to be replaced with an already-formatted object from variant validator before the code is deployed, we believe the current formatting check is sufficient for testing.
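The stricter check suggested above might look like the following sketch. The names and the regular expression are assumptions for illustration; the code sets cover the 20 standard amino acids plus the stop codes, and less common codes (for example Sec or Xaa) would need to be added if required.

```python
import re

THREE_LETTER_CODES = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}
ONE_LETTER_CODES = set("ACDEFGHIKLMNPQRSTVWY")
STOP_CODES = {"Ter", "*"}
VALID_CODES = THREE_LETTER_CODES | ONE_LETTER_CODES | STOP_CODES

# Hypothetical pattern for a simple substitution: p.(RefPosAlt) or p.RefPosAlt.
SUBSTITUTION_PATTERN = re.compile(r"p\.\(?([A-Za-z]+|\*)(\d+)([A-Za-z]+|\*)\)?$")


def has_valid_residues(description):
    """Return True only if both residues are recognised amino acid codes."""
    match = SUBSTITUTION_PATTERN.search(description.strip())
    if match is None:
        return False
    ref, _position, alt = match.groups()
    return ref in VALID_CODES and alt in VALID_CODES


if __name__ == "__main__":
    print(has_valid_residues("NP_000079.2:p.(Moose197Fish)"))
    print(has_valid_residues("NP_000079.2:p.(Arg197Ter)"))
```

A set lookup is used rather than a longer regular expression so the list of accepted codes stays easy to read and extend.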

openvar / variantValidator

Returning SO terms from variant Validator searches #257
