monarch-initiative / GENO-ontology

Repository for representing genotypes and their association with phenotypes

Clarification on "gene allele" http://purl.obolibrary.org/obo/GENO_0000014 #34

Open nataled opened 7 years ago

nataled commented 7 years ago

The Protein Ontology has been tasked with taking over the protein-related terms from MRO (Major Histocompatibility Complex (MHC) restriction ontology). The MRO terms are often defined with respect to a locus, which can include multiple syntenic genes. For example, HLA-A, HLA-B, and HLA-C are all genes located at the HLA locus (human); H2-Q1 through H2-Q15 are all genes located at the H2-Q locus (mouse). A search for the term "locus" (or related) brought me to your term "gene allele" (synonym: "gene locus").

I was pleased to see the term defined as I would expect for a locus--that is (paraphrasing) with respect to position. However, the placement in the GENO hierarchy (viewed after reasoning) is odd, as it appears to be tied (indirectly) to sequence (since alleles are, ultimately, sequence-based). The hierarchy, in contrast to the definition, would make this term equivalent in meaning to "an allele of a gene" (as opposed to, say, an allele of a nucleotide, to use an example from the parent term "allele") and not a locus, per se. It thus appears that the logical definitions (equivalencies) are in conflict with the text definition.

The simple fix, taking the shortest route (so to speak), would be to revise the text definition to reflect the logical one, maintaining its position in the hierarchy. Then, a new term "locus" would be minted using the current text definition of gene allele (however, see notes below).

I would expect to use this new locus term along the lines of the following:

encoded_by found_within (the relations are placeholders, but I trust their intent is understood)

Here I'm using locus in the sense of the text definition. It's possible to make child terms specifying things like gene locus (the place where a specific gene is typically found), etc.

Note 1: It isn't clear to me what "canonical allele" is. (1) By label I would say it's the same as reference allele, (2) by definition it seems more like the aforementioned locus I requested, and (3) by comment it would seem its child terms represent the bearers of the features given under sequence feature. If (2) is the case, I would not expect allele to be subclassed here. I'm hoping that (3) is correct, as it would mean this term gives clarity to the issue plaguing SO, which conflates information content entities with material entities. The comment within "sequence feature" lends credence to the notion that "canonical allele" is the "biological sequence" mentioned, and thus (3) pertains.

Note 2: Based on other considerations (such as explanations given in child terms), it would seem "gene allele" is serving as three entities: an allele of a gene, the locus of a gene, and a locus in general.

mbrush commented 7 years ago

Hi @nataled. Formulating some thoughts here - will reply soon. Thanks for the question/feedback!

mbrush commented 6 years ago

Hi @nataled - thanks for your patience, and pardon my long response below . . . once I got going I had a lot to say here!

Thoughts on GENO:

First, regarding 'canonical allele', please ignore this term altogether. It is based on a concept from the ClinGen Allele model, and was added to GENO simply to provide an ontological identifier for this concept to support data model integration. But logically, it should be ignored as it is not yet clear how it relates to other concepts in GENO. The problem is that its cursory logical definition (="variant OR allele") had the unintentional consequence of this class subsuming other core GENO classes in the inferred hierarchy (e.g. allele, gene allele). I have since removed this logical definition to avoid such subsumptions, so please revisit this in GENO. Reasoning now yields a childless 'canonical allele' class that you should just ignore for now.

Second, the timing on the 'locus'-related question is wonderful, as I am in the process of clarifying the use of the term 'locus' across GENO. As you may well know, the word "locus" can be problematic due to its varied meaning and use - it can refer to a location in a genome, or to an extent of sequence present at a defined location in the genome. While this may be an acceptable conflation in scientific discourse, the distinction is important when modeling terms in a formal ontology.

In GENO, we had originally used 'locus' in the latter sense above ("an extent of sequence present at a defined location"), but this proved confusing for some users. I have just finished updating all labels in GENO to eliminate this use of 'locus'. For example, I replaced the label 'genomic locus' with 'genomic feature', and the label 'gene locus' with 'gene allele'. Any remaining uses of 'locus' in GENO should now describe a location in the genome, rather than an extent of sequence in the genome. In most cases, I now use 'feature' (sensu Sequence Ontology) rather than 'locus' to refer to extents of sequence identified by their position w.r.t some reference genome.

Given these updates and improvements to GENO, I would recommend you take a fresh look at GENO and revisit the terms you mention above. I hope that you will find things to make more sense now - but happy to chat more if not. Ultimately we want GENO to be clear and usable for a variety of use cases, and are happy to evolve and refine it as needed to be maximally useful.

A final note about GENO: I have yet to implement a term representing this clarified concept of a 'genomic locus' as a location in the genome. But I probably should, to be clear and direct about what we mean when we use the word 'locus' in the definitions or descriptions of other GENO classes. I will work on this and alert you when it has been implemented. I will likewise define a class for 'gene locus', which again will describe the genomic location where a gene is typically found. This is in contrast to the notion of "the sequence at the location where a gene is typically found", which is what we use the term "gene allele" to denote. The relationship between a gene allele and a gene locus is that the allele "occupies" or "is_located_at" the locus. So in GENO we think of the 'locus' as a genomic address, and the 'allele' as the sequence feature that occupies this address.
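
To make the address/occupant distinction concrete, here is a rough sketch in Python. It is only an illustration: the class names, field names, and the `same_locus` helper are hypothetical and are not GENO terms or any existing API, and the reference name, chromosome, coordinates, and sequences are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicLocus:
    """A location (an 'address') in a reference genome. Holds no sequence."""
    reference: str    # placeholder name for a reference genome
    chromosome: str
    start: int
    end: int


@dataclass(frozen=True)
class GeneAllele:
    """An extent of sequence (a 'feature') that occupies a locus."""
    label: str
    sequence: str
    occupies: GenomicLocus   # stands in for "occupies" / "is_located_at"


def same_locus(a: GeneAllele, b: GeneAllele) -> bool:
    """Two alleles are alternatives when they occupy the same address."""
    return a.occupies == b.occupies


# Placeholder values, chosen only for illustration.
locus = GenomicLocus("example-ref", "chrN", 100, 104)
allele_1 = GeneAllele("allele 1", "ACGT", locus)
allele_2 = GeneAllele("allele 2", "ACCT", locus)
```

Here the two alleles differ in sequence but share one locus, which is the sense in which the locus is the address and the allele is what sits at it.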


Thoughts on the MRO:

I also took a peek at the MRO, and have a couple of questions and suggestions for how GENO might align or be used here. First, classes like 'HLA-A locus' seem to refer to what GENO would call the 'HLA-A gene'. There are no definitions for these classes in the MRO, but based on the definition of the 'MHC locus' root class ("region of a chromosome that codes for MHC molecules"), I would surmise that the 'HLA-A locus' class is "the region of the chromosome that codes for the HLA-A protein". This would make it equivalent to what GENO would call the 'HLA-A gene'. So I would advocate for calling these classes 'genes' instead of 'loci'. This would be internally consistent, align well with the terminology of GENO and SO, and avoid confusion caused by different uses of the term 'locus'. That said, I am not an expert in MHC biology or nomenclature, so there may be a good reason they are using 'locus' here instead of 'gene'.

With this disclaimer in mind, I would make the following recommendations for the MRO:

  1. Use the GENO 'genomic feature' class in place of the REO 'genetic locus' currently used by MRO as the parent of 'MHC locus' (the REO term is a placeholder and not really in scope for a reagent ontology)
  2. Re-label the MRO 'MHC locus' class to be called 'MHC locus gene' or 'MHC gene' - since descendants of this term seem to be genes located within the MHC locus.
  3. Update labels of all specific MHC loci to be called 'genes' (e.g. 'HLA-A locus'-->'HLA-A gene'). This is consistent with my understanding that MRO aims to describe the sequence regions that are transcribed/translated to produce MHC proteins, and not the 'locations' in coordinate space where these genes reside. This approach would be consistent with GENO from a modeling and terminological perspective.
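
The relabeling in recommendations 2 and 3 amounts to a simple suffix swap on class labels. A minimal sketch follows, assuming the labels are available as plain strings; the function name `relabel_locus_to_gene` is hypothetical and this is not how MRO is actually edited or built.

```python
def relabel_locus_to_gene(label: str) -> str:
    """Swap a trailing ' locus' for ' gene'; leave other labels unchanged."""
    suffix = " locus"
    if label.endswith(suffix):
        return label[: -len(suffix)] + " gene"
    return label


# Labels taken from the discussion above.
for old in ["HLA-A locus", "MHC locus"]:
    print(old, "-->", relabel_locus_to_gene(old))
```

For 'MHC locus' this produces 'MHC gene', the second of the two label options suggested in recommendation 2.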

Hope this helped, and I am happy to provide help or additional feedback on your efforts to align/integrate the MRO and PRO. Our group has done a lot of modeling in these areas across the different OBOs we have contributed to, and we are keenly interested in harmonizing representations where possible. This seems like an area where a little bit of collaboration could go a long way toward a more interoperable set of ontologies.

mbrush commented 6 years ago

@nataled just following up on this to see if you had further questions or requests w.r.t. GENO. Happy to work with you to make GENO address your needs. Thanks!

nataled commented 6 years ago

I have been caught up in a major project that diverted and absorbed my attention for quite some time. I'm also awaiting some feedback on other issues related to MRO which might affect this issue. I hope to get back to this soon.