monarch-initiative / omim

Data ingest pipeline for OMIM.

GeneDiseaseAssociation predicates #79

Open joeflack4 opened 1 year ago

joeflack4 commented 1 year ago

Overview

Right now, we are using the predicate RO:0003303 (_causes condition_) to represent all gene::disease associations (i.e. all rows in morbidmap.txt). I don't think this was our intention; I think this is simply where Dazhi left it about a year ago, and I'm only realizing it now.

Courses of action

Sabrina has a Google doc with a great collection of predicates we can consider for this: https://docs.google.com/spreadsheets/d/1bzxGT7vqQNhUHhe3vcOv-5JDGZg5izS5D7J2M2zYBxs/edit#gid=1090676757

Nearly every entry in morbidmap.txt has 1 mapping key. I believe there are rare edge cases of no mapping key, and I found 1 instance where there are 2 mapping keys (#81). The mapping keys come with descriptions (found in a comment at the bottom of morbidmap.txt) which we can use to help us determine better predicates:

# 1 - The disorder is placed on the map based on its association with a gene, but the underlying defect is not known.
# 2 - The disorder has been placed on the map by linkage or other statistical method; no mutation has been found.
# 3 - The molecular basis for the disorder is known; a mutation has been found in the gene.
# 4 - A contiguous gene deletion or duplication syndrome, multiple genes are deleted or duplicated causing the phenotype.

Additional info

Here's how we're currently representing gene::disease associations in the latest release (the structure of this will be greatly changed very soon).

OMIM:100678 a owl:Class ;
    rdfs:label "ACAT2" ;
    RO:0002525 CHR:9606chr6q25.3 ;
    ...
    RO:0003303 OMIM:614055 ;
    biolink:category biolink:Gene ;
    biolink:has_evidence "The disorder is placed on the map based on its association with a gene, but the underlying defect is not known." .

Some example rows from morbidmap.txt (I put one row for each of the 4 mapping keys).

Phenotype   Gene Symbols    MIM Number  Cyto Location
?ACAT2 deficiency, 614055 (1)   ACAT2   100678  6q25.3
?Anal canal carcinoma (2)   ANC 105580  11q22-qter
17,20-lyase deficiency, isolated, 202110 (3)    CYP17A1, CYP17, P450C17 609300  10q24.32
?Pain sensitivity QTL1 (4)  PAINQTL1    618377  1p33
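To make the row layout concrete, here is a minimal sketch of how the Phenotype column could be split into a label, an optional phenotype MIM number, and the mapping key. The names `PHENOTYPE_RE`, `ParsedPhenotype`, and `parse_phenotype_field` are hypothetical and not taken from the pipeline. The pattern assumes a 6-digit MIM number and exactly one single-digit mapping key in parentheses, so the rare rows with no key or two keys (see #81) would raise an error and need separate handling.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: label, then an optional ", <6-digit MIM>", then "(<key>)".
PHENOTYPE_RE = re.compile(
    r'^(?P<label>.*?)'            # phenotype label (may itself contain commas)
    r'(?:,\s*(?P<mim>\d{6}))?'    # optional phenotype MIM number
    r'\s*\((?P<key>\d)\)\s*$'     # mapping key in parentheses at the end
)


class ParsedPhenotype(NamedTuple):
    label: str
    phenotype_mim: Optional[str]
    mapping_key: str


def parse_phenotype_field(field: str) -> ParsedPhenotype:
    """Split one Phenotype cell from morbidmap.txt into its parts."""
    match = PHENOTYPE_RE.match(field.strip())
    if match is None:
        # Covers the edge cases with zero or multiple mapping keys.
        raise ValueError(f'Unrecognized phenotype field: {field!r}')
    return ParsedPhenotype(match['label'], match['mim'], match['key'])


if __name__ == '__main__':
    print(parse_phenotype_field('17,20-lyase deficiency, isolated, 202110 (3)'))
```

Note that a row such as the anal canal carcinoma example has no phenotype MIM number, so `phenotype_mim` comes back as `None` there.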


joeflack4 commented 1 year ago

@sabrinatoro I saw that in #75 you mentioned that:

The curly brackets means this is a susceptibility term.

Example:

Phenotype   Gene Symbols    MIM Number  Cyto Location
{Type 2 diabetes mellitus, susceptibility to}, 125853 (3)   GPD2    138430  2q24.1

Just wanted to add a note about that here. I'm assuming that one of the predicates in your RO spreadsheet might be appropriate for susceptibility.
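As a rough illustration of the curly-bracket convention, a check like the following could flag susceptibility rows before a predicate is chosen. The function name `is_susceptibility` is hypothetical, and the sketch assumes the opening brace is the first non-whitespace character of the Phenotype cell, as in the example row above.

```python
def is_susceptibility(phenotype_field: str) -> bool:
    """Return True if the Phenotype cell starts with a curly bracket."""
    return phenotype_field.lstrip().startswith('{')


if __name__ == '__main__':
    row = '{Type 2 diabetes mellitus, susceptibility to}, 125853 (3)'
    print(is_susceptibility(row))
```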

sabrinatoro commented 1 year ago

I think that each mapping key should be associated with a specific relation as they mean something different:

1 - The disorder is placed on the map based on its association with a gene, but the underlying defect is not known.

I don't think these represent actual diseases. All the examples I checked represent loci and are excluded. I think we can ignore them.

2 - The disorder has been placed on the map by linkage or other statistical method; no mutation has been found.

This means that we know that a gene is associated with a disease, but we don't know which variation/mutation is responsible for it. I suggest we use: RO_0003303 (causes condition). Note: this is a gene2disease relation; we might need a reciprocal one for disease2gene. See the caveat below regarding multiple genes.

3 - The molecular basis for the disorder is known; a mutation has been found in the gene.

This means that we know that a gene is associated with a disease AND we know which variation/mutation is responsible. For OMIM, I suggest we use: RO_0004013 (is causal germline mutation in). Note: this is a gene2disease relation; we might need a reciprocal one for disease2gene. See the caveat below regarding multiple genes.

4 - A contiguous gene deletion or duplication syndrome, multiple genes are deleted or duplicated causing the phenotype.

This means that a disease was associated with multiple genes because the variation associated with the disease is a huge deletion/duplication involving multiple genes, and we don't know which one is the causal one. I suggest we use: RO_0003304 (contributes to condition).

Important caveats

We should review this as a group and come to an agreement before doing anything. I will review some examples.

joeflack4 commented 1 year ago

This is really awesome, and even more than I had hoped for. You mentioned reviewing before doing anything. I'm already so far in with adding these predicates (along with other changes in this pr) that I would hate to go back to using just plain RO_0003303 for everything like we were doing before.

Until I come back from Thanksgiving, I'm going to leave that PR in a state where it simply uses these predicates for each of the mapping keys (per your defaults above):

MORBIDMAP_PHENOTYPE_MAPPING_KEY_PREDICATES = {
    '1': None,  # these will be skipped
    '2': RO['0003303'],
    '3': RO['0004013'],
    '4': RO['0003304'],
}

I realize there's a lot of nuance and complexity here, so I know I may have to change this before my PR is merged, or afterward.
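For illustration only, here is a sketch of how a per-mapping-key predicate table like the one above might be applied when emitting gene-to-disease triples. It is not the code from the PR: `MAPPING_KEY_PREDICATES` uses plain CURIE strings as a stand-in for the `RO[...]` namespace lookups, and `gene_disease_triples` is a hypothetical helper. It also assumes each row has already been reduced to a gene MIM number, a mapping key, and a phenotype MIM number.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Stand-in for the RO namespace object: plain CURIE strings keep this
# sketch dependency-free. The pipeline's actual RO object may differ.
MAPPING_KEY_PREDICATES = {
    '1': None,          # skipped, per the proposal in this thread
    '2': 'RO:0003303',  # causes condition
    '3': 'RO:0004013',  # is causal germline mutation in
    '4': 'RO:0003304',  # contributes to condition
}

# (gene MIM number, mapping key, phenotype MIM number)
Association = Tuple[str, str, str]


def gene_disease_triples(rows: Iterable[Association]) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) CURIE triples for mapped rows."""
    for gene_mim, mapping_key, phenotype_mim in rows:
        predicate: Optional[str] = MAPPING_KEY_PREDICATES.get(mapping_key)
        if predicate is None:
            # Mapping key 1, or a key not in the table: emit nothing.
            continue
        yield (f'OMIM:{gene_mim}', predicate, f'OMIM:{phenotype_mim}')


if __name__ == '__main__':
    example_rows = [
        ('100678', '1', '614055'),  # ACAT2 row: skipped
        ('609300', '3', '202110'),  # CYP17A1 row
    ]
    for triple in gene_disease_triples(example_rows):
        print(triple)
```

Skipping unknown keys silently is a design choice made for brevity here; logging them might be preferable so the rows with zero or two mapping keys are not lost unnoticed.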