Open joeflack4 opened 1 year ago
@sabrinatoro I saw that in #75 you mentioned that:
The curly brackets means this is a susceptibility term.
Example:
Phenotype Gene Symbols MIM Number Cyto Location
{Type 2 diabetes mellitus, susceptibility to}, 125853 (3) GPD2 138430 2q24.1
Just wanted to add a note about that here. I'm assuming that one of the predicates in your RO spreadsheet might be appropriate for susceptibility.
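For reference, here is a minimal sketch of how the Phenotype column could be split into its parts. It assumes the layout seen in the example row: a label, an optional 6-digit phenotype MIM number, and the mapping key in parentheses. `parse_phenotype` and the `Phenotype` dataclass are hypothetical names for illustration, not existing code in this repo, and the "?" handling follows the provisional-entry convention mentioned elsewhere in this thread.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout: "<label>[, <6-digit MIM number>] (<mapping key>)"
_PHENOTYPE_RE = re.compile(
    r"^(?P<label>.*?)(?:,\s*(?P<mim>\d{6}))?\s*\((?P<key>\d)\)\s*$"
)


@dataclass
class Phenotype:
    label: str
    mim_number: Optional[str]
    mapping_key: str
    susceptibility: bool  # label was wrapped in curly brackets
    provisional: bool     # label was prefixed with "?"


def parse_phenotype(field: str) -> Phenotype:
    """Split one Phenotype cell into label, MIM number, key, and flags."""
    match = _PHENOTYPE_RE.match(field.strip())
    if match is None:
        raise ValueError(f"Unrecognized phenotype field: {field!r}")
    label = match.group("label").strip()

    provisional = False
    if label.startswith("?"):
        provisional, label = True, label[1:].strip()
    susceptibility = label.startswith("{") and label.endswith("}")
    if susceptibility:
        label = label[1:-1].strip()
    # Also accept a "?" that sits inside the curly brackets.
    if label.startswith("?"):
        provisional, label = True, label[1:].strip()

    return Phenotype(
        label=label,
        mim_number=match.group("mim"),
        mapping_key=match.group("key"),
        susceptibility=susceptibility,
        provisional=provisional,
    )
```

With something like this, the susceptibility flag would be available when picking a predicate, independently of the mapping key.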
I think that each mapping key should be associated with a specific relation as they mean something different:
1 - The disorder is placed on the map based on its association with a gene, but the underlying defect is not known.
I don't think these represent actual diseases. All the examples I checked represent loci and are excluded. I think we can ignore them.
2 - The disorder has been placed on the map by linkage or other statistical method; no mutation has been found.
This means that we know that a gene is associated with a disease, but we don't know what variation/mutation is responsible for this disease. I suggest we use: RO_0003303 (causes condition). Note: this is a gene2disease relation; we might need a reciprocal one for disease2gene. !!! See the caveat below regarding multiple genes.
3 - The molecular basis for the disorder is known; a mutation has been found in the gene.
This means that we know that a gene is associated with a disease AND we know what variation/mutation is responsible. For OMIM, I suggest we use: RO_0004013 (is causal germline mutation in). Note: this is a gene2disease relation; we might need a reciprocal one for disease2gene. !!! See the caveat below regarding multiple genes.
4 - A contiguous gene deletion or duplication syndrome, multiple genes are deleted or duplicated causing the phenotype.
This means that a disease was associated with multiple genes because the variation associated with the disease is a huge deletion/duplication involving multiple genes, and we don't know which one is the causal one. I suggest we use: RO_0003304 (contributes to condition).
!!! Important caveats
Some entries in the "gene symbols" column represent variations, and not actual genes. For example:
46XY sex reversal 4 (4) "DEL9p24.3, C9DELp24.3, SRXY4" 154230 9p24.3
DEL9p24.3, C9DELp24.3 are variations
SRXY4 is a gene
The relations for 2 and 3 are only for the cases when there is only 1 gene associated with the disease. If there is more than one gene associated with the disease, the suggested relation to use is either RO_0004016 (is causal germline mutation partially giving rise to) if we KNOW that the disease is digenic/oligogenic, or RO_0003304 (contributes to condition) otherwise.
A "?" before the phenotype name indicates that the relationship between the phenotype and gene is provisional. In these cases, we should probably skip the row and not curate the gene-disease relation.
We should review this as a group and come to an agreement before doing anything. I will review some examples.
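To make the proposal easier to review, here is a rough sketch of how the suggestions and caveats above could be combined into a single lookup. It is only an illustration under assumptions: the function name `choose_predicate` and its parameters (`gene_count`, `known_oligogenic`, `provisional`) are hypothetical, `gene_count` is assumed to count actual genes only (variation entries already excluded), and the CURIE strings stand in for however the codebase represents RO terms. None of this is agreed yet.

```python
from typing import Optional

# RO terms as suggested in this thread (plain CURIE strings for illustration).
CAUSES_CONDITION = "RO:0003303"
CAUSAL_GERMLINE_MUTATION_IN = "RO:0004013"
CONTRIBUTES_TO_CONDITION = "RO:0003304"
CAUSAL_GERMLINE_MUTATION_PARTIALLY_GIVING_RISE_TO = "RO:0004016"


def choose_predicate(
    mapping_key: str,
    gene_count: int,
    provisional: bool = False,
    known_oligogenic: bool = False,
) -> Optional[str]:
    """Return the suggested predicate, or None if the row should be skipped."""
    # Provisional ("?") rows and mapping key 1 are not curated.
    if provisional or mapping_key == "1":
        return None
    # Key 4: contiguous gene deletion/duplication, causal gene unknown.
    if mapping_key == "4":
        return CONTRIBUTES_TO_CONDITION
    if mapping_key in ("2", "3"):
        # Caveat: with several genes, no single gene is "the" cause.
        if gene_count > 1:
            if known_oligogenic:
                return CAUSAL_GERMLINE_MUTATION_PARTIALLY_GIVING_RISE_TO
            return CONTRIBUTES_TO_CONDITION
        if mapping_key == "2":
            return CAUSES_CONDITION
        return CAUSAL_GERMLINE_MUTATION_IN
    raise ValueError(f"Unexpected mapping key: {mapping_key!r}")
```

For example, `choose_predicate("3", gene_count=1)` would give `RO:0004013`, while the same key with two genes and no evidence of digenic/oligogenic inheritance would fall back to `RO:0003304`.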
This is really awesome, and even more than I had hoped for. You mentioned reviewing before doing anything. I'm already so far in with adding these predicates (along with other changes in this PR) that I would hate to go back to using just plain RO_0003303 for everything like we were doing before.
Until I come back from Thanksgiving, I'm going to leave that PR in a state where it simply uses these predicates for each of the mapping keys (per your defaults above):
MORBIDMAP_PHENOTYPE_MAPPING_KEY_PREDICATES = {
    '1': None,  # these will be skipped
    '2': RO['0003303'],
    '3': RO['0004013'],
    '4': RO['0003304'],
}
I realize there's a ton of nuance and more complexity here, so I know I may have to change this before my PR is merged, or at some point afterward.
Overview
Right now, we are using the predicate RO:0003303 (_causes condition_) to represent all gene::disease associations (i.e. all rows in morbidmap.txt). I don't think this was our intention; I just think this is where Dazhi left it about a year ago and I'm just realizing this now.
Courses of action
Sabrina has a Google doc with a great collection of predicates we can consider for this: https://docs.google.com/spreadsheets/d/1bzxGT7vqQNhUHhe3vcOv-5JDGZg5izS5D7J2M2zYBxs/edit#gid=1090676757
Nearly every entry in morbidmap.txt has 1 mapping key. I believe there are rare edge cases of no mapping key, and I found 1 instance where there are 2 mapping keys (#81). The mapping keys come with descriptions (found in a comment at the bottom of morbidmap.txt) which we can use to help us determine better predicates:
Additional info
Here's how we're currently representing gene::disease associations in the latest release (the structure of this will be greatly changed very soon).
Some example rows from morbidmap.txt (I put one row for each of the 4 mapping keys).
Related
77