Closed (pnrobinson closed 4 years ago)
This looks like it may be an integration bug: HPO calls OMIM:107680 a disease, while we call it a gene and merge it with HGNC:600. The pathway inference hops over direct gene-to-phenotype associations, which we typically have only for nonhuman data and not for humans. In this case, though, there is a direct edge because of the gene/disease node mixup.
AFAIK HPO does not list this as a disease (e.g. https://hpo.jax.org/app/browse/gene/335). Where are you seeing this?
When I go to https://hpo.jax.org/app/browse/term/HP:0004398, in the disease column I see OMIM:107680, which goes to https://hpo.jax.org/app/browse/disease/OMIM:107680
That is a bug, thanks for pointing it out. @iimpulse let's touch base about this!
Here are 23 IDs for which we have an unexpected direct gene-phenotype association. I'm not certain this comes from our HPO ingest, but it might be worth checking these as well:
OMIM:610271
OMIM:141900
OMIM:109270
OMIM:300897
OMIM:141800
OMIM:107680
OMIM:177400
OMIM:182870
OMIM:159555
OMIM:187395
OMIM:114835
OMIM:600522
OMIM:400048
OMIM:147892
OMIM:138300
OMIM:142000
OMIM:152200
OMIM:151430
OMIM:116790
OMIM:124060
OMIM:132810
OMIM:168820
OMIM:173470
There appears to be a bug on the HPO website as well that affects these items. I will try to investigate next week with MG.
Great! I added a QC check for this as well, so I can let you know if it pops up again after it's fixed.
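For readers wondering what such a QC check might look like, here is a minimal sketch. It is not the check that was actually added; the `Association` record, its field names, and the `"gene"`/`"disease"` type labels are all hypothetical. It flags any human subject typed as a gene that carries a direct phenotype edge, since those are expected only for nonhuman data.

```python
from typing import Iterable, List, NamedTuple


class Association(NamedTuple):
    """Hypothetical record for one subject-to-phenotype edge."""
    subject: str       # CURIE of the annotated entity
    subject_type: str  # assumed labels: "gene" or "disease"
    phenotype: str     # phenotype term CURIE
    taxon: str         # NCBITaxon CURIE of the subject


HUMAN = "NCBITaxon:9606"


def unexpected_gene_phenotype(assocs: Iterable[Association]) -> List[str]:
    """Return sorted, de-duplicated IDs of human genes with a direct phenotype edge."""
    flagged = {
        a.subject
        for a in assocs
        if a.subject_type == "gene" and a.taxon == HUMAN
    }
    return sorted(flagged)
```

An empty result would mean the check passes; any ID returned is a candidate gene/disease mixup to review by hand.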
@kshefchek Thanks for picking this up. It seems to be related to some changes upstream, with phenotypes from '+' entries migrating to '#' entries. I have started to correct this and it should be taken care of by next week.
https://github.com/monarch-initiative/hpo-annotation-data/issues/421
Of note for the MonarchUI -- it does not seem that we are taking the OMIM '+' entries into account as diseases (the '+' entries describe genes and diseases simultaneously), and these are actually valid (although often difficult) entries. Also, there were two entries that did not seem to be erroneous in the small files, please check
This is good to know and I wonder if dipper is affected by this change as well (@TomConlin)
Looks like I made a mistake on two of them: OMIM:300897 should have been ORPHA:85283, and OMIM:400048 should have been OMIM:400003.
It's been a while since I've looked at OMIM types, but glancing at it I think you're correct, we treat + entries more like genes than diseases
I have fixed everything that could be easily fixed. There are three entries that might need to go onto our omit list, but they are borderline cases that I will make issues for so people can chime in. The new version of phenotype.hpoa is being made as we speak and should have these corrections.
https://github.com/monarch-initiative/dipper/blob/master/dipper/sources/OMIMSource.py#L194
Plus (+) is typed as 'has_affected_feature' ("GENO:0000418").
If it should be something else, please let me know what that is.
Octothorpe (#) is typed as Phenotype.
https://github.com/monarch-initiative/dipper/blob/master/dipper/sources/OMIMSource.py#L189
We do maintain the list of obsolete/previous omim-numbers so if they split we can return both new types
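To make the typing described above concrete, here is a rough sketch of a prefix-to-type lookup. It is not the code in OMIMSource.py; the dictionary, the function name, and the input format (an entry number with its leading symbol) are assumptions for illustration. Only the three symbols mentioned in this thread are covered.

```python
from typing import Optional

# Type labels as described in this thread; other OMIM symbols are left out.
OMIM_PREFIX_TYPE = {
    "*": "gene",                  # asterisk: the more exact gene designation
    "+": "has_affected_feature",  # plus: gene and phenotype (GENO:0000418)
    "#": "Phenotype",             # octothorpe
}


def type_for_entry(prefixed_number: str) -> Optional[str]:
    """Return the type label for an entry such as '+107680', or None if unknown."""
    if not prefixed_number:
        return None
    return OMIM_PREFIX_TYPE.get(prefixed_number[0])
```

Returning `None` for unrecognized symbols leaves it to each ingest to decide how to handle them.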
I think we type these as genes, see https://github.com/monarch-initiative/dipper/pull/725#pullrequestreview-232806448
typing them as 'has_affected_feature' wouldn't make sense because this is a predicate/object property.
Will try again:
Which term for a "type", one not already taken by another conditional, would be better than 'has_affected_feature'? Please note it should not be 'gene' (here), as that is taken by the more exact designation asterisk (*), and conflating the two is a choice to be made downstream from here.
Recall these type designations are shared in half a dozen ingests which might differ in the flavor of how the designation is interpreted.
As OMIM describes it, it's a union type of a phenotype (disease in Monarch land) and a gene: "A plus sign (+) before an entry number indicates that the entry contains the description of a gene of known sequence and a phenotype."
RDF supports this, we just type it as both. However, this gets trickier from an application perspective where we do not want an identifier to represent both a gene and a disease. In the sample we picked these looked more like genes so we type them as genes in our RDF model iirc.
Typing them in the code as 'has_affected_feature' is fine, and I can't think of anything better, as long as this typing doesn't make it into the RDF model, which I'm pretty certain it doesn't.
I never liked the term, as it smacks of being weasel-wordy, but it must have seemed the least confusing of the preexisting usages.
we could make up a new label in the global translation, such as globaltt['omim_phenotype_and_gene'] that resolves to the SO term for a gene.
No.
Made up terms which do not resolve somewhere official along with their description, structure and additional information are an abomination.
Get someone to put it in OLS/Ontobee and we can talk about it.
How about a made-up label that resolves to a real term, localtt['omim_phenotype_and_gene']: 'gene'
That at least is legit, especially if another source had
localtt['omim_phenotype_and_gene']: 'phenotype'
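A small sketch of the idea in the last few comments: a source-local label resolves to an existing global label, which in turn resolves to an official term, so no invented term ever reaches the output. The `resolve` helper and the contents of `GLOBALTT` are illustrative assumptions, not dipper's actual tables; the CURIEs shown should be checked against the real global translation table.

```python
from typing import Dict

# Assumed global table: label -> official ontology term.
GLOBALTT: Dict[str, str] = {
    "gene": "SO:0000704",           # Sequence Ontology 'gene'
    "phenotype": "UPHENO:0001001",  # assumed CURIE for 'phenotype'
}


def resolve(label: str, localtt: Dict[str, str],
            globaltt: Dict[str, str] = GLOBALTT) -> str:
    """Map a source-local label to a global label, then to an official term.

    Raises KeyError if the label does not end at a term in globaltt, so a
    made-up label can never produce a made-up term.
    """
    global_label = localtt.get(label, label)
    return globaltt[global_label]


# Two hypothetical ingests interpreting the same local label differently.
source_a = {"omim_phenotype_and_gene": "gene"}
source_b = {"omim_phenotype_and_gene": "phenotype"}
```

With this shape, `resolve("omim_phenotype_and_gene", source_a)` and the same call with `source_b` yield different official terms, which is the per-ingest flexibility discussed above.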
We still have some of these discrepancies as of the July 2020 dataset. The issue is that Monarch considers OMIM plus-sign identifiers to be genes, while HPO considers them diseases (OMIM:151430, OMIM:109270, and a handful more).
However, the original issue, a phenotype annotated to https://beta.monarchinitiative.org/pathway/REACT:R-HSA-381426, has been fixed, so I'm going to close this.
It is a little unclear why there is a link between this pathway and 8 phenotypes:
https://beta.monarchinitiative.org/pathway/REACT:R-HSA-381426#phenotype
If we want to use transitive relations (pathway->gene->disease) then this pathway is related to over 100 diseases currently which in turn have hundreds of phenotypic features, so we should either be listing zero or hundreds, but not 8.
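As an illustration of the transitive reading (pathway->gene->disease->phenotype), here is a toy traversal. The three lookup dictionaries and the function name are hypothetical stand-ins for whatever the graph store provides; no real Monarch data is used.

```python
from typing import Dict, Iterable, Set


def phenotypes_for_pathway(
    pathway: str,
    pathway_genes: Dict[str, Iterable[str]],
    gene_diseases: Dict[str, Iterable[str]],
    disease_phenotypes: Dict[str, Iterable[str]],
) -> Set[str]:
    """Collect every phenotype reachable via pathway -> gene -> disease -> phenotype."""
    phenotypes: Set[str] = set()
    for gene in pathway_genes.get(pathway, ()):
        for disease in gene_diseases.get(gene, ()):
            phenotypes.update(disease_phenotypes.get(disease, ()))
    return phenotypes
```

Under this rule a pathway either has no gene-disease links and returns an empty set, or returns the full union of its diseases' phenotypes. A small partial count would suggest that some other path is producing the links.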