Open kshefchek opened 7 years ago
Thanks for documenting this, Kent. Another example of a potentially problematic inference here is when we infer over variant - contributes_to - disease (as opposed to only inferring over cases where the variant is causative/pathogenic for the disease). Such assertions also come from sources like Orphanet, where the annotated gene/variant may be one of many contributing factors. Kent's proposal to restrict gene-disease inferences to occur only across causative/pathogenic relationships would address this.
The diseases tab in the FANCA example highlights both issues:
(Update: Kent has removed contributes_to from the inference path in the Cypher queries for inferring gene-pheno associations.)
If we take a look at the pattern, these are the sources that contribute to it (query):
Source | Annotations |
---|---|
total | 205054 |
hpoa | 204065 |
orphanet | 132047 |
clinvar | 105678 |
omim | 91618 |
gwascatalog | 20921 |
coriell | 6859 |
monarch | 989 |
omia | 989 |
So we need to consider what is coming from the omia + monarch annotation set, and possibly annotate the variant to disease relations as pathogenic (if we want these G2P assertions), or make them directly.
There are 905 OMIM gene disease annotations that are not recapitulated by ClinVar. Going through some examples I'm finding ClinVar may have some incorrect annotations: https://monarchinitiative.org/gene/NCBIGene:57410
From OMIM: SCYL1 - Acute infantile liver failure-cerebellar ataxia-peripheral sensory motor neuropathy syndrome (OMIM:616719)
From ClinVar: SCYL1 - Spinocerebellar ataxia 21 (OMIM:607454) https://www.ncbi.nlm.nih.gov/clinvar/variation/218910/ (oddly OMIM is credited)
Spinocerebellar ataxia 21 appears to be a subtype specific to a mutation in TMEM240.
There was also a case we found with Maureen where BRCA1 was incorrectly attributed to Fanconi Complementation Group A by ClinVar.
TL;DR: relying on ClinVar alone is probably not the best move. Instead, we may want to examine the number of sources making a gene-to-disease assertion among OMIM, Orphanet, and ClinVar (>2 = propagate). This would not work for common disease or somatic variants.
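The source-counting idea can be sketched as follows. This is only an illustration, not the pipeline's implementation: the function name, the tuple layout, and the identifiers are hypothetical, and the cutoff is kept as a parameter whose default follows the ">2" rule as written in the comment.

```python
from collections import defaultdict


def assertions_to_propagate(assertions, threshold=2):
    """Return (gene, disease) pairs asserted by more than `threshold` distinct sources.

    `assertions` is an iterable of (gene, disease, source) tuples.
    The layout is an assumption made for this sketch.
    """
    sources_by_pair = defaultdict(set)
    for gene, disease, source in assertions:
        # A set means repeated assertions from one source count only once.
        sources_by_pair[(gene, disease)].add(source)
    return {
        pair
        for pair, sources in sources_by_pair.items()
        if len(sources) > threshold
    }


# Placeholder identifiers, not real annotation data.
example = [
    ("GENE:1", "DISEASE:A", "omim"),
    ("GENE:1", "DISEASE:A", "orphanet"),
    ("GENE:1", "DISEASE:A", "clinvar"),
    ("GENE:2", "DISEASE:B", "clinvar"),
    ("GENE:2", "DISEASE:B", "clinvar"),
]

propagated = assertions_to_propagate(example)
print(propagated)
```

Here only the GENE:1 pair is kept, because GENE:2 is supported by a single distinct source. Lowering `threshold` would be the knob to turn if agreement between any two sources is considered enough.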
As a follow up, we use the has_phenotype relation for both OMIM and Orphanet. Orphanet provides gene to disease annotations to both disease groups and leaf nodes (after MONDO merge), with the breakdown:
- 3218 associations to leaf nodes
- 3361 associations to disease groups
So at this point we should leave things as they are, but this is an interesting problem and we could start brainstorming potential solutions. I have removed the contributes_to relation when inferring across diseases, which is important now that we're merging EFO into MONDO. For now we'll need to keep has_phenotype when inferring across diseases, and both has_phenotype and contributes_to for direct G2P queries.
Currently we infer human gene-to-phenotype annotations by joining gene - disease (through a variant) with disease - phenotype, with some minor filtering of the gene-disease associations to exclude markers, GWAS studies, etc.
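The join described above can be sketched in a few lines. This is a simplified illustration with made-up identifiers and a hypothetical function name; the real inference runs as Cypher queries over the graph and includes filtering that is omitted here.

```python
def infer_gene_to_phenotype(gene_to_disease, disease_to_phenotypes):
    """Join gene-disease pairs with disease-phenotype annotations.

    gene_to_disease: iterable of (gene, disease) pairs, assumed to be already
        derived through a variant and already filtered.
    disease_to_phenotypes: dict mapping a disease to a set of phenotypes.
    """
    inferred = set()
    for gene, disease in gene_to_disease:
        # Diseases with no phenotype annotations contribute nothing.
        for phenotype in disease_to_phenotypes.get(disease, ()):
            inferred.add((gene, phenotype))
    return inferred


# Placeholder identifiers, not real annotation data.
gene_to_disease = [("GENE:1", "DISEASE:A"), ("GENE:1", "DISEASE:B")]
disease_to_phenotypes = {
    "DISEASE:A": {"PHENO:1", "PHENO:2"},
    "DISEASE:B": {"PHENO:1", "PHENO:3", "PHENO:4"},
}

inferred = infer_gene_to_phenotype(gene_to_disease, disease_to_phenotypes)
print(sorted(inferred))
```

Every additional disease attached to a gene adds all of that disease's phenotypes to the gene, so the size of the inferred set depends directly on which gene-disease edges are allowed into the join.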
Common sources include ClinVar (where we store two levels of variant-to-disease associations: pathogenic and likely pathogenic), OMIM, and Orphanet. However, we often see annotations to disease groups, resulting in a large number of gene-to-phenotype associations. FANCA is a good example: https://monarchinitiative.org/gene/NCBIGene:2175
Empirically, these tend to come from Orphanet, and this might be a consequence of the MONDO pipeline.
While there are many potential avenues for aggregating and quantifying evidence for our inferred associations, one simple short-term solution would be to limit these joins to variants asserted as pathogenic for a disease in ClinVar. This can be done by adjusting the Cypher query accordingly. cc @cmungall @mbrush
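As a rough sketch of the proposed restriction, the snippet below keeps only variant-disease edges that ClinVar asserts as pathogenic before they are used in the join. The dictionary keys, the relation labels (`pathogenic_for`, `likely_pathogenic_for`, `contributes_to`), and the identifiers are assumptions for illustration; the actual change would be made in the Cypher query.

```python
def restrict_to_clinvar_pathogenic(edges, allowed_relations=frozenset({"pathogenic_for"})):
    """Keep variant-disease edges asserted by ClinVar with an allowed relation.

    Each edge is a dict with the hypothetical keys
    'gene', 'variant', 'disease', 'relation', and 'source'.
    """
    return [
        edge
        for edge in edges
        if edge["source"] == "clinvar" and edge["relation"] in allowed_relations
    ]


# Placeholder records, not real annotation data.
edges = [
    {"gene": "GENE:1", "variant": "VAR:1", "disease": "DISEASE:A",
     "relation": "pathogenic_for", "source": "clinvar"},
    {"gene": "GENE:1", "variant": "VAR:2", "disease": "DISEASE:A",
     "relation": "likely_pathogenic_for", "source": "clinvar"},
    {"gene": "GENE:1", "variant": "VAR:3", "disease": "DISEASE:GROUP",
     "relation": "contributes_to", "source": "orphanet"},
]

kept = restrict_to_clinvar_pathogenic(edges)
gene_disease_pairs = {(edge["gene"], edge["disease"]) for edge in kept}
print(gene_disease_pairs)
```

Passing a wider `allowed_relations` set (for example, one that also includes the likely pathogenic label) would be one way to loosen the filter if the strict version drops too many associations.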