Question: should we define tighter constraints for gene to phenotype query paths/joins

kshefchek commented 7 years ago

Currently we infer human gene to phenotype annotations by joining gene - disease (through a variant) with disease - phenotype, with some minor filtering of gene-disease associations for markers, GWAS studies, etc.

Common sources include ClinVar, where we store two levels of variant to disease associations (pathogenic, likely pathogenic), OMIM, and Orphanet. However, we often see annotations to disease groups, which results in a large number of gene to phenotype associations. FANCA is a good example: https://monarchinitiative.org/gene/NCBIGene:2175

Empirically, these tend to come from Orphanet, and this might be a consequence of the MONDO pipeline.

While there are many potential avenues for aggregating and quantifying evidence for our inferred associations, one simple short-term solution would be to limit these joins to variants asserted as pathogenic for a disease in ClinVar. This can be done by adjusting the Cypher query accordingly. cc @cmungall @mbrush
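To make the proposed restriction concrete, here is a minimal sketch of the gene - variant - disease - phenotype join over toy in-memory data. It is not the actual Cypher query; every identifier and relation label below (`GENE1`, `pathogenic_for`, and so on) is a hypothetical placeholder, not the real graph schema.

```python
from collections import defaultdict

# Hypothetical toy data; identifiers and relation labels are placeholders only.
VARIANT_OF_GENE = {"var1": "GENE1", "var2": "GENE1"}

# (variant, relation, disease) triples
VARIANT_DISEASE = [
    ("var1", "pathogenic_for", "disease_A"),
    ("var2", "likely_pathogenic_for", "disease_A"),
    ("var2", "contributes_to", "disease_B"),
]

DISEASE_PHENOTYPE = {
    "disease_A": {"pheno_1", "pheno_2"},
    "disease_B": {"pheno_3"},
}


def infer_gene_phenotypes(allowed_relations):
    """Join gene -> variant -> disease -> phenotype over allowed relations only."""
    result = defaultdict(set)
    for variant, relation, disease in VARIANT_DISEASE:
        if relation not in allowed_relations:
            continue  # skip edges outside the permitted inference path
        gene = VARIANT_OF_GENE[variant]
        result[gene] |= DISEASE_PHENOTYPE.get(disease, set())
    return dict(result)


# Restricting to pathogenic assertions drops the contributes_to path.
strict = infer_gene_phenotypes({"pathogenic_for"})
loose = infer_gene_phenotypes(
    {"pathogenic_for", "likely_pathogenic_for", "contributes_to"}
)
```

With this toy data, `strict` only carries the phenotypes of `disease_A`, while `loose` also picks up `pheno_3` through the `contributes_to` edge.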

mbrush commented 7 years ago

Thanks for documenting this, Kent. Another example of a potentially problematic inference is when we infer over variant - contributes_to - disease (as opposed to only inferring over cases where the variant is causative/pathogenic for the disease). Such assertions also come from sources like Orphanet, where the annotated gene/variant may be one of many contributing factors. Kent's proposal to restrict gene-disease inferences to causative/pathogenic relationships would address this.

The diseases tab in the FANCA example highlights both issues:

- FANCA has_phenotype FanconiAnemia leads to the inference that FANCA is associated with all FA disease phenotypes (instead of only those for the FA complementation group A disease subtype).
- FANCA contributes_to Vitiligo would lead to the inference that FANCA is associated with all phenotypes of Vitiligo (although currently none are annotated to this disease).

Update: Kent has removed contributes_to from the inference path in the Cypher queries for inferring gene-pheno associations.

kshefchek commented 7 years ago

If we take a look at the pattern, these are the sources that contribute to it (query):

| Source | Annotations |
|---|---|
| total | 205054 |
| hpoa | 204065 |
| orphanet | 132047 |
| clinvar | 105678 |
| omim | 91618 |
| gwascatalog | 20921 |
| coriell | 6859 |
| monarch | 989 |
| omia | 989 |

So we need to consider what is coming from the omia + monarch annotation set, and possibly annotate the variant to disease relations as pathogenic (if we want these G2P assertions), or make them directly.

There are 905 OMIM gene disease annotations that are not recapitulated by ClinVar. Going through some examples I'm finding ClinVar may have some incorrect annotations: https://monarchinitiative.org/gene/NCBIGene:57410

From OMIM: SCYL1 - Acute infantile liver failure-cerebellar ataxia-peripheral sensory motor neuropathy syndrome (OMIM:616719)

From ClinVar: SCYL1 - Spinocerebellar ataxia 21 (OMIM:607454) https://www.ncbi.nlm.nih.gov/clinvar/variation/218910/ (oddly OMIM is credited)

Spinocerebellar ataxia 21 appears to be a subtype specific to a mutation in TMEM240.

There was also a case we found with Maureen where BRCA1 was incorrectly attributed to Fanconi Complementation Group A by ClinVar.

TL;DR: relying on ClinVar is probably not the best move; instead we may want to examine the number of sources making a gene to disease assertion among OMIM, Orphanet, and ClinVar (>2 = propagate; this would not work for common disease or somatic variants).
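A rough sketch of the source-agreement idea, assuming gene to disease assertions are available as (gene, disease, source) triples. The function name, the toy triples, and the `min_sources` parameter are all hypothetical. The threshold is left as a parameter because, with only three sources, ">2" means all three must agree (`min_sources=3`).

```python
from collections import defaultdict


def supported_pairs(assertions, min_sources):
    """Return (gene, disease) pairs asserted by at least `min_sources` distinct sources."""
    sources_by_pair = defaultdict(set)
    for gene, disease, source in assertions:
        sources_by_pair[(gene, disease)].add(source)
    return {
        pair
        for pair, sources in sources_by_pair.items()
        if len(sources) >= min_sources
    }


# Hypothetical toy assertions; gene and disease identifiers are placeholders.
ASSERTIONS = [
    ("GENE1", "disease_A", "omim"),
    ("GENE1", "disease_A", "orphanet"),
    ("GENE1", "disease_A", "clinvar"),
    ("GENE1", "disease_B", "clinvar"),
    ("GENE2", "disease_C", "omim"),
    ("GENE2", "disease_C", "orphanet"),
]

all_three = supported_pairs(ASSERTIONS, min_sources=3)
at_least_two = supported_pairs(ASSERTIONS, min_sources=2)
```

Only pairs in the returned set would be propagated to phenotypes. A pair asserted by ClinVar alone, like `("GENE1", "disease_B")` here, is excluded at either threshold.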

kshefchek commented 7 years ago

As a follow-up, we use the has_phenotype relation for both OMIM and Orphanet. Orphanet provides gene to disease annotations to both disease groups and leaf nodes (after the MONDO merge), with the following breakdown:

3218 associations to leaf nodes
3361 associations to disease groups
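One way such a breakdown could be computed is to treat any disease that has at least one subclass as a group and everything else as a leaf. The sketch below assumes a simple child-to-parents map. The function name and the toy hierarchy are hypothetical, and this is not necessarily how the counts above were produced.

```python
def split_by_leaf(associations, subclass_of):
    """Split (gene, disease) pairs into leaf-node and disease-group associations.

    `subclass_of` maps each disease to the set of its direct parents; a disease
    counts as a group if any other disease lists it as a parent.
    """
    groups = {parent for parents in subclass_of.values() for parent in parents}
    leaf_assocs, group_assocs = [], []
    for gene, disease in associations:
        target = group_assocs if disease in groups else leaf_assocs
        target.append((gene, disease))
    return leaf_assocs, group_assocs


# Hypothetical toy hierarchy: two subtypes under one disease group.
SUBCLASS_OF = {
    "disease_group": set(),
    "subtype_A": {"disease_group"},
    "subtype_B": {"disease_group"},
}
ASSOCIATIONS = [
    ("GENE1", "disease_group"),
    ("GENE1", "subtype_A"),
    ("GENE2", "subtype_B"),
]

leaf_assocs, group_assocs = split_by_leaf(ASSOCIATIONS, SUBCLASS_OF)
```

An association to `disease_group` is the kind that would pull in the phenotypes of every subtype when joined to disease - phenotype annotations.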

So at this point we should leave things as is, but this is an interesting problem and we could start brainstorming potential solutions. I have removed the contributes_to relation when inferring across diseases, which is important now that we're merging EFO into MONDO. For now we'll need to leave has_phenotype in when inferring across diseases, and keep both has_phenotype and contributes_to for direct G2P queries.

monarch-initiative / monarch-cypher-queries

Question: should we define tighter constraints for gene to phenotype query paths/joins #23