monarch-initiative / monarch-legacy

Monarch web application and API
BSD 3-Clause "New" or "Revised" License

Phenotype page: Display co-occurring phenotypes. #1538

Open jmcmurry opened 6 years ago

jmcmurry commented 6 years ago

Not super high priority, but for discussion... I could have sworn there was already a ticket for this but ... for phenotype pages, I think it would be interesting to have a tab with other phenotypes that frequently co-occur. The table would also contain a column for the number of diseases in which they co-occur.

kshefchek commented 6 years ago

See http://nbviewer.jupyter.org/github/monarch-initiative/monarch-analysis/blob/master/notebooks/phenotype-co-occurrence.ipynb for some back end ideas.

kshefchek commented 6 years ago

@jmcmurry is there a specific item in the R24 for this, or is it more toward a general aim? I've done some work computing similarity coefficients and p-values, and mocked up some methods to include frequency classes from the HPO in the analysis. Some of this is in the notebook above and some is offline. I'm not sure if this is useful or overkill for what we'll get from this analysis.

jmcmurry commented 6 years ago

Hannah advises: Look at "Market basket analysis"
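A market-basket-style view here essentially means counting, for each phenotype pair, the number of diseases in which both are annotated (the pair's "support"). A minimal sketch, with made-up disease and phenotype IDs (not from the thread):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(disease_phenotypes):
    """For each phenotype pair, count the diseases in which both
    are annotated -- the pair's 'support' in market basket terms."""
    pair_counts = Counter()
    for phenotypes in disease_phenotypes.values():
        # sort so (p1, p2) and (p2, p1) collapse to one key
        for pair in combinations(sorted(set(phenotypes)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Toy data; real input would be disease -> HPO annotation lists.
diseases = {
    "MONDO:1": ["HP:A", "HP:B", "HP:C"],
    "MONDO:2": ["HP:A", "HP:B"],
    "MONDO:3": ["HP:B", "HP:C"],
}
counts = cooccurrence_counts(diseases)
# counts[("HP:A", "HP:B")] is 2: they co-occur in MONDO:1 and MONDO:2
```

Sorting by support (and adding the "number of diseases in which they co-occur" column from the original request) is then a single pass over `pair_counts`.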

kshefchek commented 6 years ago

Here is an update (rename as .html): phenotype-co-occurrence.txt

kshefchek commented 6 years ago

The feedback I'm interested in:

pnrobinson commented 6 years ago

I would say that co-occurrence is more of a FYI than a statistical test, and I am not sure that users would know what to do with the information about p-values. A more important question is how to deal with the implicit annotations. I would find it useful to know what overall categories are frequently shared, but at least on the website I am less convinced people would want to see long lists of shared terms.

kshefchek commented 6 years ago

Thanks, this is helpful! The p-value is more for establishing a cut-off in cases where it's not obvious how to interpret some normalization of the data, compared to a correlation coefficient where you might set the cutoff at abs(.7). But I see your point and agree this might be excessive.

EDIT: disregard original comment on implicit phenotype co-counts, I went about this incorrectly.

kshefchek commented 6 years ago

I've reworked the code for generating implicit co-occurrence data.

Code: it takes ~20 minutes and requires ~85 GB of memory. It would be interesting to see how this would perform in another language (e.g. Julia). Output: https://data.monarchinitiative.org/analysis/co-occurrence/co-occur.tsv

The top co-occurring count is greater than the number of diseases with phenotype annotations. The alternative would be to convert all phenotypes and their closures into a single set per disease, but then we would miss co-occurrence on all implicit classes when two explicit terms share a common ancestor.

Next step would be to account for terms in the same lineage, or alternatively only consider terms with the same distance from the root class.

cmungall commented 6 years ago

hmm, I've managed to do this using ontobio on a laptop before

TomConlin commented 6 years ago

Filtering redundant phenotypes as early as possible is key. Doing it on the server in Solr would be ideal, because then you would not transmit them, but even making each list a set will do. It takes about 6 minutes on my machine with your code, and five minutes of that is loading the data,
so a rewrite in Julia is likely to save !!!DOZENS!!! of seconds.

Note I have no permission to push to that repo, but:

-closure_list = [closures for closures in closure_map.values()]
+closure_list = [set(closures) for closures in closure_map.values()]

and you will be under 10 GB and 2 minutes of processing (loading stays the same).
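The one-line change above matters because a closure list can repeat the same ancestor many times, once per annotation that implies it; collapsing each list to a set both dedupes it and makes intersection and membership tests cheap. A tiny illustration (the `closure_map` variable name follows the diff; the contents here are made up):

```python
# Hypothetical shape: disease -> its phenotype terms plus all their
# ancestors, where a shared ancestor ("HP:root") appears once per
# annotation that implies it.
closure_map = {
    "MONDO:1": ["HP:A", "HP:root", "HP:B", "HP:root"],
    "MONDO:2": ["HP:B", "HP:root"],
}

# Before the diff: duplicates are stored and re-scanned on every pass.
closure_list = [closures for closures in closure_map.values()]

# After the diff: duplicates collapse and intersections are fast,
# which is where the memory and time savings come from.
closure_sets = [set(closures) for closures in closure_map.values()]

shared = closure_sets[0] & closure_sets[1]
# shared == {"HP:B", "HP:root"}
```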

pnrobinson commented 6 years ago

It seems odd that this requires such heavy computational resources. I had a prototype solution in Java that also arranged things according to category (but did not calculate p-values) and was pretty fast. I need to refactor it after having refactored everything else to use phenol, but 85 GB/20 minutes is certainly excessive. It would be good to collaborate more on code like this; why don't you take a look at HPO Workbench and see if that starts to fulfil the requirements?

kshefchek commented 6 years ago

@TomConlin I think it depends on what we want out of the analysis. If a disease is annotated to 'abnormal optic nerve' and 'abnormal neuron', would I want to capture that 'abnormality of the nervous system' co-occurs with itself once in this disease? If we convert the implicit classes to a set, we miss this. This is why the top count in the tsv is much higher than the total count of diseases.

@pnrobinson the code here looks at co-occurrence of every explicit and implicit class all the way up to HP:0000001 (which is unnecessary). If we were to look at just a subset of categorical phenotypes, it would be far less resource-hungry.

TomConlin commented 6 years ago

To capture what a disease is annotated to, we would have to distinguish the terms from all their included ancestors. Converting to a set means HP:0000118 shows up once per disease instead of ~25 times per disease. That is, you still get your disease associated with 'abnormality of the nervous system', but only once.

kshefchek commented 6 years ago

HP:0000118 isn't a great example because it doesn't make sense to capture, but say I have a disease annotated to 25 phenotypes that are all subclasses of 'nervous system abnormality', how many times does 'nervous sys abnormality' co-occur within that disease?

TomConlin commented 6 years ago

I am content with once.
I don't get a new grandparent through each of my cousins.

pnrobinson commented 6 years ago

With the inherited annotations, you need to count them only once per disease. That is, if a patient has abn of the brain and abn of the spinal cord, this would naively result in two inferred annotations for abn of the nervous system, but this is wrong, because according to the HPO model the annotation needs to be counted only once.
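Under that model, building each disease's annotation set as the union of per-term closures gives the once-per-disease counting automatically. A sketch using a hypothetical two-term mini-ontology (term names and the `ancestors` map are illustrative, not real HPO data):

```python
def disease_closure(explicit_terms, ancestors):
    """Union of each explicit term's ancestor closure (plus the term
    itself), so an inferred term like 'abn of the nervous system'
    appears at most once per disease."""
    closure = set()
    for term in explicit_terms:
        closure |= ancestors.get(term, set()) | {term}
    return closure

# Hypothetical mini-ontology: both explicit terms imply one ancestor.
ancestors = {
    "abn_brain": {"abn_nervous_system"},
    "abn_spinal_cord": {"abn_nervous_system"},
}
closure = disease_closure(["abn_brain", "abn_spinal_cord"], ancestors)
# "abn_nervous_system" is in the set exactly once, not twice
```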

kshefchek commented 6 years ago

Okay I was going about this wrong then!

TomConlin commented 6 years ago

It could be interesting to look at this from the phenotype-ancestor "score card" point of view, but you still would not compute over them, just split out a count to be looked up later.

kshefchek commented 6 years ago

If it's interesting I will leave it as an option and compute it both ways. I understand everyone’s point that you can only be in one of two states of abnormality at the system level (present/absent). But say a patient presents with a mole on their arm, and an abscess on their thigh, would we not say they have two skin abnormalities occurring together?

pnrobinson commented 6 years ago

IMHO that is not the context of this approach -- we are not talking about what is happening in an individual patient; we are talking about whether any two diseases share an abnormality. I think it would just be confusing to double count in this way.

kshefchek commented 6 years ago

> we are talking about whether any two diseases share an abnormality

It sounds like the way I'm calculating this is fundamentally wrong, as I'm looking at phenotypes occurring within the same disease.

pnrobinson commented 6 years ago

I see -- I would say not wrong, but a different calculation. I was thinking that we take all diseases that have HPO:X and then ask what the most common co-occurring terms are. Possibly both calculations are interesting....

cmungall commented 6 years ago

Here is what I think should be done.

This should be done as a standard enrichment test between two gene sets, i.e. a Fisher exact test for genes in P1 vs genes in P2, with appropriate correction for multiple tests. Skip tests if P1 and P2 are mutual ancestors/descendants.

Note this will give you a lot of significant matches between siblings and grandsiblings etc, so the appropriate background test is the set of all genes in the MRCAs of P1 and P2. The goal is to find latent connections not already in the ontology.

As far as implementation, I would avoid any direct computation in solr. Just load everything into main memory and do the calculations there with any necessary optimizations. The language is largely irrelevant, but note that ontobio has all the necessary calls to load into an association object any set of annotations in monarch, so the same analysis could be repeated for human PxP with genes, PxP with diseases, mouse PxP, PxP with orthologous genes (ie phenologs), PxGO, DxGO etc.
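One way to set up the 2x2 table described above, as a sketch: a pure-Python one-sided hypergeometric p-value (rather than any particular library call), with the MRCA gene set as the background. The gene IDs and set contents below are made up for illustration.

```python
from math import comb

def fisher_greater_p(n_both, n_p1, n_p2, n_background):
    """One-sided Fisher exact p-value for enrichment: probability of
    seeing >= n_both shared genes under the hypergeometric null."""
    total = comb(n_background, n_p2)
    p = 0.0
    for k in range(n_both, min(n_p1, n_p2) + 1):
        p += comb(n_p1, k) * comb(n_background - n_p1, n_p2 - k) / total
    return p

def phenotype_pair_enrichment(genes_p1, genes_p2, background):
    """Test whether phenotypes P1 and P2 share more genes than expected;
    background should be the genes in the MRCA(s) of P1 and P2, per the
    comment above."""
    p1, p2 = genes_p1 & background, genes_p2 & background
    return fisher_greater_p(len(p1 & p2), len(p1), len(p2), len(background))

# Toy gene sets (hypothetical IDs)
p_val = phenotype_pair_enrichment(
    {"g1", "g2"}, {"g2", "g3"}, {"g1", "g2", "g3", "g4"})
```

Multiple-test correction (e.g. Benjamini-Hochberg over all tested pairs) and the ancestor/descendant skip would sit in a loop around this.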

kshefchek commented 6 years ago

@cmungall can you look at the notebook here https://github.com/monarch-initiative/monarch-app/issues/1538#issuecomment-377278609 and comment on whether I'm setting up the Fisher exact test correctly? I think we're on the same page but am not certain. In your example, does the intersection of diseases annotated to P1|P2 go in the 2x2 table?