greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Verifying Experimental Analysis Design #28

Open danich1 opened 6 years ago

danich1 commented 6 years ago

I talked with @dhimmel yesterday, and we came up with a design for determining whether adding input from a deep learning model (LSTM) is beneficial for predicting relationships between Diseases and Genes.

Background:

[figure: project overview]

In the image above we have all disease-gene pair mappings, where some edges are mentioned in PubMed abstracts (noted by the black dashes) and the majority of edges aren’t mentioned at all. The edges in green are considered true edges, as they are currently contained in hetnet v1; the other edges (not highlighted) have the potential to be true Disease-Gene relationships. We aim to classify each edge as either positive (true edge) or negative (false edge), under the hypothesis that using NLP and deep learning (long short-term memory networks, or LSTMs for short) will provide better accuracy than standard methods.

Analysis Design:

To test this hypothesis, we plan to use the following design:

| Categories | Prior | Co-occurrences | Natural Language Processing (NLP) |
| --- | --- | --- | --- |
| Models | 1 model | 1 model with sentences<br>1 model w/o sentences | 1 model with sentences<br>1 model w/o sentences |
| Literature use | Literature unaware | LSTM unaware | LSTM aware |

The prior category uses a model that classifies each disease-gene edge without any information from biomedical literature (hence literature unaware).

The co-occurrence category uses a model that combines the prior model with features extracted from biomedical literature (e.g., the expected number of sentences that mention a given disease-gene pair, the p-value for each disease-gene edge, how many unique abstracts mention a given disease-gene pair, etc.). Note that this model doesn’t use the LSTM and relies only on features extracted from the literature itself. A challenge here will be handling the edges that aren’t mentioned in the literature at all (the model w/o sentences).

Lastly, the NLP category combines the other two models and adds input from a deep learning model (the probability that a sentence is evidence for a true disease-gene relationship). We expect the NLP-category model to outperform the models from the other two categories.
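To make the three categories concrete, here is a minimal sketch of the nested feature sets, assuming a pandas DataFrame of disease-gene pairs. The file name and every column name (`disease_degree`, `gene_degree`, `n_sentences`, `n_abstracts`, `cooccurrence_pvalue`, `mean_lstm_probability`, `label`) are hypothetical, and logistic regression simply stands in for whatever classifier we end up using.

```python
# Hypothetical sketch: the three categories as nested feature sets.
# All file and column names below are assumptions, not the project's actual names.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per disease-gene pair; `label` is 1 for hetnet v1 edges, 0 otherwise.
edges = pd.read_csv("disease_gene_pairs.csv")  # hypothetical file

feature_sets = {
    # Prior: degree-based features only, no literature information.
    "prior": ["disease_degree", "gene_degree"],
    # Co-occurrence: prior features plus literature summary statistics.
    "cooccurrence": ["disease_degree", "gene_degree",
                     "n_sentences", "n_abstracts", "cooccurrence_pvalue"],
    # NLP: co-occurrence features plus the LSTM's sentence-level output,
    # aggregated to one score per edge (e.g., mean predicted probability).
    "nlp": ["disease_degree", "gene_degree",
            "n_sentences", "n_abstracts", "cooccurrence_pvalue",
            "mean_lstm_probability"],
}

# For edges with no sentences the literature columns would be missing;
# handling those edges is challenge 2 below.
models = {}
for name, features in feature_sets.items():
    models[name] = LogisticRegression(max_iter=1000).fit(
        edges[features], edges["label"])
```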

Challenges:

  1. What is a fair prior model to use for this analysis?
  2. What do we do about edges that are in hetnet, but aren’t mentioned in literature? How can we classify these edges?
dhimmel commented 6 years ago

Great summary of our brainstorm @danich1!

> What is a fair prior model to use for this analysis?

The prior should just be the probability that the disease is associated with the gene based only on the degrees of the gene and disease (in the training network). See this notebook, which computes these prior probabilities and should need only minimal modifications. Note that for this analysis you won't be fitting any classifier model; you will use the prior probability directly to rank the observations for the ROC curve.
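To illustrate that last point, here is a hedged sketch in which a simple degree-product score stands in for the notebook's actual prior probabilities, reusing the hypothetical columns from the sketch above. No classifier is fit; the score itself ranks the observations for the ROC curve.

```python
# Sketch: rank disease-gene pairs by a degree-based prior, with no model fitting.
# The degree product is a stand-in for the notebook's prior probabilities;
# column names are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

edges = pd.read_csv("disease_gene_pairs.csv")  # hypothetical file

# Pairs of high-degree nodes are more likely to share an edge, so the raw
# degree product already induces a ranking we can score directly.
edges["prior_score"] = edges["disease_degree"] * edges["gene_degree"]
print("Prior-only AUROC:", roc_auc_score(edges["label"], edges["prior_score"]))
```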

> What do we do about edges that are in hetnet, but aren’t mentioned in literature? How can we classify these edges?

Relationships without any sentences will still have some features, for example the degree-based prior.

As a result, for the NLP predictions you will have to fit a fallback model for observations with no sentences. Therefore, not all predictions in the NLP stage will use NLP information, only the ones that have sentences. Of course, an important limitation of NLP is that it only works for observations with sentences, and the ROC curve should reflect that.
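Here is a sketch of what that routing could look like, again using the hypothetical columns from the earlier sketches: fit the NLP-informed model on edges that have sentences, fit a fallback model on the rest, and merge the two score sets into a single ranking so the ROC curve covers every edge.

```python
# Sketch: NLP model for edges with sentences, fallback model for the rest.
# Column names are assumptions carried over from the earlier sketches.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

edges = pd.read_csv("disease_gene_pairs.csv")  # hypothetical file
has_sentences = (edges["n_sentences"] > 0).values

nlp_features = ["disease_degree", "gene_degree", "n_sentences",
                "n_abstracts", "cooccurrence_pvalue", "mean_lstm_probability"]
fallback_features = ["disease_degree", "gene_degree"]

# Fit each model only on the edges it will be responsible for scoring.
nlp_model = LogisticRegression(max_iter=1000).fit(
    edges.loc[has_sentences, nlp_features],
    edges.loc[has_sentences, "label"])
fallback_model = LogisticRegression(max_iter=1000).fit(
    edges.loc[~has_sentences, fallback_features],
    edges.loc[~has_sentences, "label"])

# Route every edge through the appropriate model so the combined scores
# cover all observations in a single ROC curve.
scores = np.empty(len(edges))
scores[has_sentences] = nlp_model.predict_proba(
    edges.loc[has_sentences, nlp_features])[:, 1]
scores[~has_sentences] = fallback_model.predict_proba(
    edges.loc[~has_sentences, fallback_features])[:, 1]
print("Combined AUROC:", roc_auc_score(edges["label"], scores))
```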