greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

System-wide automatic extraction of functional signatures in Pseudomonas aeruginosa with eADAGE #101

Closed agitter closed 7 years ago

agitter commented 7 years ago

http://doi.org/10.1101/078659

Related to #22. @cgreene can tell us more.

Abundant public expression data capture gene expression across diverse conditions. These steady state mRNA measurements could reveal the transcriptional consequences of cells' genetic backgrounds or their responses to the environment. However, public data remain relatively untapped, in part because extracting biological signal as opposed to technical noise remains challenging. Here we introduce a procedure, termed eADAGE, that performs unsupervised integration of public expression data using an ensemble of neural networks as well as heuristics that, given a dataset, help users identify an appropriate level of model complexity. This ensemble modeling approach captures biological pathways more clearly than existing methods, enabling analyses that span entire public gene expression compendia such as that for the bacterium Pseudomonas aeruginosa. These analyses reveal a previously undiscovered feature of the phosphate starvation response apparent in public data: a sensor kinase, KinB, that is required for full activation of the response to phosphate at intermediate concentrations. Our molecular validation experiments confirm this role of KinB and our screen of a histidine kinase knock out collection confirmed the prediction's specificity. Public data are captured from a broad range of conditions in diverse organism backgrounds and may provide a unique opportunity to identify these subtle and context-specific regulatory interactions. Algorithms that extract biological signal from these data, such as eADAGE, can highlight opportunities to discover mechanisms that are apparent from but unrealized in public data.

cgreene commented 7 years ago

My quick summary:

Run-to-run variability between models poses a challenge. Does it also present an opportunity? Given Bin Yu's stability paper [ http://projecteuclid.org/euclid.bj/1377612862 ] we thought it might. We developed an approach that combines neural network nodes across many denoising autoencoders via the similarity of their weight vectors. This turns out to increase the concordance of the model with known KEGG pathways (the objective is not to predict pathways, but concordance with pathways seems like a reasonable proxy for comparing the biological relevance of models). It looks like this ensemble + aggregation procedure accomplishes this by capturing pathways that are otherwise best captured by models of a specific size (pathways captured by 10-node models differ from those captured by 1000-node models, but after aggregation this effect is mitigated - Figure EV3).
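To make the aggregation step concrete, here is a minimal sketch of the general idea - clustering node weight vectors across an ensemble of trained models and averaging each cluster into a consensus node. This is an illustration of the technique as described above, not the authors' exact eADAGE procedure; the random weight matrices stand in for separately trained denoising autoencoders, and all sizes are made up.

```python
# Illustrative sketch (NOT the exact eADAGE algorithm): combine nodes from
# an ensemble of models by clustering their weight vectors, then average
# each cluster into a consensus signature.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_genes, nodes_per_model, n_models = 500, 50, 10

# Stand-in for trained models: each contributes a (nodes x genes) weight
# matrix; in practice these would come from separately trained autoencoders.
weight_vectors = np.vstack(
    [rng.normal(size=(nodes_per_model, n_genes)) for _ in range(n_models)]
)

# Cluster nodes from all models together, using correlation distance
# between weight vectors so similarly weighted nodes group together.
dist = pdist(weight_vectors, metric="correlation")
labels = fcluster(linkage(dist, method="average"),
                  t=nodes_per_model, criterion="maxclust")

# Consensus node = mean weight vector of each cluster.
consensus = np.vstack(
    [weight_vectors[labels == k].mean(axis=0) for k in np.unique(labels)]
)
print(consensus.shape)
```

The consensus matrix has one row per cluster (at most `nodes_per_model` rows) and one column per gene, so it can be used like the weight matrix of a single model.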

Signatures constructed by this approach are robust enough to perform cross-public-data-compendium analyses. We end up using them for one such analysis. We focused on media, because often a single medium is used in each experiment but many media are used across the compendium. Specifically, we define a simple 'interestingness' metric - the 'activation score' of a signature - which aims to identify signatures that are active/inactive in only a few media.
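As an illustration of the kind of metric described above (a hedged sketch, not the paper's actual formula), one could score each signature by how far its mean activity in one medium sits from its mean activity everywhere else; a signature active in only a few media then scores highly. The media labels and the scaling are invented for this example.

```python
# Hedged sketch of an 'interestingness' / activation-score metric in the
# spirit described above; the exact definition in the paper may differ.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_signatures = 120, 20
media = rng.choice(["LB", "M9", "PIA"], size=n_samples)  # hypothetical media
activity = rng.normal(size=(n_samples, n_signatures))

# Plant a signal: signature 3 is strongly active only in "M9" samples.
activity[media == "M9", 3] += 5.0

def activation_score(activity, media):
    """For each signature, the max over media of the gap between its mean
    activity inside that medium and outside it, scaled by its overall std."""
    scores = np.zeros(activity.shape[1])
    for m in np.unique(media):
        inside = activity[media == m].mean(axis=0)
        outside = activity[media != m].mean(axis=0)
        gap = np.abs(inside - outside) / activity.std(axis=0)
        scores = np.maximum(scores, gap)
    return scores

scores = activation_score(activity, media)
print(int(scores.argmax()))  # the planted medium-specific signature, 3
```

Ranking signatures by this score surfaces the medium-specific ones first, which is the behavior described in the comment.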

The signature with the highest activation score captured phosphate starvation. It also pointed to a new player in the process: KinB, a histidine kinase, acting specifically at intermediate phosphate concentrations. We performed experiments using kinB in-frame deletions to validate this, and we also performed experiments with in-frame deletion mutants of other histidine kinases to confirm the specificity of this prediction.

I found this work fun because this is not something that we could have discovered with the standard experiments targeting phosphate starvation. We normally aim to use conditions that produce the largest effect (namely, very high or very low phosphate levels), which minimizes the chance of finding nothing. However, this also means that we ignore any biology that happens at intermediate levels. By using a public data collection, in which the phosphate concentration is unintentionally perturbed across experiments, we can actually see what biology happens in that range.

I think the main message of this paper is related to the importance/utility of public data more than the method. Of course, the robust signatures are required, but the process to generate them via an ensemble of neural networks isn't something that I'd consider classical deep learning. Maybe 'shallow learning'?

agitter commented 7 years ago

@cgreene Since you are an author, I don't think we need further discussion. Closing this issue.