My quick summary:
Run to run variability between models poses a challenge. Does it also present an opportunity? Given Bin Yu's Stability paper [ http://projecteuclid.org/euclid.bj/1377612862 ] we thought it might. We developed an approach that combines neural network nodes across many denoising autoencoders via the similarity of their weight vectors. This turns out to increase the concordance of the model with known KEGG pathways (objective is not to predict pathways - but concordance with pathways seems like a reasonable proxy to compare the bio-relevance of models). It looks like this ensemble + aggregation procedure accomplishes this by capturing pathways that are otherwise best captured by models of a specific size (pathways captured by 10 node models differ from those captured by 1000 node models; but after aggregation this effect is mitigated - Figure EV3).
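For concreteness, here is a minimal sketch of one way to aggregate nodes across an ensemble by weight-vector similarity: pool every node's gene-weight vector, cluster the pooled vectors, and average each cluster into a consensus node. The variable names, the random stand-in weight matrices, and the specific clustering choice (k-means on standardized vectors, which groups by Pearson correlation) are my illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for an ensemble of trained denoising autoencoders:
# each model contributes an (n_nodes x n_genes) weight matrix.
# (Hypothetical random data; in practice these come from training.)
n_genes = 500
weight_matrices = [rng.normal(size=(50, n_genes)) for _ in range(10)]

# Pool every node's gene-weight vector across all models.
all_nodes = np.vstack(weight_matrices)  # (total_nodes x n_genes)

# Standardize each weight vector so Euclidean k-means groups nodes
# by Pearson correlation: for zero-mean, unit-norm vectors,
# ||a - b||^2 = 2 * (1 - corr(a, b)).
centered = all_nodes - all_nodes.mean(axis=1, keepdims=True)
normalized = centered / np.linalg.norm(centered, axis=1, keepdims=True)

# Cluster the pooled nodes, then average each cluster into a
# consensus node of the aggregated model.
k = 50  # size of the aggregated model
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalized)
consensus = np.vstack([all_nodes[labels == c].mean(axis=0) for c in range(k)])
print(consensus.shape)  # (k, n_genes)
```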
Signatures constructed by this approach are robust enough to support analyses across a public data compendium, and we carry out one such analysis. We focused on growth media because each experiment typically uses a single medium, while many different media are used across the compendium. Specifically, we define a simple 'interestingness' metric, the 'activation score' of a signature, which aims to identify signatures that are active or inactive in only a few media.
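The paper gives the exact definition of the activation score; below is a toy sketch of the general idea, scoring a signature by how extreme its mean activity in the most extreme medium is relative to the spread across media. The function name, the formula, and the made-up data are all hypothetical, not the paper's definition.

```python
import numpy as np

def activation_score(activity, media):
    """Hypothetical activation score: compute per-medium mean activity
    of a signature, then score by how far the most extreme medium sits
    from the across-media spread. (Illustrative definition only.)

    activity : (n_samples,) signature activity per sample
    media    : (n_samples,) medium label per sample
    """
    means = np.array([activity[media == m].mean() for m in np.unique(media)])
    z = (means - means.mean()) / means.std()
    return np.abs(z).max()  # high when one or a few media stand out

# Toy usage: one medium drives the signature's activity.
rng = np.random.default_rng(1)
media = np.repeat(["LB", "M9", "PIA", "low-Pi"], 20)
activity = rng.normal(size=80)
activity[media == "low-Pi"] += 3.0
print(round(activation_score(activity, media), 2))
```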
The signature with the highest activation score captured phosphate starvation. It also pointed to a new player in the process: KinB, a histidine kinase that acts particularly under certain conditions, specifically intermediate phosphate concentrations. We validated this prediction with experiments using in-frame kinB deletion mutants, and we confirmed its specificity with experiments using in-frame deletion mutants of other histidine kinases.
I found this work fun because it is not something we could have discovered with the standard experiments targeting phosphate starvation. We normally choose conditions that produce the largest effect (namely, very high or very low phosphate levels), which minimizes the chance of finding nothing. However, it also means we ignore any biology that happens at intermediate levels. By using a public data collection, in which the phosphate concentration varies unintentionally across experiments, we can actually see what biology happens in that range.
I think the main message of this paper relates more to the importance and utility of public data than to the method. Of course, the robust signatures are required, but the process of generating them via an ensemble of neural networks isn't something I'd consider classical deep learning. Maybe 'shallow learning'?
@cgreene Since you are an author, I don't think we need further discussion. Closing this issue.
http://doi.org/10.1101/078659
Related to #22. @cgreene can tell us more.