greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Partitioned learning of deep Boltzmann machines for SNP data #161

Open agitter opened 7 years ago

agitter commented 7 years ago

https://doi.org/10.1101/095638 (http://biorxiv.org/content/early/2016/12/20/095638)

Learning the joint distributions of measurements, and in particular identification of an appropriate low-dimensional manifold, has been found to be a powerful ingredient of deep learning approaches. Yet, such approaches have hardly been applied to single nucleotide polymorphism (SNP) data, probably due to the high number of features typically exceeding the number of studied individuals. After a brief overview of how deep Boltzmann machines (DBMs), a deep learning approach, can be adapted to SNP data in principle, we specifically present a way to alleviate the dimensionality problem by partitioned learning. We propose a sparse regression approach to coarsely screen the joint distribution of SNPs, followed by training several DBMs on SNP partitions that were identified by the screening. Aggregate features representing SNP patterns and the corresponding SNPs are extracted from the DBMs by a combination of statistical tests and sparse regression. In simulated case-control data, we show how this can uncover complex SNP patterns and augment results from univariate approaches, while maintaining type 1 error control. Time-to-event endpoints are considered in an application with acute myeloid leukemia patients, where SNP patterns are modeled after a pre-screening based on gene expression data. The proposed approach identified three SNPs that seem to jointly influence survival in a validation data set. This indicates the added value of jointly investigating SNPs compared to standard univariate analyses and makes partitioned learning of DBMs an interesting complementary approach when analyzing SNP data.
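For concreteness, the screen-partition-extract pipeline from the abstract could be sketched roughly as below. This is a hypothetical toy, not the authors' code: scikit-learn's single-layer `BernoulliRBM` stands in for a deep Boltzmann machine, simulated genotypes stand in for real SNP data, and the partition sizes and `alpha` are arbitrary choices for illustration.

```python
# Toy sketch of the partitioned-learning idea (NOT the paper's implementation):
# 1) sparse regression to coarsely screen SNPs,
# 2) train one Boltzmann-machine-style model per SNP partition,
# 3) collect hidden-unit activations as aggregate SNP-pattern features.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
n, p = 200, 1000                      # individuals, SNPs (p >> n, the usual regime)
X = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 minor-allele counts
y = (X[:, 0] + X[:, 1] > 2).astype(float)          # toy phenotype from two SNPs

# Step 1: sparse regression screen over all SNPs.
screen = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(screen.coef_)

# Step 2: split the screened SNPs into partitions; one model per partition.
partitions = np.array_split(selected, max(1, len(selected) // 10))
features = []
for part in partitions:
    if part.size == 0:
        continue
    rbm = BernoulliRBM(n_components=2, random_state=0)
    # Genotype counts rescaled into [0, 1] for the Bernoulli visible units.
    features.append(rbm.fit_transform(X[:, part] / 2.0))

# Step 3: stack hidden-unit activations as candidate SNP-pattern features,
# which would then feed into statistical tests / sparse regression.
H = np.hstack(features)
print(H.shape)  # (n individuals, total hidden units across partitions)
```

The paper trains multi-layer DBMs and extracts SNPs back out of the learned features via tests and sparse regression; the sketch only shows the overall screen-then-partition flow.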

cgreene commented 7 years ago

For what it's worth @brettbj tried something along these lines a few years ago. When looking at genome-wide data, the patterns captured looked more or less like ancestry patterns. It didn't seem to give much above and beyond PCA.
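(Editor's note: the ancestry-dominates-structure point is easy to reproduce on simulated data. The snippet below is a toy illustration, not @brettbj's actual experiment: when two populations differ in allele frequencies, the top principal component of the raw genotype matrix separates them almost perfectly, so an unsupervised model has to outdo this baseline to be interesting.)

```python
# Toy demonstration that population structure dominates unsupervised
# decompositions of genome-wide genotype data: PC1 recovers ancestry.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_per_pop, p = 100, 500
f1 = rng.uniform(0.1, 0.9, size=p)                       # allele freqs, pop 1
f2 = np.clip(f1 + rng.normal(0, 0.15, size=p), 0.05, 0.95)  # shifted, pop 2

# Genotypes as binomial(2, f) minor-allele counts per individual per SNP.
X = np.vstack([
    rng.binomial(2, f1, size=(n_per_pop, p)),
    rng.binomial(2, f2, size=(n_per_pop, p)),
]).astype(float)

pc1 = PCA(n_components=1).fit_transform(X).ravel()
labels = np.array([0] * n_per_pop + [1] * n_per_pop)
side = (pc1 > np.median(pc1)).astype(int)
# Fraction of individuals whose PC1 side matches their population.
agreement = max(np.mean(side == labels), np.mean(side != labels))
print(agreement)
```

With even a modest allele-frequency shift across a few hundred SNPs, PC1 splits the two populations nearly perfectly, which is the sense in which learned genome-wide features "look like ancestry."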

In this case, it looks like they get around this in the real data by a careful selection of subsets:

As described by Hieke et al. (2016b) gene expression measurements are available from a partially overlapping cohort. While in Hieke et al. (2016b) the focus had been on identifying gene expression features containing information not already conveyed by the SNP, the present idea is to use the gene expression information to reduce the number of SNPs that are considered for modeling. Specifically, we considered the SNPs mapped to the top seven genes, MAP7, TRIM37, SCAMP4, EXT2, AKT1S1 and MT3, identified by a stagewise regression approach from the gene expression data by Hieke et al. (2016b), resulting in a list of 70 SNPs for subsequent modeling by partitioned deep Boltzmann machines.

The paper is a bit light on comparisons. I expect that it would be feasible to model combinations of 70 SNPs with other approaches as well.
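One such baseline, sketched hypothetically below: with only 70 SNPs, an L1-penalized logistic regression over all pairwise interactions is tractable and directly models joint SNP effects. This uses simulated case-control data as a stand-in for the paper's survival endpoint, and every parameter here (sample size, penalty strength, effect size) is an arbitrary illustration.

```python
# Hypothetical baseline for ~70 SNPs: pairwise-interaction logistic regression
# with an L1 penalty, evaluated by cross-validated AUC. Simulated binary
# outcome stands in for the paper's time-to-event endpoint.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n, p = 300, 70
X = rng.integers(0, 3, size=(n, p)).astype(float)   # minor-allele counts
# Outcome driven by an interaction between two SNPs (a "joint" effect).
logit = 1.5 * (X[:, 0] * X[:, 1]) - 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(round(auc, 2))
```

A comparison like this (or a penalized Cox model for the actual survival endpoint) would make it easier to judge what the DBM features add over simpler joint models of the same 70 SNPs.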

agitter commented 7 years ago

We don't need to include it. Should we close the issue?

cgreene commented 7 years ago

Maybe see what @brettbj thinks. We were excited enough to kick the tires, so I guess we might want to comment on it. It could be a nice chance to raise the point that there are other approaches to find structure in data as well. Maybe this is an area where the extent of advances to date is unclear. I'm excited about the challenge of variants -> phenotype connection. I just don't see us achieving it at this stage via only structure discovery algorithms applied to variant data without intermediate levels of biology also captured somewhere in the model. I guess if I care enough to have this much discussion on it, I should probably write a little bit for the paper. We can always chop it out if it's not helpful.