Great example of deep learning (a feed-forward neural network) significantly outperforming a simpler machine learning algorithm (linear regression) on an important task (predicting gene expression from an informative panel). Also demonstrates the ability of a model trained on microarray data to infer RNA-seq data.
Ideally, we should also configure D-GEX with 9520 units in the output layer corresponding to the 9520 target genes. However, each of our GPUs has only 6 GB of memory, thus we cannot configure hidden layers with sufficient number of hidden units if all the target genes are included in one output layer. Therefore, we randomly partitioned the 9520 target genes into two sets that each contains 4760 target genes. We then built two separate neural networks with each output layer corresponding to one half of the target genes.
This part was unfortunate. I wonder how much better they could have done without this artificial limitation.
Their code has pairs of scripts for training networks on the first half of the data and then the second half. It may not be too difficult to train on all of the genes if someone is feeling especially curious (not me).
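For anyone who does feel curious, here is a rough sketch of what a single network covering all 9,520 target genes might look like. This is not the authors' code (which is Theano-based); it assumes Keras, tanh hidden units, and the 10%-dropout, 9,000-unit, 3-hidden-layer configuration that the "GEX-10%-9000 × 3" name quoted later in the thread appears to denote:

```python
from tensorflow import keras

n_landmarks, n_targets = 978, 9520  # landmark inputs, target-gene outputs

# Three hidden layers of 9,000 tanh units with 10% dropout, mirroring the largest
# configuration the paper reports; one linear output per target gene.
model = keras.Sequential([
    keras.layers.Dense(9000, activation="tanh", input_shape=(n_landmarks,)),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(9000, activation="tanh"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(9000, activation="tanh"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(n_targets, activation="linear"),
])

# Mean squared error per target gene; the optimizer choice here is arbitrary.
model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="mse")
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val), batch_size=200, epochs=200)
```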
That is unfortunate - makes it even more impressive how much better their performance was than LR
Greetings, we discussed this paper in the 9/9 at 9 Greene Lab Journal Club (see slides). My take was that their deep learning model (D-GEX) did perform almost universally better than linear regression. However, both performed poorly:
deep learning reduced imputation error from 38% to 32% … still too much error for many expression applications. [Source: Tweet]
I originally was interested in this paper because of the poor imputation quality in LINCS L1000. While I think this paper makes the case that deep learning is better at imputation, I don't think it's good enough to salvage the imputed LINCS L1000 expression calls.
While I think this paper makes the case that deep learning is better at imputation, I don't think it's good enough to salvage the imputed LINCS L1000 expression calls.
This could be a nice theme for the review. Almost every paper will show that neural networks are better than baseline regression/classification techniques. But when are the improvements enough to make a practical difference in the domain?
@agitter Totally agree with that sentiment! That's what I really want to see. @dhimmel - would it be feasible to re-do your imputation quality analysis if the authors provided their new imputed data? It may be possible to request it from them.
The imputed LINCS data from the study is available as described in their Methods:
we have re-trained GEX-10%-9000 × 3 using all the 978 landmark genes and the 21 290 target genes from the GEO data and inferred the expression values of unmeasured target genes from the L1000 data. The full dataset consists of 1 328 098 expression profiles and can be downloaded at https://cbcl.ics.uci.edu/public_data/D-GEX/l1000_n1328098x22268.gctx. We hope this dataset will be of great interest to researchers who are currently querying the LINCS L1000 data.
Note that l1000_n1328098x22268.gctx is 110 GB. Whether I can re-do my analysis of LINCS L1000 with their imputation data depends on whether we're building off of the same raw LINCS L1000 data. I used a modzs.gctx file (learn more on figshare or https://github.com/dhimmel/lincs/issues/3). We will want to make sure that the only difference between the modzs.gctx I used and l1000_n1328098x22268.gctx is the imputation method.
Tagging the study authors @admiral-chen, @yil8, and @in4matx to see if they can provide more information regarding l1000_n1328098x22268.gctx and its relation to modzs.gctx.
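Before committing to the full re-analysis, one low-cost check is whether the two files even describe the same probes and profiles. A minimal sketch, assuming cmapPy is available and local copies of both files (the metadata-only flags avoid loading the 110 GB matrix):

```python
from cmapPy.pandasGEXpress.parse import parse

# Load only the row (probe) and column (profile) metadata from each file.
dgex_rows = parse("l1000_n1328098x22268.gctx", row_meta_only=True)
dgex_cols = parse("l1000_n1328098x22268.gctx", col_meta_only=True)
modzs_rows = parse("modzs.gctx", row_meta_only=True)
modzs_cols = parse("modzs.gctx", col_meta_only=True)

# If the only difference is the imputation method, the probe and profile
# identifiers should largely overlap.
print("shared probes:  ", len(set(dgex_rows.index) & set(modzs_rows.index)))
print("shared profiles:", len(set(dgex_cols.index) & set(modzs_cols.index)))
```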
This one garnered quite a bit of discussion so I won't close it at this point. @admiral-chen, @yil8, and @in4matx - would be nice to highlight your contribution - can you provide some quick info on whether or not the potential eval is feasible?
We ran a contest recently with a number of folks submitting their improvements to inference and they were scored against a benchmark. While the contest is over, several folks have asked for the benchmarks so that their ideas can be compared to the current best performer.
Let me know if you are interested in comparing your algorithm.
https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16753&pm=14337
Thanks for chiming in @in4matx! We don't have a solution. We were discussing this paper that you were a co-author on that used deep learning. In an evaluation of the previous imputation ( https://thinklab.com/discussion/assessing-the-imputation-quality-of-gene-expression-in-lincs-l1000/185 ), @dhimmel found that the imputed genes had a very different distribution than the directly measured genes in their knockdown/overexpression experiments.
What we're particularly interested in is whether or not the deep learning approach in your paper - which reduces imputation error - also affects these distributions.
Quoting @dhimmel - we need to know:
Whether I can re-do my analysis of LINCS L1000 with their imputation data depends on whether we're building off of the same raw LINCS L1000 data. I used a modzs.gctx file (learn more on figshare or dhimmel/lincs#3). We will want to make sure that the only difference between the modzs.gctx I used and l1000_n1328098x22268.gctx is the imputation method.
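If the identifier question can be resolved, the distribution comparison itself is straightforward to sketch. The snippet below is only illustrative and assumes placeholder variables (signature_ids for a manageable subset of profiles and landmark_probes for the 978 directly measured probe IDs); the original thinklab analysis looked specifically at the targeted genes in knockdown/overexpression signatures rather than at all genes:

```python
from scipy import stats
from cmapPy.pandasGEXpress.parse import parse

# Load a subset of profiles to keep memory manageable.
gctoo = parse("l1000_n1328098x22268.gctx", cid=signature_ids)
expr = gctoo.data_df  # rows = probes, columns = profiles

measured = expr.loc[expr.index.isin(landmark_probes)].values.ravel()
imputed = expr.loc[~expr.index.isin(landmark_probes)].values.ravel()

# Kolmogorov-Smirnov statistic as a crude summary of how different the
# measured and imputed value distributions are.
ks_stat, p_value = stats.ks_2samp(measured, imputed)
print(f"KS statistic: {ks_stat:.3f}")
```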
Oh, sorry, I didn't catch that.
The application of deep learning was an exploratory analysis applied to a very specific set of criteria. I'd ping Xiaohui Xie or his student for more on the methods.
But just so you are aware, that particular deep-learning-inferred dataset isn't used for the "bread and butter" CMap/LINCS analysis. For that we use a linear regression based inference, which the community recently improved with a knn-based approach.
So, I'm sorry, I don't know much about the relative distributions, and since the deep learning approach hasn't been as extensively vetted/looked at from the perspective of predicting knockdowns, I'd caution against drawing conclusions before comparing to the current linear regression based dataset.
FYI - we expect the improved knn-based algorithm and results will be released later this Fall.
aravind
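For readers unfamiliar with what a knn-based inference would look like, here is a minimal sketch of the general idea using scikit-learn. This is not the CMap team's implementation (which had not been released at the time of this exchange); it only illustrates predicting target-gene expression from the reference profiles whose landmark expression is most similar:

```python
from sklearn.neighbors import KNeighborsRegressor

# X_train: landmark-gene expression (profiles x 978) from a reference compendium
# Y_train: matching target-gene expression (profiles x n_targets)
# X_new:   landmark-gene expression of new L1000 profiles to impute

knn = KNeighborsRegressor(n_neighbors=10, weights="distance")  # k is arbitrary here
knn.fit(X_train, Y_train)
Y_new = knn.predict(X_new)  # each target gene predicted from the nearest reference profiles
```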
Hi all,
Thanks so much for following up on our work. If I remember correctly, the L1000 prediction was based on expression values of the 978 landmark genes from /data.lincscloud.org/l1000/level3/q2norm_n1328098x22268.gctx (the first 978 are the landmark genes), and the predicted values of the other ~21 K genes are all normalized to 0-mean, 1-std. I don't know too much about modzs.gctx.
As for the best imputation methods for L1000 data, honestly, I have also taken part in the contest @in4matx referred to (https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16753&pm=14337), and surprisingly, I only made it into the top 10 using neural networks. I am also interested to see what the best performer in the contest used. I am currently in China, and the internet connection is not very good. I will be back around the end of October and will come back to this issue then.
Thanks
Yi Li
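That per-gene normalization is worth verifying before any comparison with modzs.gctx, since the two files may be on different scales. A quick, illustrative check (sample_signature_ids and landmark_probes are placeholders, and the means and standard deviations will only be exactly 0 and 1 over the full set of profiles):

```python
from cmapPy.pandasGEXpress.parse import parse

# Load a manageable subset of profiles; the full matrix is ~110 GB.
gctoo = parse("l1000_n1328098x22268.gctx", cid=sample_signature_ids)
expr = gctoo.data_df

# Imputed (non-landmark) genes should be roughly zero-mean, unit-variance per gene.
imputed = expr.loc[~expr.index.isin(landmark_probes)]
print(imputed.mean(axis=1).describe())
print(imputed.std(axis=1).describe())
```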
This is very helpful! We won't push ahead with the evaluation given the caveats that you've raised, but we do look forward to the improved predictions.
Thanks for your time Aravind and Yi!
I tested out replacing modzs.gctx with l1000_n1328098x22268.gctx in our consensus signature pipeline (notebook). While l1000_n1328098x22268.gctx contained all of the probes we need, it contained different perturbagen identifiers.
Specifically, modzs.gctx signature IDs look like CPC005_VCAP_6H:BRD-A47494775-003-03-0:10, whereas l1000_n1328098x22268.gctx contained IDs like CPC005_VCAP_6H_X1_F1B3_DUO52HI53LO:K06.
Therefore, I'm unable to proceed unless we figure out a way to convert between perturbagen vocabularies.
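One possible starting point for such a conversion: both vocabularies appear to share a plate_cell_timepoint prefix (CPC005_VCAP_6H in the examples above). A hypothetical sketch of extracting that prefix; mapping individual columns to perturbagens would still require the LINCS instance/signature metadata, and none of this has been validated:

```python
import re

def plate_prefix(col_id):
    """Extract the shared PLATE_CELL_TIME prefix, e.g. 'CPC005_VCAP_6H'."""
    match = re.match(r"^([^_]+_[^_]+_\d+H)", col_id)
    return match.group(1) if match else None

print(plate_prefix("CPC005_VCAP_6H:BRD-A47494775-003-03-0:10"))  # modzs.gctx signature ID
print(plate_prefix("CPC005_VCAP_6H_X1_F1B3_DUO52HI53LO:K06"))    # l1000_n1328098x22268.gctx column ID
# Both print 'CPC005_VCAP_6H'; a full mapping from wells to perturbagens
# would still need the LINCS instance/signature metadata tables.
```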
https://doi.org/10.1093/bioinformatics/btw074