Great example of deep learning (a feed-forward neural network) significantly outperforming a simpler machine learning algorithm (linear regression) on an important task (predicting gene expression from an informative panel). Also demonstrates the ability of a model trained on microarray data to infer RNA-seq data.
Ideally, we should also configure D-GEX with 9520 units in the output layer corresponding to the 9520 target genes. However, each of our GPUs has only 6 GB of memory, thus we cannot configure hidden layers with sufficient number of hidden units if all the target genes are included in one output layer. Therefore, we randomly partitioned the 9520 target genes into two sets that each contains 4760 target genes. We then built two separate neural networks with each output layer corresponding to one half of the target genes.
This part was unfortunate. I wonder how much better they could have done without this artificial limitation.
Their code has pairs of scripts for training networks on the first half of the data and then the second half. It may not be too difficult to train on all of the genes if someone is feeling especially curious (not me).
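For anyone who does feel curious, here is a rough sketch of what a single network covering all 9,520 target genes might look like. This is not the authors' code (which is Theano-based); it assumes Keras, tanh hidden units, and the 10%-dropout, 9,000-unit, 3-hidden-layer configuration that the "GEX-10%-9000 × 3" name quoted later in the thread appears to denote:

```python
from tensorflow import keras

n_landmarks, n_targets = 978, 9520  # landmark inputs, target-gene outputs

# Three hidden layers of 9,000 tanh units with 10% dropout, mirroring the largest
# configuration the paper reports; one linear output per target gene.
model = keras.Sequential([
    keras.layers.Dense(9000, activation="tanh", input_shape=(n_landmarks,)),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(9000, activation="tanh"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(9000, activation="tanh"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(n_targets, activation="linear"),
])

# Mean squared error per target gene; the optimizer choice here is arbitrary.
model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="mse")
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val), batch_size=200, epochs=200)
```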
That is unfortunate - makes it even more impressive how much better their performance was than LR
Greetings, we discussed this paper in the 9/9 at 9 Greene Lab Journal Club (see slides). My take was that their deep learning model (D-GEX) did perform almost universally better than linear regression. However, both performed poorly:
deep learning reduced imputation error from 38% to 32% … still too much error for many expression applications. [Source: Tweet]
I originally was interested in this paper because of the poor imputation quality in LINCS L1000. While I think this paper makes the case that deep learning is better at imputation, I don't think it's good enough to salvage the imputed LINCS L1000 expression calls.
While I think this paper makes the case that deep learning is better at imputation, I don't think it's good enough to salvage the imputed LINCS L1000 expression calls.
This could be a nice theme for the review. Almost every paper will show that neural networks are better than baseline regression/classification techniques. But when are the improvements enough to make a practical difference in the domain?
@agitter Totally agree with that sentiment! That's what I really want to see. @dhimmel - would it be feasible to re-do your imputation quality analysis if the authors provided their new imputed data? It may be possible to request it from them.
The imputed LINCS data from the study is available as described in their Methods:
we have re-trained GEX-10%-9000 × 3 using all the 978 landmark genes and the 21 290 target genes from the GEO data and inferred the expression values of unmeasured target genes from the L1000 data. The full dataset consists of 1 328 098 expression profiles and can be downloaded at https://cbcl.ics.uci.edu/public_data/D-GEX/l1000_n1328098x22268.gctx. We hope this dataset will be of great interest to researchers who are currently querying the LINCS L1000 data.
Note that l1000_n1328098x22268.gctx is 110 GB. Whether I can re-do my analysis of LINCS L1000 with their imputation data depends on whether we're building off of the same raw LINCS L1000 data. I used a modzs.gctx file (learn more on figshare or https://github.com/dhimmel/lincs/issues/3). We will want to make sure that the only difference between the modzs.gctx I used and l1000_n1328098x22268.gctx is the imputation method.
Tagging the study authors @admiral-chen, @yil8, and @in4matx to see if they can provide more information regarding l1000_n1328098x22268.gctx and its relation to modzs.gctx.
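Before committing to the full re-analysis, one low-cost check is whether the two files even describe the same probes and profiles. A minimal sketch, assuming cmapPy is available and local copies of both files (the metadata-only flags avoid loading the 110 GB matrix):

```python
from cmapPy.pandasGEXpress.parse import parse

# Load only the row (probe) and column (profile) metadata from each file.
dgex_rows = parse("l1000_n1328098x22268.gctx", row_meta_only=True)
dgex_cols = parse("l1000_n1328098x22268.gctx", col_meta_only=True)
modzs_rows = parse("modzs.gctx", row_meta_only=True)
modzs_cols = parse("modzs.gctx", col_meta_only=True)

# If the only difference is the imputation method, the probe and profile
# identifiers should largely overlap.
print("shared probes:  ", len(set(dgex_rows.index) & set(modzs_rows.index)))
print("shared profiles:", len(set(dgex_cols.index) & set(modzs_cols.index)))
```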
This one garnered quite a bit of discussion so I won't close it at this point. @admiral-chen, @yil8, and @in4matx - would be nice to highlight your contribution - can you provide some quick info on whether or not the potential eval is feasible?
We ran a contest recently with a number of folks submitting their improvements to inference and they were scored against a benchmark. While the contest is over, several folks have asked for the benchmarks so that their ideas can be compared to the current best performer.
Let me know if you are interested in comparing your algorithm.
https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16753&pm=14337
Thanks for chiming in @in4matx! We don't have a solution. We were discussing this paper that you were a co-author on that used deep learning. In an evaluation of the previous imputation ( https://thinklab.com/discussion/assessing-the-imputation-quality-of-gene-expression-in-lincs-l1000/185 ), @dhimmel found that the imputed genes had a very different distribution than the directly measured genes in their knockdown/overexpression experiments.
What we're particularly interested in is whether or not the deep learning approach in your paper - which reduces imputation error - also affects these distributions.
Quoting @dhimmel - we need to know:
Whether I can re-do my analysis of LINCS L1000 with their imputation data depends on whether we're building off of the same raw LINCS L1000 data. I used a modzs.gctx file (learn more on figshare or dhimmel/lincs#3). We will want to make sure that the only difference between the modzs.gctx I used and l1000_n1328098x22268.gctx is the imputation method.
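If the identifier question can be resolved, the distribution comparison itself is straightforward to sketch. The snippet below is only illustrative and assumes placeholder variables (signature_ids for a manageable subset of profiles and landmark_probes for the 978 directly measured probe IDs); the original thinklab analysis looked specifically at the targeted genes in knockdown/overexpression signatures rather than at all genes:

```python
from scipy import stats
from cmapPy.pandasGEXpress.parse import parse

# Load a subset of profiles to keep memory manageable.
gctoo = parse("l1000_n1328098x22268.gctx", cid=signature_ids)
expr = gctoo.data_df  # rows = probes, columns = profiles

measured = expr.loc[expr.index.isin(landmark_probes)].values.ravel()
imputed = expr.loc[~expr.index.isin(landmark_probes)].values.ravel()

# Kolmogorov-Smirnov statistic as a crude summary of how different the
# measured and imputed value distributions are.
ks_stat, p_value = stats.ks_2samp(measured, imputed)
print(f"KS statistic: {ks_stat:.3f}")
```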
Oh, sorry, I didn't catch that.
The application of deep learning was an exploratory analysis applied to a very specific set of criteria. I'd ping Xiaohui Xie or his student for more on the methods.
But just so you are aware, that particular deep-learning-inferred dataset isn't used for the "bread and butter" CMap/LINCS analysis. For that we use a linear regression based inference, which the community recently improved with a knn-based approach.
So, I'm sorry, I don't know much about the relative distributions, and since the deep learning approach hasn't been as extensively vetted/looked at from the perspective of predicting knockdowns, I'd caution against drawing conclusions before comparing to the current linear regression based dataset.
FYI - we expect the improved knn-based algorithm and results will be released later this Fall.
aravind
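For readers unfamiliar with what a knn-based inference would look like, here is a minimal sketch of the general idea using scikit-learn. This is not the CMap team's implementation (which had not been released at the time of this exchange); it only illustrates predicting target-gene expression from the reference profiles whose landmark expression is most similar:

```python
from sklearn.neighbors import KNeighborsRegressor

# X_train: landmark-gene expression (profiles x 978) from a reference compendium
# Y_train: matching target-gene expression (profiles x n_targets)
# X_new:   landmark-gene expression of new L1000 profiles to impute

knn = KNeighborsRegressor(n_neighbors=10, weights="distance")  # k is arbitrary here
knn.fit(X_train, Y_train)
Y_new = knn.predict(X_new)  # each target gene predicted from the nearest reference profiles
```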
Hi all,
Thanks so much for following up on our work. If I remember correctly, the L1000 prediction was based on expression values of the 978 landmark genes from /data.lincscloud.org/l1000/level3/q2norm_n1328098x22268.gctx (the first 978 are the landmark genes), and the predicted values of the other ~21 K genes are all normalized to 0-mean, 1-std. I don't know too much about modzs.gctx.
As for the best imputation methods for L1000 data, honestly, I have also taken part in the contest @in4matx referred to (https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16753&pm=14337), and surprisingly, I only made it into the top 10 using neural networks. I am also interested to see what the best performer in the contest used. I am currently in China, and the internet connection is not very good. I will be back around the end of October and will come back to this issue then.
Thanks
Yi Li
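That per-gene normalization is worth verifying before any comparison with modzs.gctx, since the two files may be on different scales. A quick, illustrative check (sample_signature_ids and landmark_probes are placeholders, and the means and standard deviations will only be exactly 0 and 1 over the full set of profiles):

```python
from cmapPy.pandasGEXpress.parse import parse

# Load a manageable subset of profiles; the full matrix is ~110 GB.
gctoo = parse("l1000_n1328098x22268.gctx", cid=sample_signature_ids)
expr = gctoo.data_df

# Imputed (non-landmark) genes should be roughly zero-mean, unit-variance per gene.
imputed = expr.loc[~expr.index.isin(landmark_probes)]
print(imputed.mean(axis=1).describe())
print(imputed.std(axis=1).describe())
```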
This is very helpful! We won't push ahead with the evaluation given the caveats that you've raised, but we do look forward to the improved predictions.
Thanks for your time Aravind and Yi!
I tested out replacing modzs.gctx with l1000_n1328098x22268.gctx in our consensus signature pipeline (notebook). While l1000_n1328098x22268.gctx contained all of the probes we need, it contained different perturbagen identifiers.
Specifically, modzs.gctx signature IDs look like CPC005_VCAP_6H:BRD-A47494775-003-03-0:10, whereas l1000_n1328098x22268.gctx contained IDs like CPC005_VCAP_6H_X1_F1B3_DUO52HI53LO:K06.
Therefore, I'm unable to proceed unless we figure out a way to convert between perturbagen vocabularies.
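One possible starting point for such a conversion: both vocabularies appear to share a plate_cell_timepoint prefix (CPC005_VCAP_6H in the examples above). A hypothetical sketch of extracting that prefix; mapping individual columns to perturbagens would still require the LINCS instance/signature metadata, and none of this has been validated:

```python
import re

def plate_prefix(col_id):
    """Extract the shared PLATE_CELL_TIME prefix, e.g. 'CPC005_VCAP_6H'."""
    match = re.match(r"^([^_]+_[^_]+_\d+H)", col_id)
    return match.group(1) if match else None

print(plate_prefix("CPC005_VCAP_6H:BRD-A47494775-003-03-0:10"))  # modzs.gctx signature ID
print(plate_prefix("CPC005_VCAP_6H_X1_F1B3_DUO52HI53LO:K06"))    # l1000_n1328098x22268.gctx column ID
# Both print 'CPC005_VCAP_6H'; a full mapping from wells to perturbagens
# would still need the LINCS instance/signature metadata tables.
```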
https://doi.org/10.1093/bioinformatics/btw074