agitter opened this issue 7 years ago

Published: https://doi.org/10.1093/nar/gkx681
Preprint: https://doi.org/10.1101/129759

This could fit in the single cell section @bdo311 wrote. It may also be of general interest to @gwaygenomics and @cgreene. It is from my former PhD lab, so I'm going to refrain from commenting.
Thanks for pointing this one out, @agitter; it is definitely of great interest! I will summarize some of my thoughts here:
Thanks for the thorough overview, @gwaygenomics. I agree that normalization is critical here.
Do you think you could add a line or two on this paper to the single cell section? @bdo311 is busy with the variant calling and miRNA sections.
It's an interesting idea, but I think the biggest drawback of such an approach is that you can only predict labels of cell types you have seen before. We barely have any good, comprehensive reference data of diverse pure cell "types" (maybe we will after the Human Cell Atlas is done); we can barely even define what a cell type is. What if the cell type you are trying to label is not in the reference training set? What does the neural net assign it to? Does it just assign it to one of the reference classes? Can it abstain? Is there an "Other" category? It doesn't seem like this particular formulation does that. When would we know that the predictions are actually reliable?

I also have serious concerns about batch effects, confounders due to different platforms, data quality, etc. There is no mention of these anywhere in the paper. These things plague bulk data as is; it's even worse for scRNA-seq. TPMs are not an effective cross-sample normalization, especially across very diverse cell types.

Finally, some of the neural net architecture choices are a bit fishy. tanh? When is the last time we saw tanh networks outperforming ReLUs and their variants? I find this highly suspect. The imputation strategy is also extremely naive.

Overall, a nice bunch of ideas (especially the idea of using neural net embeddings for making efficient queries), but IMHO too preliminary to add to a review showcasing the potential impact of deep learning in biology/healthcare. Apologies in advance to the authors if I misunderstood something. Would be happy to be schooled.
First, I apologize for being so late to post; I was only made aware of this discussion today. I also want to thank everyone for their comments, ideas, and suggestions. We are working on a new version and will try to address some of these issues. However, I believe some of the comments that were raised are based on a misunderstanding of the goals and methods we used.

I think the major point that Anshul brought up, which I completely agree with, is the fact that there is not enough training data to determine cell type for all cells. So he asked what the method would do if a new cell type was used which is not in the training set. We actually discuss this; in fact, this is the main goal of the retrieval part. Most of the cell types in the retrieval analysis were not used to train the network, and so the network never had their label. However, classification is not the goal here. The goal is dimensionality reduction. Specifically, we learn a network using the labeled data, but after we learn it we completely ignore the output layer for all applications and focus on the second-to-last layer, which represents a reduced-dimension vector of the input cell. This vector is then used in a k-nearest-neighbor (kNN) procedure for comparison to other cell types in the database. It can also be used for other things, such as clustering, and the main point of the paper is that in order to derive such a reduced-dimension vector we can use a supervised framework even if we do not have training data for all cell types. The NN is able to find gene combinations that are more globally relevant and can represent cell types that are not used in the training as well.

As for batch effects, indeed this is a major issue. We controlled for that in the retrieval by performing experiments in which we held out complete datasets. In other words, all comparisons are between profiles from different batches, and so matches are not affected by such effects. Had we ignored this, the performance would have been much better.

As for TPM and tanh, we did try other activations and normalization factors; these seemed to work best, but I agree that other methods should be more thoroughly explored here. We are working on this and hopefully can release a more comprehensive tool in the next few months.
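To make the train-then-discard-the-output-layer workflow concrete, here is a minimal sketch in Keras. This is not the authors' code; all layer sizes, activations, and names are illustrative assumptions:

```python
from tensorflow import keras
from sklearn.neighbors import NearestNeighbors

n_genes, n_train_cell_types, embed_dim = 20000, 16, 100  # illustrative sizes

# Supervised classifier over the training cell types.
inputs = keras.Input(shape=(n_genes,))
hidden = keras.layers.Dense(500, activation="tanh")(inputs)
embedding = keras.layers.Dense(embed_dim, activation="tanh", name="embedding")(hidden)
outputs = keras.layers.Dense(n_train_cell_types, activation="softmax")(embedding)

classifier = keras.Model(inputs, outputs)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# classifier.fit(X_train, y_train, ...)  # cells x genes matrix, integer cell-type labels

# After training, ignore the output layer: the penultimate layer is the
# reduced-dimension representation, usable even for unseen cell types.
embedder = keras.Model(inputs, embedding)

def retrieve(query_cells, database_cells, k=5):
    """kNN retrieval of the database cells nearest to each query cell."""
    nn = NearestNeighbors(n_neighbors=k).fit(embedder.predict(database_cells))
    return nn.kneighbors(embedder.predict(query_cells))  # (distances, indices)
```

The key point the sketch illustrates is that `embedder` can map any expression profile into the learned space, whether or not its cell type appeared among the training labels.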
Hi @zivbj, great to see you commenting here. I updated the link above to include the published NAR version. bioRxiv missed it due to the slight title change.
I read the preprint but not the final version yet. From what you describe above with the emphasis on representation learning for single cell expression data, this could make a good parallel to the imaging sections of the review. We have many examples of groups taking convolutional neural networks (e.g. Inception architecture) trained on labeled natural images (e.g. ImageNet) and using them to transform microscopy or medical images into a meaningful feature space (the last hidden layer). See #129 for an example. I'll check the paper again to see if you already made that analogy.
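For concreteness, a hedged sketch of that analogy, assuming an ImageNet-pretrained InceptionV3 from Keras used as a fixed feature extractor, with the globally pooled last hidden layer as the feature space:

```python
from tensorflow import keras

# include_top=False drops the ImageNet classification layer;
# pooling="avg" returns one 2048-d feature vector per image.
feature_extractor = keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg"
)

def embed_images(images):
    """images: array of shape (n, 299, 299, 3); returns (n, 2048) features."""
    x = keras.applications.inception_v3.preprocess_input(images.astype("float32"))
    return feature_extractor.predict(x)
```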
Making sure @cgreene sees this as well given his interest in the area.
Hi,
@agitter @zivbj @akundaje I also wanted to mention that the approach we take of training the network to learn a representation for cells, by doing supervised training on a subset of cell types, is similar to what has been done in the vision community (e.g., DeepFace -- https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf), where they learn representations for faces by training their network in a supervised fashion on a subset of people. So they take faces (which would correspond to individual cells in our setting) and predict which person each face belongs to (the cell type), and then take the last layer before the classification layer as the representation, which they find generalizes well to faces of new people.
--Sid
@zivbj and @tmfs10 I created a pull request to add your paper to the review. We can continue discussing it in #648.
Would you consider uploading your code to GitHub or a different code repository to make it more discoverable? I found the Google Drive link, but it was hard to browse the zip file.
@zivbj @tmfs10 Ah, I see. Thanks for the clarifications. But I am still scratching my head about some parts; see below. Am I still misunderstanding how the retrieval tool is supposed to be used?
@zivbj @tmfs10 Thinking about this a bit more, I recollect my primary concern. I agree with the idea of learning a supervised embedding as being useful. However, a supervised embedding is biased in that it can certainly generalize, but likely only to cell types in the neighborhood of the training cell types. I'm still concerned, and I don't believe the paper shows this, about how dependent the embedding and kNN results are on the diversity of training labels and on the existence of a cell type in your training set that is related (not necessarily identical) to your test sample.

In your retrieval analyses, the test cell types have similar cell types in the training data. E.g., the paper states: "To test various ways of querying single cell expression data we held out complete datasets for which we had a similar dataset in the database from a different lab/paper." So what would happen if this were not the case? E.g., if you have never seen liver cell types, or cell types related to them, among your training cell types/database, can we expect the embedding to correctly separate liver cell types or subpopulations within them? Seems unlikely, if not impossible. Further, if a user queried the database with liver cells, the database would retrieve some incorrect cell type as the likely nearest neighbor, since it simply hasn't seen liver or anything close to it.

This is what I meant when I talked about abstaining. When and how do you decide whether your retrieval is reliable? A distance cutoff? There needs to be some way to tell the user: your query cell type is too far away from all our reference cell types, so we can't say what it is.
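To illustrate what such an abstaining rule could look like, a minimal sketch; the quantile-based calibration and all names here are assumptions for illustration, not anything from the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def calibrate_threshold(reference_embeddings, quantile=0.99):
    """Distance beyond which a query is declared out-of-reference,
    calibrated as a high quantile of within-reference nearest-neighbor
    distances (an illustrative choice)."""
    nn = NearestNeighbors(n_neighbors=2).fit(reference_embeddings)
    dists, _ = nn.kneighbors(reference_embeddings)
    return np.quantile(dists[:, 1], quantile)  # column 0 is the point itself

def retrieve_or_abstain(query_embeddings, reference_embeddings, labels, threshold):
    """Return the nearest reference label, or 'unknown' when the query is
    farther than the calibrated threshold from every reference cell."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference_embeddings)
    dists, idx = nn.kneighbors(query_embeddings)
    return [labels[i] if d <= threshold else "unknown"
            for d, i in zip(dists[:, 0], idx[:, 0])]
```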
In summary, given a large diversity of reference cell types, I can see how this would be great (like the faces study that Sid pointed to). But given the very limited coverage of cell types in the reference dataset and the lack of an abstaining strategy during retrieval, I'm concerned about how this would be used effectively.

Are these concerns valid? Or am I still missing something? Sorry if I have misunderstood.

Btw, I hope I am not coming off as adversarial. I'm just trying to understand whether the practical issues I raised are legitimate or whether I am not getting something. I like the idea a lot. I feel it's ahead of its time, i.e., more like a proof-of-concept pilot that would be extremely useful once we have a much larger diversity of reference cell types to learn supervised embeddings from. Is this the right way to think about it?
@akundaje I'm still thinking this through, and I'm sure @zivbj and @tmfs10 will be able to clarify further. I see two major components to this: one is how good the low-dimensional space is; the other is what one should do with novel cell types during the retrieval phase.
Assume for the moment we have a perfect low-dimensional representation of single-cell data. Then I'm not very concerned about cell types that didn't appear during training. Distances in that space would be meaningful, so one could use distance-threshold heuristics or something more creative to decide the cell type isn't recognized, as you suggested above. The low-dimensional representation would also be powerful for a variety of unsupervised tasks, subpopulation discovery, etc. With pretrained image CNNs, where we arguably have a strong latent feature space, it is quite surprising (to me at least) how well one can train a simple classifier given tens of examples from image types that are completely different from those in the training data.
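As a sketch of that few-shot observation, assuming a fixed 2048-d feature space like the InceptionV3 one above (placeholder random data stands in for real extracted features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical placeholders: ~30 labeled examples of 3 novel image classes,
# already mapped into a pretrained CNN's feature space.
rng = np.random.default_rng(0)
few_shot_features = rng.normal(size=(30, 2048))
few_shot_labels = rng.integers(0, 3, size=30)

# With a strong fixed feature space, a simple linear classifier on tens
# of examples is often enough.
clf = LogisticRegression(max_iter=1000).fit(few_shot_features, few_shot_labels)
predictions = clf.predict(rng.normal(size=(5, 2048)))  # new-image features
```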
The more challenging question is whether a supervised neural network is able to learn such a space from a finite set of cell types. Given the data available to date, the results here provide evidence that their latent space is useful for practical tasks. As the Human Cell Atlas ramps up and others generate even more single-cell data, we'll be able to better assess whether this particular latent space remains useful on different cell types, and the feasibility of learning one universal low-dimensional space. My impression is that it will be harder than learning such a space for images, and that representation learning will have room to improve substantially as algorithms evolve and more cell types and conditions are profiled. In the meantime, I like the approach and the parallels with what people have been doing with transfer learning on images.
I'd also like to dig into #639 and #647 to help solidify my thoughts on latent spaces for expression data.
@agitter Yes, that reflects my concerns exactly. I am absolutely on board with learning a useful latent space representation. I just don't see how one can learn a practically generalizable one with a supervised formulation that uses a very limited diversity of training cell types. Such a space (learned using the approach in this paper) could certainly interpolate between these cell types. But I just don't see any proof or justification for how it could extrapolate to completely different cell types.
I think an interesting variant of this approach could be to leverage the much larger diversity of bulk RNA-seq datasets, which cover a much wider span of cell types and tissues, albeit impure ones. Subsample these (simulating dropout noise) to create virtual scRNA-seq samples. Learn a supervised embedding that can interpolate across this immense diversity; you could even fine-tune or co-train with real scRNA-seq data. Such an embedding could be diverse enough to be used for various tasks, including mapping scRNA-seq samples onto this space. This scATAC-seq paper http://www.biorxiv.org/content/early/2017/02/21/109843.1 uses such an approach: the latent space is learned on bulk ATAC-seq data and the scATAC-seq samples are projected onto it. A similar idea could be used on a much larger scale, using bulk RNA-seq data to define the latent space and then projecting scRNA-seq samples onto it for retrieval, clustering, etc.
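A rough sketch of the "virtual scRNA-seq" simulation step, with made-up depth and dropout parameters (not taken from any paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_single_cell(bulk_counts, depth=5000, extra_dropout=0.3):
    """Thin one bulk RNA-seq count vector down to single-cell scale.
    `depth` and `extra_dropout` are illustrative assumptions."""
    probs = bulk_counts / bulk_counts.sum()
    cell = rng.multinomial(depth, probs)            # downsample to low depth
    keep = rng.random(cell.shape) > extra_dropout   # extra gene-level dropout
    return cell * keep
```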
Let me start by saying that I am really happy to read this discussion and these comments. As for the comments themselves, I partially agree with the point about the need for a large and diverse training set, and indeed we are now collecting more data (which is rapidly accumulating) for exactly this goal. I am not so sure about using bulk; see below. I still think that even with a restricted set of cell types you can obtain a pretty general (though definitely not fully comprehensive) representation.
I would summarize my thoughts on this as follows. Assume there are a total of n possible pathways (gene combinations) that can be active across cell types. n can be very large, and most cell types only use a fraction of n; of those they use, they use some pathways much more than others. Still, if this is the case, we can characterize different cells by 'how much' they use each of the n pathways. So there are two issues in characterizing a cell: (1) find all n pathways, and (2) assign pathway weights to each cell type (possibly a very sparse vector). The supervised NN we learn is aimed at doing (1), while the retrieval/clustering applications use (2).

The reason we believe we may be able to obtain (1) even from a partial set of cell types is that even if a cell is not fully utilizing a pathway, the pathway can still be partially activated. For example, cell cycle is clearly more active in stem cells than in adult lung cells, but some of the cells in the latter group are still proliferating, and so even if we only use lung cells the NN may be able to represent cell cycle using some nodes in the (second-to-last) layer. The weight of these nodes may be low for lung, but if we now run stem cells through the model (even if they have not been used for learning it), the weight would be much higher, leading to a new representation that we have not seen before, even though it is based on a supervised learning model. Such a representation can definitely be used to cluster cells (in which case stem cells would cluster differently than lung cells) and for retrieval. We actually show that this works (albeit for a small set of cell types): all the clustering results presented in the paper are performed on cell types that are not used to learn the NN.
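A small sketch of that clustering use case, with placeholder embeddings standing in for penultimate-layer vectors of held-out cell types (all names and sizes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical: embeddings for cells whose types were never among the
# training labels, e.g. produced by the `embedder` sketched earlier.
rng = np.random.default_rng(0)
held_out_embeddings = rng.normal(size=(200, 100))

# Cluster directly in the learned space; partially shared "pathway"
# nodes are what would make these clusters meaningful for unseen types.
clusters = KMeans(n_clusters=8, n_init=10).fit_predict(held_out_embeddings)
```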
As for bulk, in general I agree that it can help, though there are several caveats that may make it problematic. First, most bulk data is related to some sort of perturbation and so may not reflect wild-type (WT) cells. More importantly, bulk data can mask the activity of individual pathways, or show activation of pathways that cannot be co-activated in a single cell, which would lead to nodes that never reflect a real pathway subset for a single cell and so would not be useful for characterizing any type of single-cell data. I agree that if we did not have any scRNA-seq data, bulk would be OK, but since we do, I think we should mainly rely on that.
@zivbj thanks for the clarifications. We are on the same page now.
@agitter Thanks for your interest in the paper! However, it's the beginning of the semester right now, and I'll need some time to clean up the code for GitHub. I'll notify you once it's done, but it might take some time; I plan to upload it in the next few weeks.
Thanks @jessica1338. I'm not waiting on it for anything, so please do it at your convenience.
I'm looking at PR #648, which we need to merge to move forward and get this paper into the review. @zivbj / @akundaje / @tmfs10 / @jessica1338 and others interested in the discussion: could you get your feedback (if any) onto that PR in the next 24 hours?
Thanks!